Expoint – all jobs in one place

PayPal Staff Site Reliability Engineer 
United States, California, San Jose 
212570800

02.09.2025


This role identifies issues and recommends best practices to enhance system reliability. The engineer leads functional projects, analyzes business trends, and contributes to process improvements while providing guidance to junior engineers.

Essential Responsibilities:

  • Manage and deliver large-scale reliability improvement projects, ensuring systems are performant, available, and resilient.
  • Drive the identification of performance bottlenecks and lead initiatives to optimize and scale critical systems and services.
  • Architect and implement scalable infrastructure solutions to support growing user demands while maintaining system reliability.
  • Lead the design and enhancement of monitoring frameworks, ensuring systems are highly observable, and support the response to production incidents.
  • Take ownership of improving system resilience by designing fault-tolerant architectures and implementing disaster recovery strategies.
  • Lead capacity planning initiatives to ensure system resources are proactively managed, preventing downtime or performance degradation under high load.
  • Work closely with development, operations, and other technical teams to ensure seamless system integration and align on best practices for reliability.
  • Act as a technical mentor within the organization, guiding teams through complex reliability challenges and promoting a culture of excellence.
  • Help define and execute long-term reliability engineering strategies and standards to ensure the scalability and performance of core services.
  • Develop and enforce best practices for operational excellence, including automation, incident management, and system monitoring, across engineering teams.

Minimum Qualifications:

  • Minimum of 8 years of relevant work experience and a Bachelor's degree or equivalent experience.

Preferred Qualifications:

  • 8+ years in Cloud Infrastructure, Site Reliability Engineering (SRE), DevOps Engineering, or related fields.
  • B.S. or M.S. degree in Computer Science, Engineering, or a related technical field; equivalent experience may be considered in lieu of a degree.
  • 4+ years of hands-on experience deploying, managing, and optimizing containerized applications using GKE and Harness in public and private cloud environments (AWS, GCP, Azure, etc.), preferably Google Cloud Platform (GCP).
  • 4+ years of hands-on experience with Infrastructure-as-Code (Terraform, CloudFormation), CI/CD pipelines (CircleCI, Harness, Jenkins, ArgoCD), and experience in Node, Python, or Go.
  • Strong understanding of using Google Cloud Logging, DataDog, or other monitoring and observability tools.
  • Ability to effectively diagnose and resolve performance bottlenecks within GCP at the infrastructure and application layers.
  • Strong leadership abilities; must have customer focus and commitment to quality.
  • Must have great interpersonal skills and solid written and verbal communication skills.
  • Ability to remain composed, methodical, and think fast in a high-pressure environment.
  • Experience in managing, collaborating, and influencing global teams.
  • Must be organized, detail-oriented, and able to manage multiple tasks simultaneously with the ability to appropriately prioritize.

Your day to day:

  • Own and enhance the reliability of services deployed across various cloud regions. You will proactively monitor, automate, and scale services to ensure seamless uptime and performance with an eye on cost.
  • Foster and advocate for a DevOps culture that emphasizes automation, self-service, and engineering excellence. Enable development teams to manage and deploy applications seamlessly with minimal intervention.
  • Lead the containerization, deployment, and scaling of microservices and data pipelines on Google Kubernetes Engine (GKE), with a strong emphasis on reliability and fault tolerance.
  • Set up and manage cloud infrastructure using Terraform, enabling automated, repeatable provisioning and management.
  • Continuously enhance and automate alerting, incident detection, and recovery mechanisms for critical applications and services to minimize downtime and improve system reliability.
  • Participate in an on-call rotation to meet business SLAs, quickly troubleshoot and resolve issues, and document runbooks for consistent incident response processes.
  • Work closely with Product Owners, Engineering Managers, and cross-functional teams in Agile Scrum and Kanban workflows to deliver iterative improvements and meet evolving business needs.
  • Perform impact analysis during incidents, collaborate with teams for root cause analysis, and implement preventive measures to avoid recurrence.
  • Champion a service-first mindset while supporting engineering teams, swiftly addressing their needs and clearing blockers to help them maintain development velocity on a weekly basis.

The total compensation for this practice may include an annual performance bonus (or other incentive compensation, as applicable), equity, and medical, dental, vision, and other benefits.

The US national annual pay range for this role is $137,500 to $236,500.


Any general requests for consideration of your skills, please