Expoint – all jobs in one place

Apple Site Reliability Engineering Lead SAP Global Systems 
India, Telangana, Hyderabad 
426221165

03.08.2025
RESPONSIBILITIES:
- Build up, lead, and improve existing processes to provide 24x7 operational response for applications on public cloud platforms.
- Maintain services once they are live by setting up monitoring and alerting and by measuring availability, latency, and overall system health.
- Own and review work for accuracy, quality, application performance, and completeness.
- Review release readiness through activities such as system design consulting, observability and monitoring reviews, capacity planning, and launch reviews.
- Understand the core principles of DevSecOps.
- Partner with architects and engineers to design and implement automation, operations, and support solutions.
- Partner management.
- Design and implement end-to-end observability frameworks using tools such as Prometheus, Grafana, CloudWatch, ELK/EFK, and OpenTelemetry, ensuring service reliability through dashboard design, SLOs/SLIs, and alerting systems.
  • 8 to 14 years of experience, with a track record of building and leading cloud-native SRE and operations on AWS or GCP hyperscalers.
  • Solid experience supporting customer-facing applications in a 24x7 uptime environment of distributed systems.
  • Bachelor's degree or equivalent experience in Computer Science, Engineering or other relevant major.
  • Collaborate with security, development, and infrastructure teams to implement a Zero Trust Architecture, handle secrets securely, and establish secure CI/CD pipelines.
  • Expertise in SRE principles, production-scale system design, and DevOps practices.
  • Design and architect solutions for multi-cloud environments and on-prem systems.
  • Solid understanding of core cloud services such as IAM, EC2/GCE, RDS/CloudSQL, EKS/GKE, CloudWatch/Cloud Monitoring, S3/GCS, etc.
  • Understand complex landscape architectures, with working knowledge of on-prem and cloud-based hybrid architectures and infrastructure concepts such as Regions, Availability Zones, VPCs/subnets, load balancers, and API gateways.
  • Good understanding of common authentication schemes, certificates, secrets and protocols.
  • Implement infrastructure-as-code practices using tools such as Terraform, Helm, or Pulumi.
  • Scripting and/or coding skills for automation, triaging, and troubleshooting. Experience with any of these languages: Python, Go, Java, etc.
  • Experience planning and designing disaster recovery for BCP and non-BCP applications.
  • Core knowledge of standard security and governance processes.
  • Expertise in handling production incidents, with experience working towards resolution and communicating with collaborators during incidents.
  • Track record of improving service reliability and efficiency.
  • Ability to implement and coordinate telemetry using monitoring and observability tools.
  • Adept at prioritizing multiple issues in a high-stress environment. Good experience in designing and improving response processes.
  • Mentor and foster professional development of junior SREs, thereby contributing to operational excellence across diverse environments.
  • Automation focus for operational efficiency: designing and implementing automation processes for repeatable and consistent service deployment.
  • A solid sense of ownership, critical thinking, and interpersonal skills to work effectively across diverse and multi-functional teams.
  • Certifications such as AWS Solutions Architect, AWS DevOps Professional, or GCP Professional Architect are a plus.