Expoint – all jobs in one place

Apple: Site Reliability Engineer, Software Development
United States, West Virginia 
101233289

05.09.2025
  • Ensure System Reliability: Design, build, and maintain robust, scalable, and observable systems for our core software delivery services.
  • Automate: Reduce operational toil by developing automation and tooling to prevent and rapidly resolve production issues.
  • Improve Incident Response: Own and refine our incident management processes to ensure high availability.
  • Collaborate with Engineers: Partner with development teams to create elegant, high-quality solutions that support the entire workflow, from source code to customer release.
  • Improve and Modernize Systems: Proactively identify and eliminate technical debt to improve long-term reliability and maintainability.
  • Experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer focused on infrastructure in a large-scale distributed environment.
  • Strong software development skills in a language such as Swift, Go, or Python, and a high degree of comfort with shell scripting (Bash).
  • Hands-on experience building and managing systems with container and orchestration tools (Docker, Kubernetes).
  • Deep understanding of networking (TCP/IP, DNS, HTTP) and experience using observability tools (monitoring, logging, tracing) to diagnose complex issues.
  • Excellent problem-solving and communication skills, with a strong sense of ownership and drive.
  • Proven experience leading initiatives to reduce technical debt, refactor systems, or improve performance and latency.
  • Expertise in performance analysis and capacity planning for global, distributed systems.
  • Experience with large-scale distributed databases (e.g., Cassandra, FoundationDB) or messaging systems (e.g., Kafka).
  • Demonstrated ability to lead incident response for high-impact outages.
  • Familiarity with using Generative AI (GenAI) or Large Language Models (LLMs) to accelerate operational tasks, such as automating runbooks, generating scripts, or analyzing incident data.