Apple Senior Compute Site Reliability Engineer GPU
United States, Washington, Seattle
630602393

07.04.2025

As a Site Reliability Engineer, you will be responsible for providing the platform for mission-critical cloud systems to maintain constant uptime, scale seamlessly, and allow new applications and services to flourish. The successful candidate will be highly self-motivated with a passion for excellence, quality, and detail. The SRE will not only support operations but also work closely with the developers and architects within the team to aid in the design and assist with the implementation to improve stability, security, and scalability.

As an SRE in this team, you will:
* Design and deploy GPU-accelerated VM and container infrastructure using platforms such as KVM, QEMU, AWS, or Google Cloud.
* Implement GPU-based Kubernetes clusters to support containerized applications and services.
* Work with data scientists, developers, and other stakeholders to understand requirements and provide solutions for GPU-accelerated tasks.
* Implement best practices for security, scalability, and high availability.
* Monitor and optimize resource utilization to ensure performance and cost-efficiency.
* Actively participate in capacity planning, scale testing, and disaster recovery exercises.
* Troubleshoot issues across the entire infrastructure stack.
* Cultivate and maintain relationships with internal and external third-party vendors.

5+ years in a Site Reliability Engineering, DevOps, or infrastructure-focused role
Proven experience with GPU-based virtual machine infrastructure and cloud platforms (e.g., AWS, GCP).
Experience with GPU hardware (e.g., NVIDIA, AMD) and associated software stack (e.g., CUDA, cuDNN).
Experience with GitOps, deployment strategies, and CI/CD tools such as Spinnaker and Argo
Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana, and Prometheus
Outstanding organizational and communication skills
BS/MS degree (Engineering or Computer Science) or equivalent work experience

Strong verbal and written communication skills
Knowledge of Kubernetes, including deployment, management, and optimization of clusters.
Automation advocate - you truly believe in removing operational load via software.
A strong sense of ownership; at the same time, you're a great teammate who communicates clearly and transparently. Self-motivated, inquisitive, and always looking to learn more.
Experience managing, scaling, and troubleshooting Golang and GPU applications.
Ability to work independently and manage multiple priorities effectively.
CNCF Certified Kubernetes Administrator (CKA) certification

Note: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.