Familiarity with GPU hardware and its architecture (e.g., NVIDIA, AMD) and an understanding of how GPUs work in high-performance computing (HPC) and machine learning environments.
Ability to write custom software tools that streamline the development process, enhance hardware performance analysis, and troubleshoot complex GPU hardware issues.
Experience managing GPUs in data center or cloud-based environments (such as AWS, GCP, or Azure).
Clear written and verbal communication skills to document findings and procedures and to collaborate across teams.
Outstanding organizational and communication skills.
Bachelor's Degree in Computer Science, an engineering-related field, or equivalent related experience.
5+ years in an Operations, DevOps, or Infrastructure-focused role.
Understanding of TCP/IP, HTTP/S, and protocols specific to high-performance computing (HPC) systems, such as RDMA (Remote Direct Memory Access).
Experience with GPU hardware (e.g., NVIDIA, AMD) and strong programming skills in languages like Python, Golang, or CUDA to create custom software tools, automation scripts, and diagnostic utilities.
Understanding of GPU performance optimization, load balancing, and managing memory and compute resources for optimal performance.
Understanding of base internet infrastructure services, including DNS, DHCP, LDAP, and server virtualization, in critical large-scale distributed systems.
Experience with monitoring systems (e.g., Prometheus, Grafana) to track GPU, CPU, and memory utilization, as well as tools to diagnose and optimize hardware performance.
Familiarity with configuration management and fleet orchestration via Chef, Ansible, or others.
Note: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.