We’re looking for an experienced Site Reliability Engineer to join our infrastructure team. In this role, you’ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power the training, fine-tuning, and serving of generative AI models.
Responsibilities
- Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.
- Observability: Design and maintain monitoring, alerting, and logging systems that provide real-time visibility into model serving pipelines and infrastructure.
- Performance Optimization: Analyze system performance and scalability, and optimize resource utilization (compute, GPU clusters, storage, networking).
- Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU and GPU environments.
- Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
- Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.
- Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.
Required Qualifications
- 4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Other Qualifications
- Strong proficiency in Kubernetes, Docker, and container orchestration.
- Knowledge of CI/CD pipelines for inference and ML model deployment.
- Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.
- Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.).
- Strong programming/scripting skills in Python, Go, or Bash.
- Solid knowledge of distributed systems, networking, and storage.
- Experience running large-scale GPU clusters for ML/AI workloads (preferred).
Preferred Qualifications
- Familiarity with ML training/inference pipelines.
- Experience with high-performance computing (HPC) and workload schedulers (e.g., Kubernetes operators).
- Background in capacity planning & cost optimization for GPU-heavy environments.
What We Offer
- Work on cutting-edge infrastructure that powers the future of generative AI.
- Collaborate with world-class researchers and engineers.
- Impact millions of users through reliable and responsible AI deployments.
- Competitive compensation, equity options, and comprehensive benefits.