

Serve as the first point of contact (L1) for all support requests related to the AI/ML Platform, including ML training, inference, model deployment, and GPU allocation.
Provide operational and on-call (PagerDuty) support for Ray.io and Kubernetes clusters running distributed ML workloads across cloud and on-prem environments.
Monitor, triage, and resolve platform incidents involving job failures, scaling errors, cluster instability, or GPU resource contention.
Manage GPU quota allocation and scheduling across multiple user teams, ensuring compliance with approved quotas and optimal resource utilization (a quota-inspection sketch follows this list).
Support Ray Train/Tune for large-scale distributed training and Ray Serve for autoscaled inference, maintaining performance and service reliability (see the Ray Serve sketch after this list).
Troubleshoot Kubernetes workloads, including pod scheduling, networking, image pull issues, and resource exhaustion in multi-tenant namespaces.
Collaborate with platform engineers, SREs, and ML practitioners to resolve infrastructure, orchestration, and dependency issues impacting ML workloads.
Improve observability, monitoring, and alerting for Ray and Kubernetes clusters using Prometheus, Grafana, and OpenTelemetry to enable proactive issue detection.
Maintain and enhance runbooks, automation scripts, and knowledge base documentation to accelerate incident resolution and reduce recurring support requests.
Participate in root cause analysis (RCA) and post-incident reviews.
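
To give a concrete flavor of the GPU quota and scheduling triage described above, a minimal sketch using the Kubernetes Python client. The namespace name is a hypothetical placeholder, and the standard nvidia.com/gpu extended resource is assumed; this is illustrative, not a description of the actual platform.

```python
# Sketch: report GPU quota usage and Pending pods for a team namespace.
# Assumes kubeconfig access and the standard nvidia.com/gpu resource name.
from kubernetes import client, config

NAMESPACE = "team-ml"  # hypothetical tenant namespace

config.load_kube_config()
v1 = client.CoreV1Api()

# ResourceQuota objects expose hard limits vs. current usage per namespace.
for rq in v1.list_namespaced_resource_quota(NAMESPACE).items:
    hard = rq.status.hard or {}
    used = rq.status.used or {}
    for key in hard:
        if "nvidia.com/gpu" in key:
            print(f"{rq.metadata.name}: {key} used {used.get(key, '0')} of {hard[key]}")

# Pods stuck in Pending often indicate GPU contention or quota exhaustion.
pending = v1.list_namespaced_pod(NAMESPACE, field_selector="status.phase=Pending")
for pod in pending.items:
    print(f"Pending: {pod.metadata.name}")
```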
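Similarly, a minimal sketch of an autoscaled Ray Serve deployment of the kind this role supports; the class name, replica bounds, GPU request, and model stub are illustrative assumptions.

```python
# Minimal Ray Serve sketch: an autoscaled inference deployment.
# Names, replica bounds, and the model stub are illustrative assumptions.
from ray import serve
from starlette.requests import Request


@serve.deployment(
    ray_actor_options={"num_gpus": 1},   # one GPU per replica (assumption)
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,               # scale out under load
    },
)
class Predictor:
    def __init__(self):
        # A real deployment would load a TensorFlow/PyTorch model here.
        self.model = lambda x: {"echo": x}

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return self.model(payload)


app = Predictor.bind()
# serve.run(app)  # deploys onto the running Ray cluster
```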
Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical discipline (or equivalent experience).
5+ years of experience in ML operations, DevOps, or platform support for distributed AI/ML systems.
Proven experience providing L1/L2 and on-call support for Ray.io and Kubernetes-based clusters supporting ML training and inference workloads.
Strong understanding of Ray cluster operations, including autoscaling, job scheduling, and workload orchestration across heterogeneous compute (CPU/GPU/accelerators).
Hands-on experience managing Kubernetes control plane and data plane components, multi-tenant namespaces, RBAC, ingress, and resource isolation.
Expertise in GPU scheduling, allocation, and monitoring (NVIDIA device plugin, MIG configuration, CUDA/NCCL optimization); a minimal NVML monitoring sketch follows this list.
Proficiency in Python and/or Go for automation, diagnostics, and operational tooling in distributed environments.
Working knowledge of Kubernetes and cloud-native environments (AWS, GCP, Azure) and CI/CD pipelines.
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident management tools (PagerDuty, ServiceNow).
Familiarity with ML frameworks such as TensorFlow and PyTorch, and their integration within distributed Ray/Kubernetes clusters.
Strong debugging, analytical, and communication skills to collaborate effectively with cross-functional engineering and research teams.
A customer-centric, operationally disciplined mindset focused on maintaining platform reliability, performance, and user satisfaction.
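
For the GPU monitoring expertise listed above, a minimal sketch of the kind of first-pass diagnostic involved, using NVIDIA's NVML bindings (pynvml); the output format is an arbitrary choice and not tied to any specific platform tooling.

```python
# Sketch: per-GPU utilization and memory snapshot via NVML (pynvml).
# Useful as a first check during GPU contention triage; values are read-only.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu / .memory in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
        print(f"GPU {i} ({name}): {util.gpu}% util, "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB memory")
finally:
    pynvml.nvmlShutdown()
```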