Expoint – all jobs in one place

eBay ML Platform Engineer
India, Karnataka, Bengaluru 
289436086

25.11.2025
What you will accomplish
  • Serve as the first point of contact (L1) for all support requests related to the AI/ML Platform, including ML training, inference, model deployment, and GPU allocation.

  • Provide operational and on-call (PagerDuty) support for Ray.io and Kubernetes clusters running distributed ML workloads across cloud and on-prem environments.

  • Monitor, triage, and resolve platform incidents involving job failures, scaling errors, cluster instability, or GPU resource contention.

  • Manage GPU quota allocation and scheduling across multiple user teams, ensuring compliance with approved quotas and optimal resource utilization.

  • Support Ray Train/Tune for large-scale distributed training and Ray Serve for autoscaled inference, maintaining performance and service reliability.

  • Troubleshoot Kubernetes workloads, including pod scheduling, networking, image issues, and resource exhaustion in multi-tenant namespaces.

  • Collaborate with platform engineers, SREs, and ML practitioners to resolve infrastructure, orchestration, and dependency issues impacting ML workloads.

  • Improve observability, monitoring, and alerting for Ray and Kubernetes clusters using Prometheus, Grafana, and OpenTelemetry to enable proactive issue detection.

  • Maintain and enhance runbooks, automation scripts, and knowledge base documentation to accelerate incident resolution and reduce recurring support requests.

  • Participate in root cause analysis (RCA) and post-incident reviews.


What you will bring
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical discipline (or equivalent experience).

  • 5+ years of experience in ML operations, DevOps, or platform support for distributed AI/ML systems.

  • Proven experience providing L1/L2 and on-call support for Ray.io and Kubernetes-based clusters supporting ML training and inference workloads.

  • Strong understanding of Ray cluster operations, including autoscaling, job scheduling, and workload orchestration across heterogeneous compute (CPU/GPU/accelerators).

  • Hands-on experience managing Kubernetes control plane and data plane components, multi-tenant namespaces, RBAC, ingress, and resource isolation.

  • Expertise in GPU scheduling, allocation, and monitoring (NVIDIA device plugin, MIG configuration, CUDA/NCCL optimization).

  • Proficiency in Python and/or Go for automation, diagnostics, and operational tooling in distributed environments.

  • Working knowledge of Kubernetes and cloud-native environments (AWS, GCP, Azure) and CI/CD pipelines.

  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident management tools (PagerDuty, ServiceNow).

  • Familiarity with ML frameworks such as TensorFlow and PyTorch, and their integration within distributed Ray/Kubernetes clusters.

  • Strong debugging, analytical, and communication skills to collaborate effectively with cross-functional engineering and research teams.

  • A customer-centric, operationally disciplined mindset focused on maintaining platform reliability, performance, and user satisfaction.