Design, build, and manage robust ML pipelines for training, validation, and deployment
Build and maintain scalable infrastructure for ML experiments and inference across multiple public clouds
Implement CI/CD for ML systems, ensuring reproducibility and traceability
Advocate for automation at every layer of the infrastructure stack using Infrastructure as Code (IaC) principles and tools such as Terraform, Helm, and GitOps frameworks
Monitor models in production for performance degradation, drift, and fairness
Participate in the on-call rotation for ML Operations
Work closely with data scientists, engineers, and product managers to understand requirements and integrate models into applications
Minimum Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent industry experience) plus 8 years of DevOps experience; Master's degree plus 6 years of related experience; or PhD plus 3 years of related experience
Strong understanding of CI/CD pipelines and automation tools
Knowledge of cloud platforms (AWS, Azure, GCP)
Proficiency in Python and familiarity with ML libraries (e.g., scikit-learn, PyTorch, TensorFlow)
Strong understanding of ML lifecycle management and model versioning
Preferred Qualifications:
Experience deploying large language models (LLMs) or generative AI systems
Familiarity with feature stores, vector databases, or data observability platforms
Excellent communication, collaboration, and mentoring skills
Deep expertise in CI/CD tooling and practices, including hands-on experience with systems such as Jenkins, GitLab, or Argo CD
Strong proficiency in Kubernetes, Docker, and cloud-native patterns on AWS, Azure, or GCP