Expoint – all jobs in one place

KLA MLOps Site Reliability Engineer 
India, Tamil Nadu, Chennai 
913766146

Today

Responsibilities:
  • Design, implement, and maintain scalable and reliable machine learning infrastructure.
  • Collaborate with data scientists and machine learning engineers to deploy and manage machine learning models in production.
  • Develop and maintain CI/CD pipelines for machine learning workflows.
  • Monitor and optimize the performance of machine learning systems and infrastructure.
  • Implement and manage automated testing and validation processes for machine learning models.
  • Ensure the security and compliance of machine learning systems and data.
  • Troubleshoot and resolve issues related to machine learning infrastructure and workflows.
  • Document processes, procedures, and best practices for machine learning operations.
  • Stay up to date with the latest developments in MLOps and related technologies.
Required Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Proven experience as a Site Reliability Engineer (SRE) or in a similar role.
  • Strong knowledge of machine learning concepts and workflows.
  • Proficiency in programming languages such as Python, Java, or Go.
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud.
  • Familiarity with containerization technologies like Docker and Kubernetes.
  • Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI.
  • Strong problem-solving skills and the ability to troubleshoot complex issues.
  • Excellent communication and collaboration skills.
Preferred Qualifications:
  • Master's degree in Computer Science, Engineering, or a related field.
  • Experience with machine learning frameworks such as TensorFlow, PyTorch, or scikit-learn.
  • Knowledge of data engineering and data pipeline tools such as Apache Spark, Apache Kafka, or Airflow.
  • Experience with monitoring and logging tools such as Prometheus, Grafana, or the ELK stack.
  • Familiarity with infrastructure as code (IaC) tools like Terraform or Ansible.
  • Experience with automated testing frameworks for machine learning models.
  • Knowledge of security best practices for machine learning systems and data.

Minimum Qualifications

Master's or Bachelor's degree and 2 years of related work experience