Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier

Tesla Sr. Site Reliability Engineer, Machine Learning Operations Infrastructure
United States, Texas, Austin 
853177798

23.04.2025
What You’ll Do
  • Mature our Machine Learning Operations platform, advocate best practices to MLOps engineers, and design and implement scalable, automated workflows for the complete ML lifecycle
  • Maintain Kubernetes-based infrastructure for model training, deployment, and monitoring
  • Develop solutions for workload orchestration and time-slicing using tools like Flyte and Ray
  • Implement and optimize CI/CD pipelines tailored for machine learning applications
  • Leverage GPU capabilities, including MIG, to maximize efficiency for AI/ML workloads
  • Set up model monitoring systems to track performance, ensure robustness, and scale workloads as needed
  • Collaborate with engineers to build and maintain robust pipelines for training and inference workflows
  • Develop Infrastructure-as-Code (IaC) solutions for deploying and managing cloud/on-prem ML environments
  • Design and develop intuitive, user-friendly self-service portals using React to enable data scientists and engineers to manage ML pipelines, monitor models, and access resources seamlessly
  • Participate in 24x7 on-call rotation
What You’ll Bring
  • Strong hands-on experience with tools and frameworks like Kubernetes, Kubeflow, MLflow, Flyte, and Ray
  • Proven experience with React for building interactive web applications, especially self-service portals that enhance the user experience for managing ML pipelines and workflows
  • Expertise in MIG, time-slicing, and scaling AI workloads efficiently
  • Proficiency in Python, Golang, and Bash for pipeline development and automation
  • Model deployment and serving: TensorFlow Serving, TorchServe, FastAPI, Flask, and REST/gRPC on scalable architectures
  • Proficiency with Linux fundamentals and performance optimizations
  • Experience with configuration management software (Ansible, etc.), systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.)
  • Strong analytical and problem-solving abilities to troubleshoot and optimize AI/ML systems
  • Ability to collaborate with cross-functional teams, including data scientists, data engineers, and DevOps engineers, to deliver high-quality solutions
  • Excellent troubleshooting skills in production
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, or Physics, or proof of exceptional skills in a related field, or equivalent experience