Expoint - all jobs in one place

Tesla: Staff Site Reliability Engineer, AI Platform
United States, California, Palo Alto 
835592533

10.04.2025
What to Expect

As a Site Reliability Engineer (SRE) for the AI Platform team, you will manage bleeding-edge bare-metal servers for Tesla's advanced generative AI platform. You will be responsible for the imaging, configuration management, observability, security, and scalability of these systems. You'll also manage the model benchmarks and their outputs. You should focus on automating whatever the AI Platform team requires and on using the available tooling to make it as easy as possible for the team's software engineers to run their services reliably on the bare-metal platform.

What You’ll Do
  • Help image bare-metal servers
  • Build tooling around bare-metal imaging, evaluate its usage, and help ensure its reliability, availability, and security
  • Design software and systems that enable the generative AI platform at Tesla
  • Assist the AI Platform team with onboarding and integrating services into the Tesla stack (Kubernetes/VMware/bare-metal)
  • Ensure services follow best practices for observability, including metrics, logging, tracing, and alerting
  • Automate configuration and deployment of services
  • Consult on and design infrastructure, systems, and software architecture
What You’ll Bring
  • Experience with bare-metal imaging and management
  • Expert skills in Linux and its administration (Ubuntu 22.04/24.04)
  • Experience in a high-level language such as Go, Python, and/or Java
  • Observability (OpenTelemetry, Prometheus, Alertmanager, Grafana, Jaeger, and Splunk)
  • Infrastructure as Code (Ansible) and CI/CD pipeline experience (GitHub Actions, Jenkins)
  • Artifact management (Artifactory)
  • Strong bias for action over endless planning; willing to get your hands dirty and occasionally make mistakes
  • Habitual documenter and spreader of knowledge
  • Willing to mentor other team members and engineers with less SRE experience
  • Comfortable on an on-call rotation and doing live troubleshooting of issues on NOC bridges/outage calls