Expoint - all jobs in one place

Tesla Sr. Site Reliability Engineer Dojo 
United States, California, Palo Alto 
11539926

10.04.2025
What You’ll Do
  • Respond to customer inquiries and resolve issues in a timely and professional manner
  • Manage and prioritize change requests, ensuring minimal disruption to cluster operations
  • Collaborate with third-party storage vendors to resolve issues and outages
  • Troubleshoot and debug storage-related problems, ensuring prompt resolution and minimal downtime
  • Work with network vendors to debug and resolve issues, improving overall network reliability
  • Create visibility into network issues by developing and deploying monitoring and reporting tools
  • Collaborate with facility and operations teams to plan and execute maintenance, upgrades, and shutdowns
  • Ensure seamless communication and coordination during planned and unplanned outages
  • Troubleshoot and debug hardware issues through automation, identifying root causes and implementing fixes
  • Develop and implement automation scripts to improve hardware monitoring and maintenance
What You’ll Bring
  • 3+ years of experience in an SRE or similar infrastructure engineering role
  • Strong understanding of Linux, networking, and storage systems
  • Excellent problem-solving and troubleshooting skills, with the ability to debug complex issues
  • Experience with automation tools and languages such as Ansible or Python
  • Strong communication and collaboration skills, with the ability to work with various teams and vendors
  • Ability to work in a fast-paced environment, with a focus on delivering high-quality results
  • Familiarity with monitoring and logging tools such as Prometheus, Grafana, or ELK (preferred)
  • Experience with cloud-based infrastructure (preferred)