Expoint - all jobs in one place

Tesla Staff Site Reliability Engineer, Internal Tools Infrastructure
United States, Texas, Austin 
326477944

11.04.2025
What You’ll Do
  • Architect, implement, and maintain advanced automation solutions for provisioning, configuration, and monitoring of engineering tools infrastructure, ensuring scalability, resilience, and high availability
  • Oversee and support the Atlassian application stack (Jira, Confluence, Bitbucket), GitHub, Artifactory and Polarion, taking ultimate accountability for uptime, performance, and user satisfaction. Manage configurations, OSLC plugin integrations, workflows, reports, templates, permissions, re-indexing, and restoration processes
  • Lead, mentor, and manage a team of SREs and technical contributors, fostering a culture of accountability, technical excellence, and collaboration. Delegate tasks effectively, set clear goals, and provide guidance to ensure successful project delivery and operational stability
  • Partner with GitHub, Artifactory, Polarion, Mattermost, and Atlassian tool users to promptly address issues, gather feedback, and implement solutions. Work closely with development and operations teams to integrate engineering tools seamlessly into CI/CD pipelines
  • Participate in and oversee an on-call rotation, driving rapid incident response and resolution to minimize downtime. Lead post-incident reviews to identify root causes and implement preventive measures
  • Monitor system health, troubleshoot complex issues, and deploy proactive strategies to prevent disruptions. Conduct performance analysis and capacity planning to anticipate future needs and optimize resource utilization
  • Manage regular backups, upgrades, and patch cycles for engineering tools, ensuring compliance with security standards and operational stability
  • Develop and maintain comprehensive documentation, runbooks, and best practices. Promote knowledge sharing within the team and across the organization to enhance tool adoption and administration efficiency
  • Collaborate with leadership to define long-term strategies for tool infrastructure, aligning with organizational growth and technical objectives. Assess and integrate new technologies to enhance reliability and efficiency
  • Drive the adoption of automation frameworks and modern practices, mentoring the team in scripting and tool development to reduce manual effort and improve system reliability
What You’ll Bring
  • Bachelor’s Degree in Computer Science, Information Technology, or a related field (or equivalent experience)
  • Extensive experience in the installation, configuration, development, debugging, support, and upgrades of GitHub Enterprise and Atlassian tools (Jira, Confluence, Bitbucket)
  • Proficiency in managing and automating Confluence Spaces, permissions, and Jira projects, with a track record of optimizing user workflows
  • Deep knowledge of Polarion administration, including templates, workflows, permissions, OSLC integrations, and High Availability setups (HA experience highly desirable)
  • Strong programming and scripting skills (Python, Shell, Golang) with hands-on experience in automation frameworks like Ansible for administration, monitoring, and custom plugin/workflow development
  • Expertise in containerization (Docker) and orchestration (Kubernetes), with practical application in production environments
  • Familiarity with monitoring and logging tools such as Prometheus, Grafana, and Splunk to ensure observability and performance insights
  • Proven ability to diagnose and resolve complex issues across storage, OS, network, virtualization, and application/database stacks
  • Demonstrated experience leading technical teams, with a focus on mentoring, coaching, and fostering professional growth
  • Strong project management skills, with the ability to prioritize tasks, manage resources, and deliver on deadlines in a fast-paced environment