Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier

Tesla - Observability Software Engineer, AI Infrastructure
United States, California, Fremont 
980948315

06.04.2025
What You’ll Do
  • Design, develop and maintain observability solutions & tools, including monitoring, logging, and alerting systems, to improve system visibility and performance
  • Create dashboards and automated alerts using tools such as Grafana, Prometheus, Splunk, and Catchpoint to enhance monitoring frameworks and ensure proactive issue detection and resolution
  • Analyze system metrics and logs to identify bottlenecks, optimize application performance, and ensure system reliability end-to-end while scaling
  • Partner with developers, DevOps engineers, and AI Infra teams to integrate observability best practices into the development and deployment lifecycle
  • Assist in troubleshooting and resolving production issues by leveraging observability data to identify root causes and implement preventative measures
  • Develop scripts or workflows to automate routine tasks and improve observability tool integrations
  • Create and maintain documentation for observability tools, processes, and workflows to ensure knowledge sharing and accessibility
What You’ll Bring
  • 3+ years of experience in software engineering, DevOps, or SRE roles with a focus on observability or monitoring
  • Proficiency in monitoring and visualization tools (e.g., Prometheus, Grafana, Splunk, Catchpoint)
  • Strong analytical and troubleshooting skills with a focus on system performance and reliability
  • Working knowledge of high-performance computing, Slurm, GPU architecture, and networking
  • Working knowledge of logging systems and distributed tracing frameworks such as OpenTelemetry
  • Expertise in scripting languages (e.g., Python, Bash) and familiarity with configuration management tools (e.g., Terraform, Ansible)
  • Experience with containerized environments (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, Azure)
  • Excellent verbal and written communication skills, with the ability to collaborate effectively across teams
  • Bachelor’s Degree in Computer Science, Software Engineering, or a related field, or equivalent experience