Expoint - all jobs in one place

Finding the best job has never been easier

Limitless High-tech career opportunities - Expoint

Tesla HPC Engineer AI Infrastructure 
United States, California, Palo Alto 
78093135

13.08.2024
What You’ll Do
  • Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
  • Improve our monitoring & self-healing pipelines, as well as security posture
  • Work with hardware and storage vendors to tune and optimize our server, storage and network performance
  • Performance tuning & OS provisioning on Linux systems
  • Manage HPC clusters, workloads and applications
  • Automation and systems engineering
  • Participate in 24x7 on-call rotation
What You’ll Bring
  • Proficiency with scripting languages such as Python or Bash
  • Proficiency with Linux & network fundamentals
  • Experience with configuration management software (Ansible, etc.), systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.) is a plus
  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high performance storage systems is a plus
  • Experience with Slurm, LSF and storage management of parallel file systems is a plus
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in related field
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position