Expoint - all jobs in one place

Tesla Software Engineer Generalist AI Infrastructure 
United States, California, Palo Alto 
148078422

23.06.2024
What to Expect

As a Software Engineer within Autopilot, you will work on reinforcing, optimizing, and scaling our neural network training infrastructure.

What You’ll Do
  • Write robust Python code in our machine learning training repository, applying software best practices, to support machine learning scientists in tasks such as fetching training data, preprocessing it, and orchestrating training runs
  • Integrate the training software into our continuous integration cluster to support metrics persistence across experiments, weekly/nightly neural network builds, and other unit / throughput tests
  • Profile performance of training software in our training cluster, identify bottlenecks in and between CPU/GPU code execution, and work on optimizing its throughput and scalability within and across nodes to ultimately reduce convergence time
  • Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for machine learning workloads
What You’ll Bring
  • Practical experience programming in Python and/or C/C++
  • Proficiency in system-level software, in particular hardware-software interactions and resource utilization
  • Understanding of modern machine learning concepts and state-of-the-art deep learning
  • Experience working with training frameworks, ideally PyTorch
  • Demonstrated experience scaling neural network training jobs across clusters of GPUs
  • Experience programming in CUDA
  • Experience profiling and optimizing CPU-GPU interactions (pipelining compute/transfers, etc.)
  • DevOps experience, in particular dealing with clusters of training nodes and filesystems for very large amounts of training data