Tesla Software Engineer Generalist AI Infrastructure
United States, California, Palo Alto
148078422

23.06.2024

What to Expect

As a Software Engineer within Autopilot, you will work on reinforcing, optimizing, and scaling our neural network training infrastructure.

What You’ll Do

Write robust Python code in our machine learning training repository, applying software best practices, to support machine learning scientists in tasks such as fetching training data, preprocessing it, and orchestrating training runs
Integrate the training software into our continuous integration cluster to support metrics persistence across experiments, weekly/nightly neural network builds, and other unit and throughput tests
Profile the performance of the training software in our training cluster, identify bottlenecks within and between CPU and GPU code execution, and optimize its throughput and scalability within and across nodes to ultimately reduce convergence time
Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for machine learning workloads

What You’ll Bring

Practical experience programming in Python and/or C/C++
Proficiency in system-level software, in particular hardware-software interactions and resource utilization
Understanding of modern machine learning concepts and state-of-the-art deep learning
Experience working with training frameworks, ideally PyTorch
Demonstrated experience scaling neural network training jobs across clusters of GPUs
Experience programming in CUDA
Experience profiling and optimizing CPU-GPU interactions (pipelining compute and transfers, etc.)
DevOps experience, in particular with clusters of training nodes and filesystems for very large amounts of training data
