As a Software Engineer within Autopilot, you will work on reinforcing, optimizing, and scaling our neural network training infrastructure.
What You’ll Do
Write robust Python code in our machine learning training repository, applying software best practices, to support machine learning scientists in tasks such as fetching training data, preprocessing it, and orchestrating training runs
Integrate the training software into our continuous integration cluster to support metrics persistence across experiments, weekly and nightly neural network builds, and other unit and throughput tests
Profile the performance of the training software in our training cluster, identify bottlenecks in and between CPU and GPU code execution, and optimize its throughput and scalability within and across nodes to ultimately reduce convergence time
Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for machine learning
What You’ll Bring
Practical experience programming in Python and/or C/C++
Proficiency in system-level software, in particular hardware-software interactions and resource utilization
Understanding of modern machine learning concepts and state-of-the-art deep learning
Experience working with training frameworks, ideally PyTorch
Demonstrated experience scaling neural network training jobs across clusters of GPUs
Experience programming in CUDA
Experience profiling and optimizing CPU-GPU interactions (pipelining compute and transfers, etc.)
DevOps experience, in particular with clusters of training nodes and filesystems for very large amounts of training data