Reduce wall-clock time to convergence of our training jobs by identifying bottlenecks across the ML stack, from data loading up to the GPU
Integrate efficient, low-level code with the overall high-level training framework
Profile our workloads and implement solutions to increase training efficiency (a representative profiling sketch follows this list)
Optimize workloads for efficient hardware utilization (e.g. CPU and GPU compute, data throughput, networking)
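To make the profiling responsibility concrete, here is a minimal sketch of locating a data-loading vs. compute bottleneck with torch.profiler; the model, loader, and sizes are hypothetical stand-ins, not our actual stack:

```python
import torch
from torch.profiler import profile, ProfilerActivity, schedule

# Hypothetical model and data-loader stand-ins; a real training job
# would substitute its own.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [(torch.randn(64, 1024), torch.randn(64, 1024)) for _ in range(20)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
) as prof:
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule each iteration

# Time spent in host-to-device copies vs. kernels shows whether the
# job is input-bound or compute-bound.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```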
What You’ll Bring
Members of the Autopilot AI Infrastructure team are expected to be adaptable to the dynamic requirements of AI research and capable of contributing across all parts of the AI training software stack
Practical experience programming in Python and/or C/C++
Experience programming in CUDA, cuDNN, or Triton, particularly in the context of operations used in AI workloads (see the kernel sketch after this list)
Experience profiling and optimizing CPU-GPU interactions, such as pipelining computation with data transfers (see the overlap sketch after this list)
Experience working with training frameworks (ideally PyTorch)
Proficiency in system-level software, in particular hardware-software interactions and resource utilization
Experience with parallel programming concepts and primitives
Understanding of modern machine learning concepts and state-of-the-art deep learning
Experience scaling neural network training jobs across many GPUs (see the data-parallel sketch after this list)
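For the CUDA/Triton item above, a minimal sketch of the kind of custom kernel involved: an elementwise add in Triton. Names and the block size are illustrative only.

```python
import torch
import triton
import triton.language as tl

# A minimal Triton kernel: masked elementwise add over 1D tensors.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program instance per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```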
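For the CPU-GPU interaction item, a sketch of overlapping host-to-device copies with compute using pinned memory and a separate CUDA stream, assuming a single GPU; compute() and the batch shapes are placeholders.

```python
import torch

copy_stream = torch.cuda.Stream()
batches = [torch.randn(64, 1024).pin_memory() for _ in range(8)]

def compute(x):
    return x @ x.t()  # placeholder for a real forward/backward pass

# Prefetch batch i+1 on the copy stream while batch i computes on the
# default stream; non_blocking copies from pinned memory run async.
with torch.cuda.stream(copy_stream):
    next_gpu = batches[0].to("cuda", non_blocking=True)

for i in range(len(batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # batch i copy done
    cur = next_gpu
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):
            next_gpu = batches[i + 1].to("cuda", non_blocking=True)
    out = compute(cur)

torch.cuda.synchronize()
```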
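And for multi-GPU scaling, a minimal data-parallel sketch with PyTorch DistributedDataParallel, whose backward pass all-reduces gradients across ranks (one of the parallel primitives mentioned above). It assumes a torchrun launch; the model and objective are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`,
    # which sets LOCAL_RANK and the rendezvous environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).square().mean()  # placeholder objective
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```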