Expoint - all jobs in one place

מציאת משרת הייטק בחברות הטובות ביותר מעולם לא הייתה קלה יותר

Limitless High-tech career opportunities - Expoint

Tesla Software Engineer Distributed Training AI Infrastructure 
United States, California, Palo Alto 
752724609

17.04.2025
What You’ll Do
  • Write robust Python software code in our machine learning training repository while applying best software practices to support the research team
  • ​Increase the reliability of our training jobs by debugging and root causing failures across thousands of nodes and implementing fixes to prevent future failures
  • ​Improve our training framework to support new training paradigms and experimentation methods
  • ​Build and improve our monitoring/observability infra to quickly debug cluster and training application issues
  • Profile and identify performance bottlenecks of training software in our training cluster
  • ​Coordinate with the supercomputing team managing the training cluster to maintain high availability and job throughput
What You’ll Bring
  • Members of the Autopilot AI Infrastructure team are expected to be adaptable to the dynamic requirements of AI research and capable of contributing across all parts of the AI training software stack
  • ​Practical programming experience in Python and/or C/C++
  • ​Experience working with ML training frameworks (ideally PyTorch)
  • ​Demonstrated experience scaling neural network training jobs across many GPUs
  • ​Experience with parallel programming concepts and primitives
  • ​Experience profiling and optimizing CPU-GPU interactions (pipelining computation with data transfers, etc)
  • ​Proficient in system-level software, in particular hardware-software interactions and resource utilization
  • ​Understanding of state-of-the-art deep learning concepts
  • ​Experience programming in CUDA/Triton and/or NCCL internals