Work with a wide variety of teams at Tesla to accelerate time-to-market for new ML models
Design, implement, and deploy low-overhead instrumentation methods for troubleshooting performance issues
Analyze collected telemetry to identify bottlenecks and design practical solutions to overcome them
Develop data-driven performance improvements to existing software pipelines
Work with each team to validate these performance improvements and incorporate them into production training runs
Help make Tesla's ambitious AI-related products and services a reality
What You’ll Bring
A deep understanding of the internals of GPU-based training and inference workloads, especially handoffs of data and computation between host CPUs and GPUs
Real-world knowledge of the languages and libraries used in large-scale AI training runs (CUDA/ZLUDA, OpenCL, PyTorch, TensorFlow, GPUDirect and other RDMA-enabling services)
Experience developing and tuning low-level software using languages like C, x86 assembly, and Rust
Practical experience with different performance analysis techniques (profiling, tracing, simulation-based analysis) and knowing when each should be applied
Excellent spoken and written communication skills, including the ability to concisely communicate data-driven root causes of performance issues and how they can be remedied
An irrational love for high-performance computing and extracting the maximum number of productive FLOPs from modern AI-oriented architectures