Perform scaling law analyses across model size, data size, data mixture, training compute, and other critical parameters to optimize our AI models on the largest self-driving dataset in the world
Develop and implement novel architectures and algorithms to effectively scale large End-to-End (E2E) self-driving models
Create and maintain infrastructure for efficient, large-scale distributed training of E2E models, resolving compute and memory bottlenecks for training and inference
Evaluate and enhance model performance, with a focus on increasing miles driven without human intervention
Work closely with cross-functional teams to deploy AI models in production, ensuring they meet stringent performance and reliability standards
Contribute to the development of tools and frameworks that improve the scalability and efficiency of model training and deployment processes
What You’ll Bring
Proven experience in scaling and optimizing large AI models, with a strong understanding of infrastructure challenges and solutions
Proficiency in Python and a deep understanding of software engineering best practices
In-depth knowledge of deep learning fundamentals, including optimization techniques, loss functions, and neural network architectures
Experience with deep learning frameworks such as PyTorch, TensorFlow, or JAX
Strong expertise in distributed computing and parallel processing techniques
Demonstrated ability to collaborate effectively with cross-functional teams
Strong problem-solving skills and the ability to troubleshoot complex system-level issues