About the Role
What the Candidate Will Do
- Design and build tools to empower production teams to innovate and productionize state-of-the-art deep learning models at Uber.
- Develop and maintain scalable, end-to-end deep learning training systems and frameworks.
- Ensure distributed training tools are reliable, efficient, and flexible enough to support new production use cases.
- Collaborate with cross-functional teams including machine learning engineers, backend engineers, data scientists, and data engineers to deliver robust ML solutions for Uber.
Basic Qualifications
- Master's degree in a relevant field (CS, EE, Math, Stats, etc.) and 6 years of full-time software engineering experience in deep learning.
- Proficiency in Python and PyTorch
- Expertise in designing, debugging, and optimizing distributed deep learning systems.
- Hands-on experience with distributed training in PyTorch at scale (e.g., data parallelism, model parallelism).
- Strong ability to translate complex DL requirements and problems into scalable solutions.
Preferred Qualifications
- Expertise in distributed training frameworks such as DDP, DeepSpeed, FSDP, or TorchRec.
- Familiarity with C++, Go or CUDA programming.
- Expertise in optimizing GPU/TPU training performance and data loading efficiency.
- Familiarity with large-scale distributed infrastructure tools such as Ray, OpenAI Triton, or PyTorch Lightning.
- Experience building and deploying end-to-end machine learning systems in production.
- Experience training large models (10B+ parameters), such as large recommendation systems or large language models (LLMs).
- PhD in a relevant field (CS, EE, Math, Stats, etc.).
For San Francisco, CA-based roles: The base salary range for this role is USD$223,000 per year - USD$248,000 per year.
For Seattle, WA-based roles: The base salary range for this role is USD$223,000 per year - USD$248,000 per year.
For Sunnyvale, CA-based roles: The base salary range for this role is USD$223,000 per year - USD$248,000 per year.