- Lead the development of our distributed training platform for large language models up to 400B parameters
- Design high-performance training systems that produce models optimized for edge deployment
- Drive innovation in both large-scale training and edge-optimized model deployment

Key job responsibilities
- Design and optimize data parallelism, tensor parallelism, and pipeline parallelism strategies for large language models
- Implement memory optimization techniques such as activation recomputation, ZeRO, and mixed precision training (see the illustrative sketch after this list)
- Develop infrastructure that supports novel distillation and compression techniques for edge deployment
- Create evaluation frameworks to measure performance of compressed models on target edge hardware
- Collaborate with ML scientists to optimize training for downstream compression requirements
- Benchmark and profile training configurations to maximize throughput and GPU utilization
- Build pipelines that connect large-scale training to edge model deployment workflows
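For illustration only, and not part of the role description: a minimal sketch of two of the memory optimization techniques named above, assuming PyTorch. Mixed precision is handled with autocast and GradScaler, and activation recomputation with torch.utils.checkpoint. The toy block, tensor sizes, and loss are placeholders; ZeRO-style sharding (e.g. via FSDP or DeepSpeed) is omitted.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy stand-in for a transformer block; a real model would be far larger and
# sharded across ranks with FSDP/ZeRO, which this sketch omits.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(block.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

x = torch.randn(8, 1024, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        # Activation recomputation: intermediates inside `block` are not kept;
        # they are recomputed during the backward pass to save memory.
        out = checkpoint(block, x, use_reentrant=False)
        loss = out.float().pow(2).mean()  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```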
A day in the life
You'll start your day analyzing performance metrics from overnight training runs, identifying bottlenecks that are limiting throughput on our GPU clusters. After a quick stand-up with the team, you might pair with an ML scientist to implement a new parallelism strategy that reduces memory usage while maintaining computational efficiency.
We tackle the full AI pipeline - from training massive models at scale to compressing and distilling them for efficient edge deployment. This end-to-end approach allows us to optimize each stage of the process specifically for our target devices, achieving capabilities that would be impossible with off-the-shelf solutions.

Basic qualifications
- 5+ years of non-internship professional software development experience
- 5+ years of experience programming with at least one software programming language
- 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience as a mentor, tech lead or leading an engineering team
- Experience with distributed systems or high-performance computing
- Proficiency in Python and at least one systems programming language (C++, Rust, etc.)
- Experience with machine learning frameworks such as PyTorch or TensorFlow
- Understanding of GPU programming and optimization techniques
- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Bachelor's degree in computer science or equivalent

Preferred qualifications
- Experience scaling distributed training for large language models (30B+ parameters)
- Deep knowledge of PyTorch internals and distributed training modules
- Hands-on experience with parallelism strategies (Data, Tensor, Pipeline, ZeRO)
- Experience with model compression techniques (quantization, distillation, pruning)
- Experience optimizing GPU memory usage and communication patterns
- Knowledge of CUDA programming and custom kernel development
- Background in cloud infrastructure (AWS, Kubernetes) for ML workloads
- Experience with mixed precision training and quantization techniques (a brief quantization sketch follows this list)
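For illustration only, and not a requirement of the posting: a minimal sketch of one compression path mentioned above, post-training dynamic quantization, assuming PyTorch. The toy model and shapes are placeholders; distillation and pruning are not shown.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for a trained network targeted at edge hardware.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128)).eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8 and
# dequantized on the fly, shrinking the model and speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same call signature as the original model
```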