Key job responsibilities
- Design and maintain large-scale distributed training systems to support multi-modal foundation models for autonomous retailing. Optimize GPU utilization for efficient model training and fine-tuning on massive datasets.
- Develop robust monitoring and debugging tools to ensure the reliability and performance of training workflows on large GPU clusters. Design and maintain large-scale auto-labeling pipelines.
Qualifications
- 3+ years of non-internship professional software development experience, including coding standards, code reviews, source control management, build processes, testing, and operations.
- 2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, and scaling).
- Proficient in Python or a related language.
- Hands-on model training experience with PyTorch and deep learning frameworks such as MMEngine or Megatron-LM; experience with large-scale deep learning or machine learning operations.
- Familiar with modern vision-language models, multi-modal AI systems, and pre-training and post-training techniques. Proficient with training profilers and performance analysis tools for identifying and optimizing bottlenecks in model training (see the profiling sketch after this list).
- Master's or PhD degree in computer science or equivalent.
- 1+ years of experience in developing, deploying, or optimizing ML models. Exceptional engineering skills in building, testing, and maintaining scalable distributed GPU training frameworks. Familiar with HuggingFace Transformers for vision-language modeling.
- Hands-on experience in large-scale multi-modal LLM and generative model training. Contributions to popular open-source LLM frameworks, or research publications in top-tier AI conferences such as CVPR, ECCV, ICCV, and ICLR.
- Experience with GPU utilization and memory optimization techniques: kernel fusion and custom kernels; mixed-precision training with lower-precision formats and dynamic loss scaling; gradient (activation) checkpointing; gradient accumulation; offloading optimizer states and smart prefetching; Fully Sharded Data Parallel (FSDP); and tensor and pipeline model parallelism (see the training-loop sketch after this list).
- Proven experience in large-scale video understanding tasks, with a focus on multi-modal learning that integrates visual and/or textual information, including designing efficient data preprocessing pipelines, building and scaling multi-modal model architectures, and conducting robust evaluation at scale.
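For illustration only, here is a minimal sketch of the kind of bottleneck analysis the profiling bullet refers to, using PyTorch's built-in torch.profiler (one of several profilers that could satisfy the requirement). The model, batch shape, and iteration count are placeholder assumptions, not this team's actual workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch; a real workload would profile a full training step.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

# Record operator-level timings and input shapes over a few forward/backward passes.
with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Rank operators by self time to surface training bottlenecks.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

On a GPU machine, sorting by "cuda_time_total" or exporting a Chrome trace with prof.export_chrome_trace is the usual next step for pinpointing kernel-level hotspots.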
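Likewise illustrative, a minimal single-process training loop combining three of the memory-optimization techniques named above: fp16 mixed precision with dynamic loss scaling, gradient (activation) checkpointing, and gradient accumulation. The model, shapes, and hyperparameters are placeholders, not a description of this team's training stack.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Residual MLP block standing in for a transformer layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)


class TinyModel(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, classes: int = 10):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        for block in self.blocks:
            # Gradient (activation) checkpointing: recompute activations during
            # the backward pass instead of storing them, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"
model = TinyModel().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
# GradScaler implements dynamic loss scaling for fp16 mixed precision.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
accum_steps = 4  # gradient accumulation: emulate a 4x larger global batch

for step in range(2 * accum_steps):
    x = torch.randn(8, 256, device=device)
    y = torch.randint(0, 10, (8,), device=device)
    # Mixed precision: run forward/backward in fp16 where numerically safe.
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(opt)  # unscales gradients; skips the step on inf/NaN
        scaler.update()   # adjusts the loss scale dynamically
        opt.zero_grad(set_to_none=True)
```

At cluster scale, the same loop would typically also wrap the model in FullyShardedDataParallel (torch.distributed.fsdp) so that parameters, gradients, and optimizer states are sharded across workers, with tensor and pipeline parallelism layered on top via frameworks such as Megatron-LM.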