The ideal candidate is clearly passionate about new opportunities and has a demonstrable track record of delivering new features and products. A commitment to teamwork, hustle, and strong communication skills (with both business and technical partners) are absolute requirements. Creating reliable, scalable, high-performance products requires exceptional technical expertise, a sound understanding of the fundamentals of Computer Science, and practical experience building large-scale distributed systems. This person has thrived in delivering high-quality technology products and services in a hyper-growth environment where priorities shift fast.

Key job responsibilities
- Pre-train and post-train multimodal LLMs.
- Scale model training on hyper-large GPU and AWS Trainium clusters
- Optimize training workflows using distributed training/parallelism techniques
- Optimize low-level details of the training stack, including CUDA kernels, communication collectives, and network I/O.
- Use, build on, and extend industry-leading frameworks (NeMo, Megatron Core, PyTorch, Jax, vLLM, TRT, etc.)
- Deliver results independently in a self-organizing Agile environment while constantly embracing and adapting to new scientific advances
- 5+ years of non-internship professional software development experience
- 5+ years of programming experience with at least one programming language
- 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience as a mentor, tech lead, or leader of an engineering team
- 2+ years of expertise in Machine Learning and/or Model Training.
- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Master's degree in machine learning or equivalent
- Hands-on experience and expertise in training foundation models/LLMs, and/or low-level optimization of ML training workflows, CUDA kernels, and network I/O.