
Nvidia – Senior LLM Training Framework Engineer
Location: Shanghai, China
Job requisition ID: 695391338
Time type: Full time
Posted 6 Days Ago
What you’ll be doing:

  • Build and develop the open-source Megatron Core library.

  • Address extensive AI training and inference obstacles, covering the entire model lifecycle including orchestration, data pre-processing, conducting model training and tuning, and deploying models.

  • Work at the intersection of AI applications, libraries, frameworks, and the entire software stack.

  • Spearhead advancements in model architectures, distributed training strategies, and model parallel approaches.

  • Accelerate foundation model training and optimization through mixed-precision recipes and advanced NVIDIA GPU architectures.

  • Tune and optimize the performance of deep learning frameworks and software components.

  • Research, prototype, and develop robust and scalable AI tools and pipelines.
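To give a flavor of the mixed-precision training work mentioned above, here is a minimal, hedged sketch of a single mixed-precision training step in PyTorch (one of the frameworks named in the requirements). This is purely illustrative — it is not NVIDIA's internal code, and the tiny model and data are placeholders:

```python
import torch

# Placeholder model and optimizer for illustration only.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# GradScaler rescales the loss to avoid fp16 gradient underflow;
# it is a no-op when CUDA is unavailable (enabled=False).
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 16)
y = torch.randn(8, 4)

# autocast runs eligible ops in a lower-precision dtype
# (fp16 on GPU, bf16 on CPU) while keeping master weights in fp32.
with torch.autocast(device_type="cuda" if use_cuda else "cpu",
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # backward on the (possibly) scaled loss
scaler.step(opt)               # unscales gradients, then optimizer step
scaler.update()                # adjusts the scale factor for the next step
```

In production-scale training, this recipe is combined with the distributed and model-parallel strategies the role describes; the sketch shows only the single-device core of the idea.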

What we need to see:

  • MS, PhD or equivalent experience in Computer Science, AI, Applied Math, or related fields and 5+ years of industry experience.

  • Experience with AI training frameworks (e.g., PyTorch, JAX), and/or inference and deployment environments (e.g., TRTLLM, vLLM, SGLang).

  • Proficiency in distributed training.

  • Proficient in Python programming, software development, debugging, performance analysis, test composition, and documentation.

  • CUDA or collective communication programming skills are a big plus.

  • Consistent record of working effectively across multiple engineering initiatives and improving AI libraries with new innovations.

  • Strong understanding of AI/Deep-Learning fundamentals and their practical applications.

Ways to stand out from the crowd:

  • Proficient in large-scale AI training, with knowledge of compute system concepts such as latency and efficiency.

  • Expertise in distributed computing, model parallelism, and mixed precision training.

  • Prior experience with Generative AI techniques applied to LLM and Multi-Modal learning (Text, Image, and Video).

  • Knowledge of GPU/CPU architecture and related numerical software.

  • Familiarity with cloud computing (e.g., complete pipelines for AI training and inference on CSPs like AWS, Azure, GCP, or OCI).