You’ll help define how AI models are deployed and scaled in production, driving decisions on everything from memory orchestration and compute scheduling to inter-node communication and system-level optimizations. This is an opportunity to work with top engineers, researchers, and partners across NVIDIA and leave a mark on the way generative AI reaches real-world applications.
What You’ll Be Doing:
- Design and evolve scalable architectures for multi-node LLM inference across GPU clusters.
- Develop infrastructure to optimize latency, throughput, and cost-efficiency of serving large models in production.
- Collaborate with model, systems, compiler, and networking teams to ensure holistic, high-performance solutions.
- Prototype novel approaches to KV cache handling, tensor/pipeline-parallel execution, and dynamic batching.
- Evaluate and integrate new software and hardware technologies relevant to model inference (e.g., memory hierarchy, network topology, modern inference architectures).
- Work closely with internal teams and external partners to translate high-level architecture into reliable, high-performance systems.
- Author design documents, internal specs, and technical blog posts and contribute to open-source efforts when appropriate.
What We Need to See:
- Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or equivalent experience.
- 5+ years of experience building large-scale distributed systems or performance-critical software.
- Deep understanding of deep learning systems, GPU acceleration, and AI model execution flows.
- Solid software engineering skills in C++ and/or Python, with strong familiarity with CUDA or similar platforms.
- Strong system-level thinking across memory, networking, scheduling, and compute orchestration.
- Excellent communication skills and ability to collaborate across diverse technical domains.
Ways to Stand Out from the Crowd:
- Experience working on LLM inference pipelines, transformer model optimization, or model-parallel deployments.
- Demonstrated success in profiling and optimizing performance bottlenecks across the LLM training or inference stack.
- Familiarity with data center-scale orchestration, cluster schedulers, or AI service deployment pipelines.
- Passion for solving tough technical problems and shipping high-impact solutions.