

You’ll help define how AI models are deployed and scaled in production, driving decisions on everything from memory orchestration and compute scheduling to inter-node communication and system-level optimizations. This is an opportunity to work with top engineers, researchers, and partners across NVIDIA and to shape how generative AI reaches real-world applications.
What You’ll Be Doing:
Design and evolve scalable architectures for multi-node LLM inference across GPU clusters.
Develop infrastructure to optimize latency, throughput, and cost-efficiency of serving large models in production.
Collaborate with model, systems, compiler, and networking teams to ensure holistic, high-performance solutions.
Prototype novel approaches to KV cache handling, tensor/pipeline parallel execution, and dynamic batching (see the sizing sketch after this list).
Evaluate and integrate new software and hardware technologies relevant to core Spectrum-X, such as load balancing, telemetry, congestion control, and vertical application integration.
Work closely with internal teams and external partners to translate high-level architecture into reliable, high-performance systems.
Author design documents, internal specs, and technical blog posts and contribute to open-source efforts when appropriate.
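For a flavor of the back-of-the-envelope reasoning this work involves, below is a minimal sketch of KV-cache memory sizing for a decoder-only transformer. All model dimensions and the memory budgets are illustrative assumptions, not a specific production configuration.

```python
# Illustrative sketch: estimating per-request KV-cache memory for a
# decoder-only transformer. All dimensions below are assumed example
# values, not a specific production model.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: two tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Example: a hypothetical 70B-class model with grouped-query attention.
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8,
                             head_dim=128, seq_len=8192)
print(f"KV cache per 8K-token request: {per_request / 2**30:.2f} GiB")

# Serving capacity: concurrent requests that fit in the memory left
# after weights, assuming 80 GiB of HBM and ~40 GiB of weights per GPU.
hbm_budget = 80 * 2**30 - 40 * 2**30
print(f"Concurrent 8K-token requests per GPU: {hbm_budget // per_request}")
```

Estimates like this are what make careful KV-cache handling and dynamic batching decisive for cost-efficient serving at scale.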
What We Need to See:
Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or equivalent experience.
8+ years of experience building large-scale distributed systems or performance-critical software.
Deep understanding of deep learning systems, GPU acceleration, and AI model execution flows, and/or high-performance networking.
Solid software engineering skills in C++ and/or Python, ideally with demonstrated familiarity with CUDA or similar platforms.
Strong system-level thinking across memory, networking, scheduling, and compute orchestration.
Excellent communication skills and ability to collaborate across diverse technical domains.
Ways to Stand Out from the Crowd:
Experience working on LLM training or inference pipelines, transformer model optimization, or model-parallel deployments.
Demonstrated success in profiling and optimizing performance bottlenecks across the LLM training or inference stack (a minimal timing sketch follows this list).
Experience with AI accelerators, distributed communication patterns, congestion control, and/or load balancing.
A proven track record of optimizing complex systems deployed at scale, with measurable impact.
Passion for solving tough technical problems and shipping high-impact solutions.
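As a hedged illustration of the profiling work mentioned above, the sketch below times a GPU operation with CUDA events rather than wall-clock time, since kernel launches are asynchronous. It assumes PyTorch and a CUDA-capable device; the matmul is a placeholder for any kernel under study.

```python
# Minimal sketch of GPU kernel timing with CUDA events (assumes PyTorch
# and an available CUDA device; the matmul is a stand-in workload).
import torch

def time_gpu_op(op, warmup: int = 10, iters: int = 100) -> float:
    """Return mean latency in milliseconds, measured with CUDA events."""
    for _ in range(warmup):          # warm up caches and clock ramp-up
        op()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        op()
    end.record()
    torch.cuda.synchronize()         # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"matmul: {time_gpu_op(lambda: a @ b):.3f} ms")
```

Event-based timing like this avoids the classic pitfall of measuring only the asynchronous launch overhead instead of actual kernel execution time.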