
NVIDIA Software Engineer AI Systems - vLLM MLPerf
Canada, Quebec, Granby 
877568838

Canada, Toronto
Canada, Remote
Time type: Full time
Posted: 6 days ago

What you’ll be doing:

  • Design and implement highly efficient inference systems for large-scale deployments of generative AI models.

  • Define inference benchmarking methodologies and build tools that will be adopted across the industry.

  • Develop, profile, debug, and optimize low-level system components and algorithms to improve throughput and minimize latency for the MLPerf Inference benchmarks on bleeding-edge NVIDIA GPUs.

  • Productionize inference systems with uncompromised software quality.

  • Collaborate with researchers and engineers to productionize innovative model architectures, inference techniques and quantization methods.

  • Contribute to the design of APIs, abstractions, and UX that make it easier to scale model deployment while maintaining usability and flexibility.

  • Participate in design discussions, code reviews, and technical planning to ensure the product aligns with the business goals.

  • Stay up to date with the latest advancements in inference system-level optimization, propose novel research ideas, and translate them into practical, robust systems. Exploration and academic publication are encouraged.

What we need to see:

  • Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience.

  • 5+ years of experience in software development, preferably with Python and C++.

  • Deep understanding of deep learning algorithms, distributed systems, parallel computing, and high-performance computing principles.

  • Hands-on experience with ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM and SGLang).

  • Experience optimizing compute, memory, and communication performance for large-model deployments.

  • Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.

  • Ability to work closely with both research and engineering teams, translating state-of-the-art research ideas into concrete designs and robust code, and to propose novel research ideas of your own.

  • Excellent problem-solving skills, with the ability to debug complex systems.

  • A passion for building high-impact software that pushes the boundaries of what’s possible with large-scale AI.

Ways to stand out from the crowd:

  • Background in building and optimizing LLM inference engines such as vLLM and SGLang.

  • Experience building ML compilers such as Triton or TorchDynamo/TorchInductor.

  • Experience working with cloud platforms (e.g., AWS, GCP, or Azure), containerization tools (e.g., Docker), and orchestration infrastructures (e.g., Kubernetes, Slurm).

  • Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.

  • Contributions to open-source projects (please provide a list of the GitHub PRs you submitted).

You will also be eligible for equity.

Applications for this job will be accepted at least until October 12, 2025.