Nvidia Senior System Software Engineer - AI Performance Efficiency Tools 
United States, California 
64503711

14.04.2025
US, CA, Santa Clara
Time type: Full time
Posted on: Posted 25 Days Ago

What you’ll be doing:

  • Build internal profiling and analysis tools for AI workloads at large scale

  • Build debugging tools for commonly encountered problems, such as memory or networking issues

  • Create benchmarking and simulation technologies for AI systems or GPU clusters

  • Partner with HW architects to propose new features or improve existing ones based on real-world use cases

What we need to see:

  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience

  • Strong software skills in design, coding (C++ and Python), analysis, and debugging

  • Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference

  • Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage, and networking

  • Experience with NVIDIA GPUs, CUDA programming, and NCCL

  • Motivated self-starter with strong problem-solving skills and customer-facing communication skills

  • Passion for continuous learning. Ability to work concurrently with multiple global groups

Ways to stand out from the crowd:

  • Proven experience with continuous profiling and analysis tools/platforms at GPU cluster scale

  • Solid experience in performance analysis of large AI jobs for training/inference workloads

  • Knowledge of Linux device drivers and/or compiler implementation

  • Knowledge of GPU and/or CPU architecture and general computer architecture principles

You will also be eligible for equity.