Nvidia Senior Deep Learning Systems Engineer Datacenters 
United States, California 
19048957

Location: US, CA, Santa Clara
Time type: Full time
Posted: 2 days ago

What you'll be doing:

  • Help develop software infrastructure to characterize and analyze a broad range of Deep Learning applications.

  • Evolve cost-efficient datacenter architectures tailored to meet the needs of Large Language Models (LLMs).

  • Work with experts to help develop analysis and profiling tools in Python, bash and C++ to measure key performance metrics of DL workloads running on Nvidia systems.

  • Analyze system and software characteristics of DL applications.

  • Develop analysis tools and methodologies to measure key performance metrics and to estimate potential for efficiency improvement.

What we need to see:

  • A Bachelor’s degree in Electrical Engineering or Computer Science, or equivalent experience (Master’s or PhD degree preferred).

  • 8 years or more of relevant experience.

  • Experience in at least one of the following:

    • System Software: Operating Systems (Linux), Compilers, GPU kernels (CUDA), DL Frameworks (PyTorch, TensorFlow).

    • Silicon Architecture and Performance Modeling/Analysis: CPU, GPU, Memory or Network Architecture.

  • Experience programming in C/C++ and Python. Exposure to Containerization Platforms (docker) and Datacenter Workload Managers (slurm) is a plus.

  • A deep understanding of computer system architecture and performance analysis is essential for success in this role. Applicants should have demonstrated hands-on experience in these domains.

  • Demonstrated ability to work in virtual environments, and a strong drive to own tasks from beginning to end. Prior experience with such environments will make you stand out.

Ways to stand out from the crowd:

  • Background with system software, Operating system intrinsics, GPU kernels (CUDA), or DL Frameworks (PyTorch, TensorFlow).

  • Experience with silicon performance monitoring or profiling tools (e.g. perf, gprof, nvidia-smi, dcgm).

  • In-depth performance modeling experience in any one of CPU, GPU, Memory or Network Architecture.

  • Exposure to Containerization Platforms (docker) and Datacenter Workload Managers (slurm).

  • Prior experience with multi-site teams or multi-functional teams.

You will also be eligible for equity and benefits.