Expoint - all jobs in one place

Nvidia Senior Infrastructure Performance Development Engineer 
United States, California 
384670928

24.06.2024

What you will be doing:

  • Build tools and frameworks that provide real-time application performance metrics that can be correlated with system metrics.

  • Develop automation frameworks that empower applications to thoughtfully predict and overcome system/infrastructure failures, ensuring fault tolerance.

  • Collaborate with software teams to pinpoint performance bottlenecks. Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.

  • Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.

  • Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.

What We Need to See:

  • BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.

  • Proven experience in at least one of the following areas:

      • 5+ years of experience in analyzing and improving the performance of training applications using PyTorch or a similar framework.

      • 5+ years of experience in building distributed software applications.

      • 5+ years of experience in building storage solutions for Deep Learning applications.

      • 10+ years of experience in building automated fault-tolerant distributed applications.

      • 10+ years of experience in building tools for bottleneck analysis and automation of fault tolerance in distributed environments.

  • Strong background in parallel programming and distributed systems

  • Experience analyzing and optimizing large-scale distributed applications.

  • Excellent verbal and written communication skills

Ways To Stand Out From The Crowd:

  • Deep understanding of HPC and distributed system architecture with emphasis on RDMA

  • Hands-on experience in more than one of the above areas, especially with performance analysis and profiling of Deep Learning workloads.

  • Comfortable navigating and working with the PyTorch codebase.

  • Proven understanding of CUDA and GPU architecture

You will also be eligible for equity and .