Nvidia DGX Cloud Performance Engineer 
United States, California 
448315536

18.08.2024

What you will be doing:

  • Develop benchmarks and end-to-end customer applications that run at scale and are instrumented for performance measurement, tracking, and sampling, in order to measure and optimize the performance of meaningful applications and services

  • Construct carefully designed experiments to analyze and study performance bottlenecks and dependencies from an end-to-end perspective, and develop critical insights into them

  • Develop ideas on how to improve end-to-end system performance and usability by leading changes in the hardware, the software, or both

  • Collaborate with external CSPs during the full life cycle of cluster deployment and workload optimization to understand and drive standard methodologies

  • Collaborate with AI researchers, developers, and application service providers to understand their difficulties and requirements, project future needs, and share best practices

  • Work with a diverse set of LLM workloads and their application areas, such as health care, climate modeling, pharmaceuticals, financial futures, and genomics/drug discovery, among others

  • Develop the vital modeling framework and the TCO analysis to enable efficient exploration and sweeps of the architecture and design space

  • Develop the methodology needed to drive the engineering analysis to advise the architecture, design and roadmap of DGX Cloud

What we need to see:

  • 7+ years of proven experience

  • Ability to work with large-scale parallel and distributed accelerator-based systems

  • Expertise in optimizing performance and AI workloads on large-scale systems

  • Experience with performance modeling and benchmarking at scale

  • Strong background in Computer Architecture, Networking, Storage systems, Accelerators

  • Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM, among others)

  • Experience with AI/ML models and workloads, in particular LLMs

  • Understanding of DNNs and their use in emerging AI/ML applications and services

  • Bachelor's or Master's degree in Engineering (preferably Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience

  • Proficiency in Python, C/C++

  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, …)

Ways to stand out from the crowd:

  • Very high intellectual curiosity; confidence to dig in as needed; not afraid of confronting complexity; able to pick up new areas quickly

  • Proficiency in CUDA, XLA

  • Excellent interpersonal skills

  • PhD nice to have

You will also be eligible for equity and benefits.