NVIDIA AI Computing Performance Architect - Perf Analysis/Kernel Dev
China, Shanghai 
197235200

Time type: Full time
Posted on: 5 days ago

performance analysis and development


What you'll be doing:

  • Design, develop, and optimize major LLM layers (e.g., attention, GEMM, inter-GPU communication) for NVIDIA's new architectures.

  • Implement and fine-tune kernels to achieve optimal performance on NVIDIA GPUs.

  • Conduct in-depth performance analysis of GPU kernels, including attention and other critical operations.

  • Identify bottlenecks, optimize resource utilization, and improve throughput and power efficiency.

  • Create and maintain workloads and micro-benchmark suites to evaluate kernel performance across various hardware and software configurations.

  • Generate performance projections, comparisons, and detailed analysis reports for internal and external stakeholders.

  • Collaborate with architecture, software, and product teams to guide the development of next-generation deep learning hardware and software.

What we need to see:

  • MS or PhD in a relevant discipline (CS, EE, Math).

  • 3+ years of industry experience in GPU programming or performance optimization for DL applications.

  • Demonstrated experience in analyzing and improving the performance of GPU kernels, with measurable results (e.g. performance improvements, efficiency gains).

  • Strong programming skills in C, C++, Perl, or Python.

  • Strong background in computer architecture.

  • Excellent communication skills, both written and verbal.

  • Strong organizational and time management abilities, with the ability to prioritize tasks effectively.

Ways to stand out from the crowd:

  • Experience developing or optimizing FMHA or GEMM kernels for LLMs will be a plus.

  • Expertise in CUDA programming for GPU acceleration will be a plus.

  • Expertise in GPU/CPU Core or MemSys architecture modeling will be a plus.