What you'll be doing:
Performance optimization, analysis, and tuning of LLM, VLM, and GenAI models for DL inference, serving, and deployment in NVIDIA/OSS LLM frameworks.
Scale performance of LLM models across different architectures and types of NVIDIA accelerators.
Scale performance for maximum throughput, minimum latency, and throughput under latency constraints.
Contribute features and code to NVIDIA/OSS LLM frameworks, inference benchmarking frameworks, TensorRT, and Triton.
Collaborate with cross-functional teams across generative AI, automotive, image understanding, and speech understanding to develop innovative solutions.
What we need to see:
Bachelor's, Master's, PhD, or equivalent experience in a relevant field (Computer Engineering, Computer Science, EECS, AI).
At least 12 years of relevant software development experience.
Excellent Python/C/C++ programming, software design, and software engineering skills.
Experience with a DL framework such as PyTorch, JAX, or TensorFlow.
Ways to stand out from the crowd:
Prior experience with an LLM framework or a DL compiler in inference, deployment, algorithms, or implementation.
Prior experience with performance modeling, profiling, debugging, and code optimization of a DL/HPC/high-performance application.
Architectural knowledge of CPUs and GPUs.
GPU programming experience (CUDA or OpenCL).
You will also be eligible for equity and benefits.