
Nvidia Senior System Software Engineer - Performance 
United States, Texas 
167452574

28.07.2025
US, CA, Santa Clara
US, WA, Remote
US, CA, Remote
Time type: Full time
Posted on: 2 days ago

We are looking for an outstanding engineer for a System Performance Engineer role focused on at-scale AI system performance and datacenter applications. Be a key contributor to the most exciting computing hardware and software, and help drive the latest breakthroughs in artificial intelligence and GPU computing! You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest Accelerated Computing and Deep Learning software and hardware platforms, and with many researchers, developers, and customers, to craft improved workflows and develop new, leading, differentiated solutions. You will interact with HPC, OS, CPU and GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What you'll be doing:

  • Provide engineering solutions to enable deployment of world-class GPU computing products at scale, lead technical relationships with engineering teams, and assist system administrators, software and hardware engineers, and machine learning/deep learning engineers in building creative solutions.

  • Lead aspects of performance analysis and scalable practices to support large-scale infrastructure, and deliver powerful tools, methodologies, and workflows to validate expectations.

  • Build engineering solutions that provide continuous insight into the performance of AI workloads across evolving environments, quickly surfacing improvements and regressions over time.

  • Decompose multi-faceted issues into minimal reproduction cases, working toward the root cause of the underlying problems.

  • Engage with multiple team members to develop best practices for understanding trends in test results and for presenting data clearly to drive data-driven actions.

What we need to see:

  • 5+ years of experience running multinode workloads, identifying bottlenecks, and implementing improvements.

  • Proven understanding of high-performance computing architectures, GPU-accelerated computing software stacks, and DL frameworks (CUDA, PyTorch).

  • Experience with CPU architectures.

  • Experience with C/C++/Python/Bash programming/scripting.

  • Strong teamwork and communication skills.

  • Ability to multitask in a dynamic environment.

  • Action-driven, with strong analytical skills.

  • BS in Engineering, Mathematics, Physics, or Computer Science (or equivalent experience); MS or PhD desirable.

Ways to Stand Out From the Crowd:

  • Experience tuning memory, storage, and networking settings for performance on Linux systems.

  • Knowledge of modern Cloud and container-based architectures.

  • Hands-on experience deploying and debugging systems with NVIDIA NVLink and InfiniBand.

  • Experience with multiple monitoring stacks such as Prometheus + Grafana, Elasticsearch + Kibana, Splunk, Zabbix, etc.

  • Demonstrated work with Open-Source software: building, debugging, patching and contributing code.

You will also be eligible for equity and benefits.