

What you’ll be doing:
Develop and execute NVIDIA HGX/DGX platform test plans covering the OS, FW, and CUDA SW stack, based on design documents.
Install and test various operating systems, system firmware, and software stacks, including Windows and Linux.
Drive root cause analysis of reliability and validation test failures to identify root cause(s) and achieve mitigation.
Leverage AI (language model) skills to build front-end and back-end automation frameworks that can interact with humans.
Review partner and supplier test results and prescribe additional reliability testing on components, systems, and packaging as needed.
What we need to see:
Experience using AI development tools for test plan creation, test case development, and test case automation.
5+ years of hands-on experience in test development, software development, automation, or software engineering.
Strong OS (Ubuntu, RedHat, CentOS, SuSE, Fedora, Windows, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware environments.
Ability to write test plans focusing on functional, performance, stress and negative testing.
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
Ways to stand out from the crowd:
Experience working with NVIDIA GPU hardware and rack-level operations is a strong plus.
Experience implementing error handling for x86/ARM-based servers and online and offline health monitoring tools.
Experience developing in x86/ARM-based environments.
Background in parallel programming, ideally CUDA/OpenCL, is a plus.
Strong experience with FW, BMC/OpenBMC, SBIOS, network protocols, enterprise storage devices, and Redfish.