Expoint - all jobs in one place

NVIDIA Senior System Software Engineer, Distributed Systems - DGX Cloud
United States, Texas 
463073378

31.07.2024

What you’ll be doing:

  • Design and architect a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.

  • Design, develop, test, debug, and optimize creative solutions for datacenter firmware throughout its lifecycle.

  • Work closely with hardware, software, infrastructure, and business teams to transform new firmware features from idea to reality.

  • Define server-level reliability, availability, and serviceability requirements in collaboration with customers such as CSPs, and deliver fault-resilient solutions at scale that meet customer expectations.

  • Collaborate with hardware, software, and firmware teams to drive failure analysis and large-scale solution deployment.

  • Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications.

What we need to see:

  • BS, MS, or PhD in EE/CS or a related field (or equivalent experience) with 6+ years of active development experience using Python as the primary programming language on Linux.

  • Highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.

  • Familiarity with industry standards and specifications such as SPI, I2C, PCIe, UEFI and PLDM.

  • System knowledge of how platform management works, in areas such as BMC-BIOS communication, thermal management, power management, firmware update, device monitoring, and firmware security.

  • Expert-level knowledge of a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

  • Understanding of performance, security, and reliability in complex distributed systems. Familiarity with system-level architecture, data synchronization, fault tolerance, and state management.

Ways to stand out from the crowd:

  • In-depth understanding of how machine check architecture and error flows interact with system firmware and software.

  • Familiarity with Linux server design, x86/ARM system architecture, interconnects like PCI, and other I/O buses.

  • Proven operational excellence in designing and maintaining cloud AI infrastructure. Proficiency in architecting and running large-scale distributed systems, independent of cloud providers.

You will also be eligible for equity and benefits.