Expoint - all jobs in one place

NVIDIA Reliability, Availability, Serviceability (RAS) Expert
United States, Texas 
767878326

24.06.2024

What you’ll be doing:

  • Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features, ensuring good test coverage and successful production ramp-ups.

  • Own AI system RAS/Resilience modeling, benchmarking, and risk assessment.

  • Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.

  • Drive end-to-end RAS efforts across the chip, board, and system levels to reduce FIT (failures in time) rates.

  • Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.

  • Work closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.

  • Be ready to be challenged: assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.

  • Develop a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

  • BS or higher in EE, CE, CS, Mathematics, or equivalent experience.

  • 12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features for large compute, AI, or HPC systems.

  • Proficient in compute system RAS/Resilience model theory and methodology.

  • Proficient in HPC or AI system architecture and cluster interconnect technologies.

  • Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.

  • Strong problem-solving and troubleshooting expertise, with experience institutionalizing root-cause analysis.

  • Initiative, strong interpersonal skills, and the flexibility to adapt to new technologies.

  • Solid knowledge of or experience in HPC or MLPerf benchmarking is a plus.

You will also be eligible for equity and .