Expoint – all jobs in one place

Nvidia Senior Server RAS Engineer 
India, Karnataka, Bengaluru 
413453498

10.11.2025
Location: India, Bengaluru
Time type: Full time
Posted: 5 days ago

a dedicated and experienced RAS (Reliability, Availability, and Serviceability) Senior Engineer. You will be responsible for

What you will be doing:

  • Design, architect, and deliver server-level RAS for NVIDIA’s data center products.

  • Define RAS requirements that ensure compliance with industry standards and customer expectations for scale-out environments.

  • Develop fault detection, isolation, and recovery mechanisms to ensure system resilience and minimize downtime.

  • Evaluate and select appropriate technologies and components to optimize reliability, availability, and serviceability, considering factors such as mean time between failures (MTBF), mean time to repair (MTTR), and total cost of ownership (TCO).

  • Collaborate with customers, vendors, and suppliers to assess and integrate their RAS-related solutions into the overall system architecture.

  • Conduct system- and cluster-level simulations, analysis, and testing to validate and verify the effectiveness of the RAS architecture and its components.

  • Stay up to date with the latest advancements in RAS techniques, fault tolerance mechanisms, and industry trends to guide future system designs.

  • Work with NVIDIA partners on RAS-related architecture and discussions to improve their use of NVIDIA products.

  • Work on all phases of product development, from product definition, architecture, and design through implementation, debugging, testing, and early customer support.

What we need to see:

  • BS, MS, or PhD in EE/CS or a related field, or equivalent experience, with 10+ years of demonstrated experience.

  • Strong Python programming skills in a Linux operating environment, a strong understanding of Linux kernel internals, and strong code review skills.

  • Extensive knowledge of system-level architecture invention, reliability engineering, and fault tolerance mechanisms, as well as optimizing RAS architectures for complex computing systems, data centers, or critical applications.

  • Proficiency in scale-out architectures; hands-on experience is a plus.

  • Proficiency in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis).

  • Excellent problem-solving skills, attention to detail, and the ability to analyze complex system-level issues.

  • Excellent written and oral communication skills, an excellent work ethic, a deep sense of collaboration, a love of producing quality work, and a commitment to finishing your tasks every single day.

  • You are a self-starter who loves to find creative solutions to complicated problems.


Ways to stand out from the crowd:

  • Consistent track record of doing RAS at the platform level.

  • In-depth understanding of the interaction of machine check architecture and error flows with system firmware/software.

  • Hands-on experience with x86 or ARM system architecture.
