

a dedicated and experienced RAS (Reliability, Availability, and Serviceability) Senior Engineer. You will be responsible for
What you will be doing:
Design, architect, and deliver server-level RAS for NVIDIA’s data center products.
Define RAS requirements that ensure compliance with industry standards and customer expectations for scale-out environments.
Develop fault detection, isolation, and recovery mechanisms to ensure system resilience and minimize downtime.
Evaluate and select appropriate technologies and components to optimize reliability, availability, and serviceability, considering factors such as mean time between failures (MTBF), mean time to repair (MTTR), and total cost of ownership (TCO).
Collaborate with customers, vendors and suppliers to assess and integrate their RAS-related solutions into the overall system architecture.
Conduct system and cluster level simulations, analysis, and testing to validate and verify the effectiveness of the RAS architecture and its components.
Stay up to date with the latest advancements in RAS techniques, fault tolerance mechanisms, and industry trends to guide future system designs.
Work with NVIDIA partners on RAS-related architecture and discussions to improve their use of NVIDIA products.
Work on all phases of product development, from product definition, architecture, and design, through implementation, debugging, testing, and early customer support.
What we need to see:
BS, MS, or PhD (or equivalent experience) in EE/CS or a related field, with 10+ years of demonstrated experience.
Strong Python programming skills in a Linux operating environment, strong understanding of Linux kernel internals, and strong code review skills.
Extensive knowledge of system-level architecture invention, reliability engineering, and fault tolerance mechanisms, as well as optimizing RAS architectures for complex computing systems, data centers, or critical applications.
Proficiency in scale-out architectures; hands-on experience is a plus.
Proficiency in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis).
Excellent problem-solving skills, attention to detail, and the ability to analyze complex system-level issues.
Excellent written and oral communication skills, an excellent work ethic, a deep sense of collaboration, a love of producing quality work, and a commitment to finishing your tasks every single day.
You are a self-starter who loves to find creative solutions to complicated problems.
Ways to stand out from the crowd:
Consistent track record of doing RAS at the platform level.
In-depth understanding of the interaction of machine check architecture and error flows with system firmware/software.
Hands-on experience with x86 or ARM system architecture.