What you'll be doing:
As part of the service team, design and build platforms for DGX Cloud services
Figure out how to take the best of HPC and Kubernetes and help us build a unified platform
Work within a team of software engineers and product people, as well as with engineering teams across NVIDIA, on DGX Cloud AI Compute services
Write IaC code, work on Kubernetes, and help the team design and implement release pipelines
Collaborate with the team to make the best use of GitOps and pipelines
What we need to see:
BS in Computer Science, Information Systems, Computer Engineering or equivalent experience
Solid technical foundation in distributed computing and storage, including substantial experience with all of the following: server systems, storage, I/O, networking, and system software
12+ years of platform engineering experience on large-scale production systems
Hands-on engineering expertise with Kubernetes and IaC
Ability to understand and communicate complex designs, distributed infrastructure, and requirements to peers, customers, and vendors
General knowledge of shared storage systems such as NFS, LustreFS, GlusterFS, etc.
Familiarity with system-level architecture, such as interconnects, memory hierarchy, interrupts, and memory-mapped I/O
Ways to stand out from the crowd:
Proven experience in high-performance computing, deep learning, and/or GPU-accelerated computing domains
Experience with large-scale distributed systems, HPC, and ML training using Slurm and Kubernetes
Deep knowledge of both software and hardware in HPC and ML infrastructure
You will also be eligible for equity.