Expoint - all jobs in one place

NVIDIA Senior AI Infrastructure Services Software Engineer
United States, Texas 
763991207

07.04.2024

The candidate will work closely with cross-functional teams to design and develop common system software blocks within Kubernetes clusters (e.g., Custom Resource Definitions, Operators, and system plug-ins) to meet the highly challenging and multi-faceted requirements of the NVIDIA Omniverse™ Cloud. These requirements include, but are not limited to, elasticity, multitenancy, high availability, fault tolerance, debuggability, operational efficiency, and sustainability of the cluster-level services needed to onboard and optimize Omniverse applications and workflows at large scale. A key feature of these workflows is the ability to compose one or more high-performance simulation/AI tasks, streaming Kit-based applications of various types, and elastic microservices via Cloud APIs.

What you will be doing:

  • Design and develop low-level system software solutions within Kubernetes to manage and schedule OVX cluster resources in order to power NVIDIA Omniverse™ Cloud (OVC).

  • Design and develop cluster-level system software solutions to map a wide range of Omniverse workloads to high-performance interactive tasks (Kit-based applications), elastic microservices, and simulation/AI tasks.

  • Collaborate with multiple Omniverse product teams to understand customer storage and compute requirements and build supporting infrastructure.

  • Work across organizational boundaries with diverse hardware and software engineers.

  • Proactively identify and address system software challenges in compute, networking, and storage resource utilization that affect OVC’s availability, multi-tenancy, fault tolerance, debuggability, operational efficiency, and sustainability.

What we need to see:

  • 6+ years of hands-on system software engineering experience extending cluster-level services for large-scale Kubernetes

  • 4+ years of experience building large-scale, fault-tolerant distributed services

  • Experience with cloud infrastructure platforms like AWS, Azure, and Google Cloud

  • Strong systems programming skills, including optimizations using multi-threading, asynchronous programming, concurrency and parallelism, caching, and batching

  • Proficiency in Python, C/C++ and Golang

  • Working knowledge of elasticity techniques within Kubernetes

  • Deep understanding of cloud technologies, distributed compute systems, and microservices architecture

  • Master's or PhD in Computer Science or a related field (or equivalent experience)

  • Excellent interpersonal skills and ability to work successfully with multi-functional teams, principals, and architects across organizational boundaries and geographies

Ways to stand out from the crowd:

  • Expert knowledge of virtualization and containerization technologies such as Docker, VMware, and KVM

  • Strong knowledge of elasticity techniques within Kubernetes

  • Experience co-designing high-performance application workflows with the underlying cluster-level software, such as Slurm and/or Kubernetes

You will also be eligible for equity and .