What you will be doing:
Maintain an up-to-date understanding of the philosophy, architecture, and deployment methods of NVIDIA's evolving Reference Architectures, such as the NVIDIA DGX SuperPOD Reference Architecture, NVIDIA Cloud Partner Reference Architecture, and NVIDIA Enterprise Reference Architecture.
Analyze and understand the requirements of customer-initiated AI training or inference clusters.
Identify the NVIDIA Reference Architecture that best matches customer needs and effectively communicate its value proposition to collaborators.
Facilitate seamless communication between NVIDIA's internal deployment teams and customers during the implementation of AI clusters based on Reference Architectures.
Provide hands-on technical support to developers after the AI Factory has been deployed, ensuring that AI training and inference workloads run effectively on the infrastructure.
What we need to see:
Bachelor’s degree or higher in Computer Science, Computer Engineering, or a related technical field.
Solid understanding of basic principles behind cluster orchestration, such as compute resource provisioning and dynamic prioritization based on user demand.
Minimum of 3 years of hands-on experience operating AI training or inference clusters that leverage Kubernetes with NVIDIA GPUs.
Proficiency in key technologies, including Container Runtime Interface (CRI), Container Network Interface (CNI), Calico, NVIDIA GPU Operator, NVIDIA Network Operator, and Kubeflow Training Operator.
Ways to stand out from the crowd:
Foundational knowledge of and experience with network technologies, such as InfiniBand and Ethernet, in AI cluster environments, including compute fabric interconnects between GPU servers, storage fabric integration, and in-band networks for system administration.
Familiarity with the role of storage in AI training/inference clusters, including hands-on experience with vector databases and leading commercial storage solutions.
Experience integrating MLOps platforms into Kubernetes environments, such as deploying Airflow for orchestrating distributed training workloads.