NVIDIA is looking for Senior Cloud
What you'll be doing:
Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
Manage Linux job/workload schedulers and orchestration tools.
Develop and maintain continuous integration and delivery pipelines.
Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
Deploy monitoring solutions for the servers, network and storage.
Perform troubleshooting bottom up from bare metal, operating system, software stack and application level.
Serve as a technical resource: develop, refine, and document standard methodologies to share with internal teams.
Support research and development activities and engage in POCs/POVs for future improvements.
What we need to see:
BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture.
Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments.
Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting.
Experience with job scheduling and workload orchestration technologies such as Slurm, Kubernetes, and Singularity.
Excellent knowledge of Windows and Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS.
Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies is a plus.
Proficiency in Python programming and bash scripting.
Knowledge of CI/CD pipelines for software deployment and automation.
Comfortable with automation and configuration management tools, including Jenkins, Ansible, Puppet/Chef, etc.
Ability to communicate technical concepts and collaborate effectively with Mandarin-speaking customers.
Ways to stand out from the crowd:
Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA (InfiniBand or RoCE) fabrics.

What you'll be doing:
Lead the hands-on analysis, optimization, and performance tuning of complex GPU-accelerated systems and AI workloads, ensuring high availability and efficiency across customer data centers.
Engage with NVIDIA strategic customers to drive AI infrastructure initiatives, support deployment success, and influence long-term platform adoption.
Serve as a senior technical authority on NVIDIA GPU, DPU, and networking technologies, contributing to architecture reviews and guiding infrastructure decisions at scale.
Collaborate with internal Engineering, Product, and Sales teams to align customer deployments with NVIDIA’s technology roadmap and business objectives.
Establish and refine monitoring and optimization methodologies using analytics, telemetry, and automation to detect bottlenecks and improve infrastructure resiliency.
Participate in post-deployment reviews, incident retrospectives, and strategic planning sessions to shape the customer experience and feed insights into NVIDIA’s infrastructure strategy.
Complete and lead complex technical projects from initial design through implementation and continuous improvement, ensuring alignment to SLAs and mitigation of technical risks.
Support business growth by identifying AI infrastructure opportunities in cloud and enterprise environments and driving technical initiatives that showcase NVIDIA’s leadership in this space.
What we need to see:
10+ years of experience in large-scale data center service operations with a focus on infrastructure performance, backed by a Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related field.
Strong analytical, problem-solving, and decision-making skills, capable of identifying root causes, driving continuous improvement, and delivering resilient technical solutions.
Strong communication, time management, and organizational skills, with the ability to lead complex projects, guide technical teams, and deliver against key metrics.
Preferred certifications in data center, server, or networking technologies, and a willingness to travel up to 25% for customer engagements and team collaboration.
Proficiency in system-level technologies, including operating systems, Linux kernel drivers, GPUs, NICs, and hardware architecture.
Demonstrated expertise in cloud orchestration software and job schedulers, including platforms like Kubernetes, Docker Swarm, and HPC-specific schedulers such as Slurm.
Familiarity with cloud-native technologies and their integration with traditional infrastructure is crucial.
Proficiency in both Japanese and English, with the ability to communicate complex technical topics clearly across multicultural teams and with customers.
Ways to stand out from the crowd:
Deep familiarity with AI infrastructure and workflows, including training/inference pipelines, MLOps/DevOps tools, containerization (Docker, Kubernetes), and large-scale system deployments.
Knowledge of data center infrastructure operations, including safety, security, environmental controls, and standard operating procedures.
Proven expertise in scaling complex systems, with deep experience in automation, orchestration, and performance optimization across compute, storage, and networking layers.
Good interpersonal and collaboration skills, with the ability to lead discussions, influence outcomes, and build positive relationships with both internal and external collaborators.

What you'll be doing:
Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
Manage Linux job/workload schedulers and orchestration tools.
Develop and maintain continuous integration and delivery pipelines.
Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
Deploy monitoring solutions for the servers, network and storage.
Perform troubleshooting bottom up from bare metal, operating system, software stack and application level.
Serve as a technical resource: develop, refine, and document standard methodologies to share with internal teams.
Support research and development activities and engage in POCs/POVs for future improvements.
Regional travel is required for on-site visits with customers.
What we need to see:
BS/MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other engineering fields, with at least 8 years of work or research experience in networking fundamentals, the TCP/IP stack, and data center architecture.
Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.
Direct design, implementation and management experience with cloud computing platforms (e.g. AWS, Azure, Google Cloud).
Experience with job scheduling and workload orchestration technologies such as Slurm, Kubernetes, and Singularity.
Hands-on, adaptable problem-solver with a collaborative approach and a strong ability to thrive in fast-paced, dynamic environments, working effectively with cross-functional teams to deliver innovative solutions.
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, including ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS.
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies is a plus.
Python programming and bash scripting experience.
Comfortable with automation and configuration management tools including Jenkins, Ansible, Puppet/Chef, etc.
Deep knowledge of networking protocols such as InfiniBand and Ethernet.
Deep understanding of and experience with virtualization systems (for example, VMware, Hyper-V, KVM, or Citrix).
Strong written, verbal, and listening skills in English are critical.
Ways to stand out from the crowd:
Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA (InfiniBand or RoCE) fabrics.