Nvidia Principal AI Infrastructure SRE Engineer 
United States, California 
105392985

US, CA, Santa Clara
Time type: Full time
Posted on: Posted 7 Days Ago
Job requisition ID:

What you will be doing:

  • Lead initiatives to transform on-prem IT infrastructure platform architecture and services for modern AI workloads and for AI semiconductor and software development.

  • Collaborate with partners to design, build, and operate platforms that transform storage, compute, and middleware with modern security paradigms.

  • Build software and automation to run infrastructure at scale with minimal human intervention. Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.

  • Collect and review system data for capacity planning, analyze capacity data and develop plans for enterprise-wide systems at the appropriate level, and coordinate with management in implementing changes.

  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.

What we need to see:

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.

  • 15+ years of proven experience in compute platform engineering with a focus on automation.

  • Experience with the design, deployment, and operation of infrastructure that supports AI and software development at scale, including Kubernetes and the integration of modern AI data infrastructure platforms into Kubernetes workloads.

  • Proven experience integrating existing application architectures, building new ones, and identifying opportunities for containerization to improve scalability, reliability, and efficiency.

  • Proficiency in programming languages such as Go and/or Python. Experience developing tools for data analysis and performance profiling, and developing with Terraform and configuration management tools.

  • Experience designing and running large environments consisting of bare-metal servers and virtualized environments, with a mix of tens of thousands of VMs and cloud infrastructure or AI infrastructure.

  • Deep understanding of other infrastructure components such as storage, DNS, AD, and security tools.

Ways to stand out from the crowd:

  • Solid understanding of microservices architecture, infrastructure as code (IaC) and configuration management tools.

  • Understanding of AIOps and how to leverage LLMs to automate various optimization initiatives.

You will also be eligible for equity and benefits.