Expoint - all jobs in one place

Nvidia Principal Infrastructure SRE - Compute 
United States, Texas 
470960577

31.07.2024

What you will be doing:

  • Lead initiatives to transform IT Compute platform architecture to build new service offerings across On-Prem & Cloud.

  • Define and implement metrics to measure the efficiency of compute platforms & services, and use them to drive improvements.

  • Collect and review system data for capacity planning, analyze capacity data, develop plans to size enterprise-wide systems appropriately, and coordinate with management on implementing changes.

  • Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.

  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.

What we need to see:

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.

  • 12+ years of proven experience in compute platform engineering with a focus on automation.

  • Experience with design and deployment of virtualization architectures, including VMware, OpenShift, or KubeVirt platforms.

  • Proven experience evaluating existing application architectures and identifying opportunities for containerization to improve scalability, reliability, and efficiency.

  • Strong analytical skills with the ability to define and track key performance metrics.

  • Experience developing tools for data analysis and performance profiling, and developing with Terraform and configuration management tools.

  • Proficiency in programming languages such as Go and/or Python.

  • Experience running large environments consisting of bare metal, large-scale virtualized environments with tens of thousands of VMs, and cloud infrastructure.

Ways to stand out from the crowd:

  • Deep understanding of other infrastructure components such as Storage, DNS, AD, Security Tools, etc.

  • Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.

  • Solid understanding of microservices architecture, infrastructure as code (IaC) and configuration management tools.

  • Understanding of AIOps and how to leverage LLMs to automate various optimization initiatives.

You will also be eligible for equity and benefits.