Expoint - all jobs in one place

NVIDIA Principal Infrastructure SRE - Storage
United States, California 
137846522

01.12.2024

What you will be doing:

  • Lead initiatives and guide the development of on-prem and cloud infrastructure to ensure its scalability and performance.

  • Collaborate across teams to drive technical consensus around architecture and technology decisions.

  • Enhance the reliability, scalability, and efficiency of our core systems with a focus on infrastructure automation.

  • Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.

  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.

What we need to see:

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or a related field (or equivalent experience), plus 12+ years of proven experience in infrastructure engineering with a focus on automation.

  • Proficient in Python or Go.

  • Experience with large-scale data storage and compute cluster (HPC) infrastructure, and excellent working knowledge of storage systems (e.g., NFS, Lustre, GPFS, Ceph).

  • Experience with the design and deployment of virtualization architectures, including VMware, OpenShift, or KubeVirt platforms.

  • Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.

  • Experience in designing and implementing highly scalable infrastructure services with well-defined APIs.

  • Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.

Ways to stand out from the crowd:

  • Deep understanding of other infrastructure components such as DNS, LDAP, NIS, and security tools.

  • Experience with HPC cluster management tools such as Slurm, PBS, and LSF.

  • Experience with multiple monitoring stacks such as Prometheus + Grafana, Elasticsearch + Kibana, Splunk, and Zabbix.

  • Familiarity with newer and emerging monitoring tools.

You will also be eligible for equity.