Expoint - all jobs in one place

Nvidia Senior Site Reliability Engineer - Storage 
United States, California 
71792019

01.09.2024
What you'll be doing:
  • Design and implement an on-prem HPC infrastructure, supplemented with cloud computing, to support the growing IT needs of Nvidia.

  • Design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.

  • Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.

  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

What we need to see:
  • BS (or equivalent experience) in Computer Science with 8+ years of relevant experience, MS with 5+ years of experience, or Ph.D. with 3 years of experience.

  • 8+ years of experience crafting technology solutions and resolving performance bottlenecks for HPC applications.

  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.

  • Experience with the design, deployment, and management of enterprise NAS solutions such as NetApp or Pure Storage.

  • Experience designing and managing large-scale on-prem object storage clusters.

  • Python/Golang programming/scripting experience is a must.

  • Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).

  • Excellent communication and collaboration skills.

Ways to stand out from the crowd:
  • Background with RDMA (InfiniBand or RoCE) fabrics.

  • Experience with multiple monitoring stacks such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, Zabbix, etc. Familiarity with newer and emerging monitoring products.

  • Prior experience with HPC cluster management tools such as Slurm, PBS, LSF, etc.

  • Experience with containerization technologies such as Docker, Mesosphere DC/OS, and Kubernetes (k8s).

You will also be eligible for equity and .