Expoint - all jobs in one place

Nvidia Senior Site Reliability Engineer - Storage 
United States, California 
168482653

31.07.2024

What You'll Be Doing

  • Design and implement on-prem HPC infrastructure, supplemented with cloud computing, to support the growing IT needs of Nvidia.

  • Design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.

  • Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.

  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

What we need to see:

  • BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years of experience, or Ph.D. with 3 years of experience.

  • 8+ years of experience crafting technology solutions and resolving performance bottlenecks for HPC applications.

  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.

  • Experience designing, deploying, and managing enterprise NAS solutions such as NetApp and Pure Storage, as well as S3 storage.

  • Python/Bash/Golang programming/scripting experience.

  • Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).

  • Excellent communication and collaboration skills.

Ways To Stand Out From The Crowd:

  • Background with RDMA (InfiniBand or RoCE) fabrics.

  • Experience with multiple monitoring stacks such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, Zabbix, etc. Familiarity with newer and emerging monitoring products.

  • Prior experience with HPC cluster management tools such as Slurm, PBS, LSF, etc.

  • Experience with containerization technologies, such as Docker, Mesosphere DCOS, Kubernetes (k8s).

You will also be eligible for equity and benefits.