
Nvidia Senior Site Reliability Engineer - Storage 
United States, California 
372208141

Yesterday
US, CA, Santa Clara
Time type: Full time
Posted: 19 days ago

What You'll Be Doing:

  • Design and implement an on-prem HPC infrastructure, supplemented with cloud computing, to support the growing IT needs of NVIDIA.

  • Design and implement advanced storage solutions, such as high-performance NFS, S3-compatible object storage, and distributed storage systems.

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.

  • Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.

  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

What We Need To See:

  • BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years of experience, or Ph.D. with 3 years of experience.

  • Deep experience with storage protocols such as NFS, NVMe/TCP, S3, and Lustre (LNet).

  • Experience with containerization technologies like Kubernetes and their integration with storage solutions.

  • Proficiency in one or more programming languages (Python, Go) is a must.

  • Experience working with monitoring and configuration management tools such as Chef, Ansible, Puppet, SaltStack, etc.

  • Background with cloud infrastructure (AWS, Azure, or Google Cloud).

  • Experience with multiple monitoring stacks such as Prometheus + Grafana and Elasticsearch + Kibana.

  • Excellent communication and collaboration skills.

Ways To Stand Out From The Crowd:

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.

  • Experience with RDMA (InfiniBand or RoCE) fabrics.

  • Background with HPC cluster management tools such as Slurm, PBS, LSF, etc.

  • Passion for and experience in AI methodologies.

You will also be eligible for equity.