Nvidia Senior Site Reliability Engineer HPC LSF
United States, Texas
294016948

14.04.2025

US, CA, Santa Clara

US, MA, Westford

US, TX, Austin

time type: Full time

posted on: 4 days ago

job requisition id

What you’ll be doing:

Manage and support workload and resource schedulers in a large-scale HPC environment.
Automate Everything: Develop scripts for deployment, configuration management, and operational monitoring.
Develop solutions for complex computing resource management requirements.
Extract and leverage grid performance metrics for troubleshooting and performance optimization.
Troubleshoot Complex Issues: Diagnose problems at every layer, from bare metal to application level, to ensure system reliability and efficiency.
Develop, define and document standard methodologies to share with internal teams.
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.
Contribute directly to the overall quality of our next-generation chips and improve their time to market.

What we need to see:

Extensive knowledge of job scheduler administration (e.g. IBM Spectrum LSF or SLURM).
Proficient in administering CentOS/RHEL Linux distributions.
In-depth understanding of container technologies like Docker.
Proficiency in UNIX scripting languages and Python.
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
10+ years of experience in a large, distributed Linux environment.
BS in Computer Science, similar degree or equivalent experience.

Ways to stand out from the crowd:

Experience analyzing and tuning performance for a variety of HPC or EDA workloads.
Solid understanding of cluster configuration management tools such as Ansible.
Proficiency in Perl for maintaining legacy automation scripts.
Deep understanding of distributed system principles.
#LI-Hybrid

You will also be eligible for equity.
