Expoint - all jobs in one place

Nvidia Site Reliability Engineer - Metrics 
United States, Texas 
895110666

18.08.2024

In this role, you will collaborate closely with cross-functional teams, including software engineers, data scientists, and operations, to monitor, analyze, and optimize our systems. Your primary responsibility will be to collect, analyze, and present key performance indicators (KPIs) that drive operational excellence and inform strategic decisions.


What you’ll be doing:

  • Develop, test, and deploy data collectors, pipelines, and services to enhance use of our AI/ML and chip development infrastructure

  • Participate in the full life-cycle of tool development, test, and deployment.

  • Work in a diverse team to provide operational and strategic metrics which empower our engineers to develop at the speed of light.

  • Continuously improve our chip development process through better observability

  • Directly contribute to the overall quality and improve time to market for our next generation chips.

What we need to see:

  • Experience in applying data analysis principles and influencing data-driven decisions

  • Experience with turning raw data into actionable reports

  • Hands-on experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open source tools

  • Authoritative-level Python programming experience, including use of API calls

  • Extensive experience with CI/CD pipelines such as Jenkins and/or GitLab

  • Passion for improving the productivity of others

  • Excellent planning and interpersonal skills

  • Flexibility and adaptability when working in a dynamic environment with changing requirements

  • MS (preferred) or BS in Computer Science, Electrical Engineering, or related field or equivalent experience.

  • 5+ years of relevant experience.

Ways to stand out from the crowd:

  • Hands-on experience running GPU-based workloads in a batch computing environment

  • Passion for gathering and visualizing metrics and data

  • Experience with chip design workflows, such as front end verification, back end workflows, or mixed signal workflows

  • Experience with job schedulers (in particular IBM Spectrum LSF and/or SLURM)

  • Mastery of distributed system principles

You will also be eligible for equity.