Expoint - all jobs in one place

Nvidia Senior Observability Engineer, AI/HPC
United States, California 
600639138

14.04.2025
US, CA, Santa Clara
Time type: Full time
Posted on: 30+ days ago
job requisition id

What You’ll Be Doing:

  • Collaborate with AI, HW, SW engineering and research teams to deliver observability solutions that meet their needs in AI/HPC clusters.

  • Develop, test, and deploy data collectors, pipelines, visualization and retrieval services.

  • Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.

  • Work in a diverse team to provide operational and strategic data to empower our engineers and researchers to improve performance, productivity, and efficiency.

  • Continuously improve quality, workloads, and processes through better observability.

What We Need to See:

  • Experience developing large scale, distributed observability systems.

  • Ability to collaborate with data scientists, researchers, and engineering teams to identify high value data for collection and analysis.

  • Experience turning raw data into actionable reports.

  • Experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools.

  • Python programming experience and use of API calls.

  • Passion for improving the productivity of others.

  • Excellent planning and interpersonal skills.

  • Flexibility and adaptability when working in a dynamic environment with changing requirements.

  • MS (preferred) or BS in Computer Science, Electrical Engineering, or related field (or equivalent experience)

  • 8+ years of proven experience.

Ways To Stand Out from The Crowd:

  • Background in computer science, machine learning, deep learning, open-source software, infrastructure technologies, and GPU technology.

  • Prior experience in infrastructure software, production application software development, release and support methodology, and DevOps.

  • Experience in the management of datacenters and large-scale distributed computing.

  • Experience working with AI researchers and/or EDA developers.

  • Consistent track record of driving process improvements and measuring efficiency, a passion for sharing knowledge, and experience driving complex projects end-to-end.

You will also be eligible for equity and .