Nvidia Senior Engineer - AI HPC Observability
United States, Texas
678223973

27.10.2025

US, CA, Santa Clara

US, TX, Austin

US, WA, Seattle

What You Will Be Doing:

Design and implement full-stack observability systems covering metrics, logs, traces, and events for GPU-powered AI and HPC workloads.
Build large-scale telemetry data pipelines leveraging OpenTelemetry, Kafka, Prometheus, and other distributed systems to ingest, process, and analyze massive data streams.
Develop analytics and anomaly detection frameworks to enable real-time visibility, performance optimization, and predictive insights across multi-tenant environments.
Architect and tune high-throughput data stores (e.g., TSDBs, columnar databases, OLAP systems) for large-scale observability data.
Drive self-service analytics capabilities through APIs, dashboards, and recommendation engines that empower developers and operators with actionable insights.
Collaborate with AI platform, GPU, and cloud infrastructure teams to optimize observability for model training, inference workloads, and HPC performance.
Leverage machine learning and statistical techniques for correlation, anomaly detection, and intelligent alerting.
Contribute to performance tuning, scalability, and reliability of observability services across on-prem and cloud environments.

What We Need To See:

BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.
8+ years of experience in large-scale observability, data engineering, or performance monitoring systems.
Proven expertise in building and scaling observability stacks (metrics, logs, traces, events) using OpenTelemetry, Prometheus, Grafana, or Thanos.
Deep understanding of data collection, transformation, and storage at scale; experience with streaming frameworks (Kafka, Flink, Spark) preferred.
Hands-on experience with Python, Go, and/or Java for backend development and automation.
Strong knowledge of API design, data modeling, SQL/NoSQL, and data pipeline architecture.
Experience working with PromQL, time-series databases, and large-scale monitoring systems.
Familiarity with AI/ML pipelines, GPU-based workloads, and HPC environments.
Experience with anomaly detection, log analytics, and recommendation systems using ML or statistical techniques.
Excellent problem-solving, debugging, and performance-tuning skills in distributed systems.

Ways To Stand Out from The Crowd:

Proven experience designing and scaling full-stack observability platforms for large-scale AI, GPU, or HPC environments.
Hands-on expertise with OpenTelemetry, Prometheus, Kafka, and distributed data pipelines handling high-volume telemetry streams.
Strong background in data engineering, performance tuning, and time-series data modeling for real-time analytics.
Demonstrated use of machine learning or statistical techniques for anomaly detection, correlation, or intelligent alerting.
Deep understanding of API design, self-service observability, and building platforms that empower internal developers and operators.

You will also be eligible for equity and .
