Nvidia Senior Engineer - AI HPC Observability 
United States, Texas 
678223973

Today
US, CA, Santa Clara
US, TX, Austin
US, WA, Seattle
Time type: Full time
Posted: 6 days ago

What You Will Be Doing:

  • Design and implement full-stack observability systems covering metrics, logs, traces, and events for GPU-powered AI and HPC workloads.

  • Build large-scale telemetry data pipelines leveraging OpenTelemetry, Kafka, Prometheus, and other distributed systems to ingest, process, and analyze massive data streams.

  • Develop analytics and anomaly detection frameworks to enable real-time visibility, performance optimization, and predictive insights across multi-tenant environments.

  • Architect and tune high-throughput data stores (e.g., TSDBs, columnar databases, OLAP systems) for large-scale observability data.

  • Drive self-service analytics capabilities through APIs, dashboards, and recommendation engines that empower developers and operators with actionable insights.

  • Collaborate with AI platform, GPU, and cloud infrastructure teams to optimize observability for model training, inference workloads, and HPC performance.

  • Leverage machine learning and statistical techniques for correlation, anomaly detection, and intelligent alerting.

  • Contribute to performance tuning, scalability, and reliability of observability services across on-prem and cloud environments.


What We Need To See:

  • BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.

  • 8+ years of experience in large-scale observability, data engineering, or performance monitoring systems.

  • Proven expertise in building and scaling observability stacks (metrics, logs, traces, events) using OpenTelemetry, Prometheus, Grafana, or Thanos.

  • Deep understanding of data collection, transformation, and storage at scale; experience with streaming frameworks (Kafka, Flink, Spark) preferred.

  • Hands-on experience with Python, Go, and/or Java for backend development and automation.

  • Strong knowledge of API design, data modeling, SQL/NoSQL, and data pipeline architecture.

  • Experience working with PromQL, time-series databases, and large-scale monitoring systems.

  • Familiarity with AI/ML pipelines, GPU-based workloads, and HPC environments.

  • Experience with anomaly detection, log analytics, and recommendation systems using ML or statistical techniques.

  • Excellent problem-solving, debugging, and performance-tuning skills in distributed systems.

Ways To Stand Out from The Crowd:

  • Proven experience designing and scaling full-stack observability platforms for large-scale AI, GPU, or HPC environments.

  • Hands-on expertise with OpenTelemetry, Prometheus, Kafka, and distributed data pipelines handling high-volume telemetry streams.

  • Strong background in data engineering, performance tuning, and time-series data modeling for real-time analytics.

  • Demonstrated use of machine learning or statistical techniques for anomaly detection, correlation, or intelligent alerting.

  • Deep understanding of API design, self-service observability, and building platforms that empower internal developers and operators.

You will also be eligible for equity and .