Expoint - all jobs in one place

Nvidia Observability Tech Lead 
United States, California 
929521004

04.04.2024

What you will be doing:

  • Lead the design, development, and deployment of the observability platform, including metrics, logs, traces, events, alerts, dashboards, and visualizations.

  • Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.

  • Establish and implement observability standards, guidelines, and processes across Nvidia.

  • Research, evaluate, and adopt new observability technologies and frameworks that can enhance user experience.

  • Provide peer reviews for other engineers, including feedback on performance, scalability, security, and correctness.

  • Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.

  • Handle large volumes of data and ensure data quality, security, and compliance.

  • Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.

  • Find opportunities to automate remediation of commonly occurring issues to operate systems reliably and efficiently.

What we need to see:

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.

  • 15+ years of experience in product development and full stack engineering, with 5+ years of experience in developing and operating observability platforms and solutions, preferably in a cloud-native environment.

  • Strong knowledge of and experience with observability tools, such as Prometheus, VictoriaMetrics, Thanos, Cortex, Loki, Grafana, Alertmanager, ELK/Elastic Stack, Datadog, OpenTelemetry, etc.

  • Hands-on knowledge of AIOps tools such as BigPanda, PagerDuty, Elastic Stack, etc.

  • Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as experience with streaming services to ingest billions of events using NATS, Kafka, etc.

  • Proficient in one or more programming languages, such as Go, Python, Java, C#, etc.

  • Passionate about observability and delivering high-quality internal platforms.

  • Demonstrated experience and expertise in using machine learning and generative AI to develop solutions such as predictive monitoring, incident diagnosis, summarization, and chatbots.

  • Experience developing observability solutions to monitor on-prem and public cloud environments.

  • Experience developing a unified cloud observability platform to monitor network, compute, storage, operating systems, security, applications, and SaaS platforms.

  • Understanding of how to implement observability solutions for large-scale on-prem infrastructure and networking.

You will also be eligible for equity and .