What you will be doing:
Lead the design, development, and deployment of the observability platform, including metrics, logs, traces, events, alerts, dashboards, and visualizations.
Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.
Establish and implement observability standards, guidelines, and processes across NVIDIA.
Research, evaluate, and adopt new observability technologies and frameworks that can enhance user experience.
Provide peer reviews to other engineers, including feedback on performance, scalability, security, and correctness.
Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.
Handle large volumes of data and ensure data quality, security, and compliance.
Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.
Find opportunities to automate remediation of commonly occurring issues to operate systems reliably and efficiently.
What we need to see:
Bachelor’s degree in Computer Science and Engineering or a related field, or equivalent experience.
15+ years of experience in product development and full stack engineering, with 5+ years of experience in developing and operating observability platforms and solutions, preferably in a cloud-native environment.
Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Thanos, Cortex, Loki, Grafana, Alertmanager, ELK/Elastic Stack, Datadog, and OpenTelemetry.
Hands-on knowledge of AIOps tools such as BigPanda, PagerDuty, and Elastic Stack.
Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as with streaming services such as NATS and Kafka to ingest billions of events.
Proficiency in one or more programming languages, such as Go, Python, Java, or C#.
Passionate about observability and delivering high-quality internal platforms.
Demonstrated experience and expertise in using machine learning and generative AI to develop solutions such as predictive monitoring, incident diagnosis, summarization, and chatbots.
Experience developing observability solutions to monitor on-prem and public cloud environments.
Experience developing a unified cloud observability platform to monitor network, compute, storage, operating systems, security, applications, and SaaS platforms.
Understanding of how to implement observability solutions for large-scale on-prem infrastructure and networking.
You will also be eligible for equity.