Expoint - all jobs in one place

Nvidia Senior Software Engineer Reliability Operational Excellence - DGX Cloud 
United States, Texas 
198951106

01.12.2024

What you’ll be doing:

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure to provide foundations for operational excellence.

  • Design, implement, ship, and maintain essential data pipelines that executive leadership will use to decide on business priorities.

  • Integrate tooling with internal and customer workflows, along with cloud service providers, to streamline the incident management process.

  • Reduce the toil of running an incident, writing a postmortem, running an on-call rotation, etc.

  • Evangelize sustainable, blameless incident prevention and incident response.

  • Consult with peer teams on operations best practices.

What we need to see:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.

  • 5+ years of experience.

  • A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.

  • Experience with infrastructure automation and distributed systems design, developing tools for running large-scale private or public cloud systems in production.

  • Experience in one or more of the following: Python, Go, TypeScript, C/C++, Java.

  • In-depth knowledge of one or more of Linux, Networking, Storage, and Containers.

Ways to stand out from the crowd:

  • Experience building and integrating with incident tooling such as FireHydrant, Rootly, incident.io, or Blameless.

  • Experience building plugins, templates, and entity schemas in Backstage.

  • Background with infrastructure technologies such as Kubernetes, Terraform, Docker, and Helm charts.

  • Experience with basic ML and data science concepts and tooling such as Hive, Apache Beam, Apache Spark, etc.

  • Experience with business analytics tooling such as Looker or Tableau.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

You will also be eligible for equity.