Nvidia Senior Infrastructure Build Systems Engineer 
United States, Texas 
81092509

Today
US, CA, Santa Clara
US, CA, Remote
Time type: Full time
Posted: 5 days ago
job requisition id

What you'll be doing:

  • Build and maintain, from first principles, the infrastructure needed to deliver TensorRT-LLM.

  • Maintain CI/CD pipelines that automate the build, test, and deployment process, and improve them where bottlenecks appear. Manage tooling and automate repetitive manual workflows with GitHub Actions, GitLab, Terraform, etc.

  • Enable scanning for and handling of security CVEs in infrastructure components.

  • Improve the modularity of our build systems using CMake

  • Use AI to help build automated triaging workflows

  • Collaborate extensively with cross-functional teams to integrate pipelines from deep learning frameworks and components, ensuring seamless deployment and inference of deep learning models on our platform.

What we need to see:

  • Master's degree or equivalent experience

  • 3+ years of experience in computer science, computer architecture, or a related field

  • Ability to work in a fast-paced, agile team environment

  • Excellent Bash, Python, and CI/CD skills, along with strong software design skills, including debugging, performance analysis, and test design.

  • Experience with CMake.

  • Background in security best practices for releasing libraries.

  • Experience administering, monitoring, and deploying systems and services on GitHub and cloud platforms, and supporting other technical teams by monitoring the platform's operating efficiency and responding as needs arise.

  • Highly skilled in Kubernetes and Docker/containerd. Automation expert with hands-on skills in frameworks like Ansible and Terraform. Experience with AWS, Azure, or GCP.

Ways to stand out from the crowd:

  • Experience contributing to a large open-source deep learning community: use of GitHub, bug tracking, branching and merging code, OSS licensing issues, handling patches, etc.

  • Experience in defining and leading the DevOps strategy (design patterns, reliability and scaling) for a team or organization.

  • Experience driving efficiencies in software architecture, creating metrics, implementing infrastructure as code and other automation improvements.

  • Deep understanding of test automation infrastructure, frameworks, and test analysis.

  • Excellent problem-solving abilities spanning multiple areas of software (storage systems, kernels, and containers), as well as experience collaborating within an agile team to prioritize deep learning-specific features and capabilities within Triton Inference Server and employing advanced troubleshooting and debugging techniques to resolve complex technical issues.

You will also be eligible for equity and benefits.