
NVIDIA Principal Site Reliability Engineer - Enterprise AI Platform
United States, California 
306170782

Location: US, CA, Santa Clara
Time type: Full time
Posted: 2 days ago

What you will be doing:

  • Collaborate on translating business objectives into actionable plans.

  • Address operational challenges, automate processes, and iterate for efficiency.

  • Tackle systemic reliability issues with multi-functional teams.

  • Monitor, optimize, and manage system performance and resources.

  • Institute validated practices for reliability, remediations, and troubleshooting.

  • Design, deploy, and automate production support, documenting essential knowledge.

  • Navigate intricate tasks with a deep understanding of SRE principles.

  • Lead cross-organizational projects from inception to completion.

  • Mentor and train junior engineers for professional development.

  • Serve as a subject matter expert in core team functions.

What we need to see:

  • 15+ years of working experience in cloud, platform, or SRE roles.

  • A Bachelor's or Master's degree in Engineering, Computer Science, or a related field, or equivalent experience.

  • Proficient in one or more programming languages: Python, Go, Perl, or Ruby.

  • Hands-on experience operating and scaling distributed systems around the clock (24x7x365) in public, private, or hybrid cloud and on-prem environments.

  • Has delivered software with a full understanding of deploying applications in Kubernetes clusters, including GPU and CPU pod scheduling (with the ability to understand on-prem environments).

  • Has maintained and managed microservices relating to AI platforms (Inference, Training, Evaluation, Ingestion).

  • Hands-on experience in deploying, supporting, and supervising new and existing services, platforms, and application stacks.

  • Experience with CI/CD systems such as Jenkins, GitHub Actions, etc.

  • Background with Infrastructure as Code (IaC) methodologies and relevant tools.

  • Extensive experience working with MS Windows Server and/or Linux operating systems.

  • Solid communication skills, demonstrating the ability to comprehend and articulate technical issues to a non-technical audience.

Ways to stand out from the crowd:

  • Cloud expertise in Azure and AWS.

  • Passionate about and experienced in AI methodologies.

  • Strong background in software design and development.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

You will also be eligible for equity and .