NVIDIA System Software Engineer, Event Operations
United States, Texas 
840761298

Yesterday
US, CA, Santa Clara
US, CA, Remote
US, DC, Remote
Time type: Full time
Posted: 6 days ago

What you’ll be doing:

  • Develop comprehensive operational plans and de-risking strategies to ensure flawless execution of technical training events.

  • Provide expert, hands-on technical leadership during live training events, managing deployments and rapidly resolving emergent issues for an optimal user experience.

  • Oversee the stability, scalability, and reliability of the NVIDIA Deep Learning Institute (DLI) learning platform, applying SRE principles and leading incident response.

  • Lead cross-functional coordination, establish and enforce operational best practices, and drive continuous improvement initiatives to enhance platform services.

What we need to see:

  • Bachelor’s degree in Computer Science or a related technical field, or equivalent experience.

  • Over 6 years of DevOps experience optimizing, deploying and running containerized applications (Docker, Kubernetes) across AWS, Azure, and GCP, including hands-on work with EKS, AKS, and GKE.

  • Proficient in Python and Linux shell scripting for automation, application development, system administration, and problem resolution.

  • Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.

  • Demonstrated ability as a meticulous problem-solver with strong analytical skills, capable of diagnosing and resolving complex technical challenges under pressure.

  • Excellent communication, teamwork, and collaboration skills, with an ability to articulate technical concepts clearly to diverse audiences and lead technical responses during incidents.


Ways to stand out from the crowd:

  • Proven experience designing and implementing event-driven architectures using pub/sub patterns with platforms like AWS SNS/SQS, Google Cloud Pub/Sub, or Azure Service Bus.

  • Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as Retrieval Augmented Generation (RAG) and vector databases.

  • Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT) for model development, serving, and optimization. Production experience with NVIDIA NIM is a strong plus.

  • Experience building and running CI/CD pipelines (Jenkins, GitLab CI) and managed software development environments, applying SRE principles to automate operations, enhance reliability, and improve performance.

  • Familiarity with Python-based Learning Management Systems (LMS) such as Open edX.

You will also be eligible for equity and .