Expoint – all jobs in one place

NVIDIA System Software Engineer, Platform Compute
United States, Texas 
488408580

Yesterday
US, CA, Santa Clara
US, CA, Remote
US, DC, Remote
Time type: Full time
Posted: 8 days ago

What you’ll be doing:

  • Building systems to support the maintenance, scaling, and operation of diverse, global compute platforms spanning multiple cloud providers.

  • Driving continuous cost optimization for compute resources, focusing on efficiency and expenditure management.

  • Designing and implementing flexible solutions to ensure adequate compute capacity and resource availability, support diverse workload requirements and new compute initiatives, and meet fluctuating demands.

  • Building, maintaining, and optimizing orchestration functions by mapping workload requirements to cloud provider capabilities, implementing workers, and refining job queue and scaling systems.

  • Managing and maintaining artifacts to establish a consistent baseline compute capability across all supported cloud providers and regions.

What we need to see:

  • Bachelor’s degree in Computer Science, a related technical field, or equivalent experience.

  • 8+ years of DevOps experience optimizing, deploying, and running heterogeneous containerized applications (Docker, Kubernetes) across trust boundaries, on AWS, Azure, and GCP, including hands-on work with EKS, AKS, and GKE.

  • Practical experience building scalable, reliable services and distributed system integration topologies.

  • Hands-on experience maintaining AWS security groups, roles, IAM, and role delegation.

  • Proficiency in Python and Linux shell scripting for automation, application development, system administration, and problem resolution.

  • Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.

  • Demonstrated ability as a meticulous problem-solver with strong analytical skills, capable of rapidly diagnosing and resolving complex technical challenges.

  • Excellent communication, teamwork, and collaboration skills, with an ability to articulate technical concepts clearly to diverse audiences and lead technical responses during incidents.

Ways to stand out from the crowd:

  • Proven experience with event-driven architectures using pub/sub patterns (e.g., AWS SNS/SQS, Google Pub/Sub, Azure Service Bus).

  • Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as RAG and vector databases.

  • Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT); production experience with NVIDIA NIM is a strong plus.

  • Experience building and running CI/CD pipelines (Jenkins, GitLab CI) and applying SRE principles to automate operations, enhance reliability, and improve performance.

  • Familiarity with Python-based Learning Management Systems (LMS) such as Open edX as well as practical experience with highly heterogeneous compute deployments.

You will also be eligible for equity and benefits.