
Nvidia Senior Full-Stack Software Engineer 
United States, Texas 
256977468

Today
US, CA, Santa Clara
US, WA, Seattle
Time type: Full time
Posted: 4 Days Ago

What You’ll Be Doing:

  • Design, develop, and deploy full-stack web applications to support large-scale AI infrastructure operations and workflows

  • Collaborate with AI and ML research teams to identify pain points and deliver tools that accelerate their work

  • Develop APIs, backend services, and UIs to improve visibility, observability, and control over large-scale GPU clusters

  • Develop backend services to manage job schedulers and cluster operations

  • Define and track metrics that measure efficiency, resiliency, and developer productivity across the platform

  • Drive engineering excellence in testing, CI/CD, code quality, and performance

  • Lead architectural discussions and mentor junior engineers on design and implementation

  • Stay ahead of AI/ML infrastructure trends and drive adoption of best practices within the team

What We Need To See:

  • 8+ years of experience developing software infrastructure for large-scale AI systems.

  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

  • Proficiency with full-stack development: JavaScript (Vue or React), Node.js, Python and/or Golang, and scripting languages

  • Experience with distributed systems and cloud-native technologies (Docker, Kubernetes, microservices)

  • Familiarity with observability stacks: ELK, OpenSearch, Prometheus, Grafana, or Loki

  • Strong debugging and root cause analysis skills across application and infrastructure layers

  • Experience with large-scale AI training, inference, or data infrastructure services

  • Excellent communication, collaboration, and problem-solving skills, and a growth mindset

Ways to Stand Out From the Crowd:

  • Experience building developer platforms or self-service internal infrastructure tools for efficiency, resiliency, or observability.

  • Hands-on experience as a Machine Learning Engineer (MLE) or deep familiarity with DL frameworks (e.g., PyTorch, TensorFlow, JAX, Ray).

  • Hands-on experience operating at datacenter scale, including GPU cluster debugging and root cause analysis.

  • Experience with MongoDB, Hadoop, or Spark.

You will also be eligible for equity and benefits.