Expoint – all jobs in one place

NVIDIA Principal Engineer, System Software, Platform Engineering
Vietnam, Thái Nguyên Province, Thái Nguyên 
137731571

26.08.2025
Vietnam, Ho Chi Minh City
Vietnam, Hanoi
Time type: Full time
Posted on: 4 days ago

NVIDIA Vietnam R&D Center is an integral part of NVIDIA's global network of world-class engineers and researchers. To help push the boundaries of accelerated computing, we’re seeking a hands-on technical leader to architect, build, and operate a platform for AI inference and agentic applications. You’ll focus on heterogeneous compute (with a strong GPU emphasis), reliability, security, and developer experience across cloud and hybrid environments.

What you will do:

  • Build and operate the platform for AI: multi-tenant services, identity/policy, configuration, quotas, cost controls, and paved paths for teams.

  • Lead inference platforms at scale, including model-serving routing, autoscaling, rollout safety (canary/A-B), reliability, and end-to-end observability.

  • Operate GPUs in Kubernetes: manage NVIDIA device plugins, GPU Feature Discovery, time-slicing, MPS, and MIG partitioning; implement topology-aware scheduling and bin-packing.

  • Lead GPU lifecycle: driver/firmware/runtime

  • Enable virtualization strategies: vGPU (e.g., on vSphere/KVM), PCIe passthrough, mediated devices, and pool-based GPU sharing; define placement, isolation, and preemption policies.

  • Build secure traffic and networking: API gateways, service mesh, rate limiting, authN/authZ, multi-region routing, and DR/failover.

  • Improve observability and operations through metrics, tracing, and logging for DCGM/GPUs, runbooks, incident response, performance, and cost optimization.

  • Establish platform blueprints: reusable templates, SDKs/CLIs, golden CI/CD pipelines, and infrastructure-as-code standards.

  • Lead through influence: write design docs, conduct reviews, mentor engineers, and shape platform roadmaps aligned to AI product needs.

What we need to see:

  • 15+ years building/operating large-scale distributed systems or platform infrastructure; strong record of shipping production services.

  • Proficiency in one or more of Python/Go/Java/C++; deep understanding of concurrency, networking, and systems design.

  • Containers/orchestration/Kubernetes expertise, cloud networking/storage/IAM, and infrastructure-as-code.

  • Practical GPU platform experience: Kubernetes GPU operations (device plugin, GPU Operator, feature discovery), scheduling/bin-packing, isolation, preemption, utilization tuning.

  • Virtualization background: deploying and operating vGPU, PCIe pass-through, and/or mediated devices in production.

  • SRE or equivalent experience: SLOs/error budgets, incident management, performance tuning, resource management, and financial oversight.

  • Security-first mentality: TLS/mTLS, RBAC, secrets, policy-as-code, and secure multi-tenant architectures.

Ways to stand out from the crowd:

  • Deep GPU ops: MIG partitioning, MPS sharing, NUMA/topology awareness, DCGM telemetry, GPUDirect RDMA/Storage.

  • Inference platform exposure: serving runtimes, caching/batching, autoscaling patterns, continuous delivery (agnostic to specific stacks).

  • Agentic platform exposure: workflow engines, tool orchestration, policy/guardrails for tool access and data boundaries.

  • Traffic/data plane: gRPC/HTTP/Protobuf performance, service mesh, API gateways, CDN/caching, global traffic management.

  • Tooling: Terraform/Helm/GitOps, Prometheus/Grafana/OpenTelemetry, policy engines; bare-metal provisioning experience is a plus.