Your Impact
As an AI Infrastructure Abstraction Engineer, you will help shape the next generation of AI compute platforms by designing systems that abstract away hardware complexity and expose logical, scalable, and secure interfaces for AI workloads. Your work will enable multi-tenancy, resource isolation, and dynamic scheduling of GPUs and accelerators at scale, making infrastructure programmable, elastic, and developer-friendly.
You will bridge the gap between raw compute resources and AI/ML frameworks, allowing infrastructure teams and model developers to consume shared GPU resources with the performance and reliability of bare metal, but with the flexibility of cloud-native systems. Your contributions will empower internal and external users to run AI workloads securely, efficiently, and predictably, regardless of the underlying hardware topology.
This role is critical to enabling AI infrastructure that is multi-tenant by design, scalable in practice, and abstracted for portability across diverse platforms.
Key Responsibilities:
- Design and implement infrastructure abstractions that cleanly separate logical compute units (vGPUs, GPU pods, AI queues) from physical hardware (nodes, devices, interconnects).
- Develop runtime services, APIs, and control planes to expose GPU and accelerator resources to users and frameworks with multi-tenant isolation and QoS guarantees.
- Architect systems for secure GPU sharing, including time-slicing, memory partitioning, and namespace isolation across tenants or jobs.
- Collaborate with platform, orchestration, and scheduling teams to map logical resources to physical devices based on utilization, priority, and topology.
- Define and enforce resource usage policies, including fair sharing, quota management, and oversubscription strategies.
- Integrate with model training and serving frameworks (e.g., PyTorch, TensorFlow, Triton) to ensure smooth and predictable resource consumption.
- Build observability and telemetry pipelines to trace logical-to-physical mappings, usage patterns, and performance anomalies.
- Partner with infrastructure security teams to ensure secure onboarding, access control, and workload isolation in shared environments.
- Support internal developers in adopting abstraction APIs, ensuring high performance while abstracting away low-level details.
- Contribute to the evolution of internal compute platform architecture, with a focus on abstraction, modularity, and scalability.
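To give a flavor of the abstractions described above, here is a minimal, hypothetical sketch (the class and field names are illustrative, not an internal API) of a control-plane component that maps logical vGPU requests onto physical devices while enforcing per-tenant quotas:

```python
from dataclasses import dataclass


@dataclass
class PhysicalGPU:
    """A physical device tracked by the control plane."""
    device_id: str
    total_mem_gb: int
    allocated_gb: int = 0


@dataclass
class Tenant:
    """A tenant with a memory quota across all devices."""
    name: str
    quota_gb: int
    used_gb: int = 0


class LogicalGPUPool:
    """Maps logical vGPU requests to physical devices, enforcing tenant quotas."""

    def __init__(self, devices, tenants):
        self.devices = devices
        self.tenants = {t.name: t for t in tenants}
        self.mappings = []  # (tenant, device_id, mem_gb) for observability

    def allocate(self, tenant_name, mem_gb):
        tenant = self.tenants[tenant_name]
        if tenant.used_gb + mem_gb > tenant.quota_gb:
            raise RuntimeError(f"quota exceeded for tenant {tenant_name}")
        # Least-loaded first-fit placement; a real scheduler would also
        # weigh topology, priority, and interconnect locality.
        for dev in sorted(self.devices, key=lambda d: d.allocated_gb):
            if dev.total_mem_gb - dev.allocated_gb >= mem_gb:
                dev.allocated_gb += mem_gb
                tenant.used_gb += mem_gb
                self.mappings.append((tenant_name, dev.device_id, mem_gb))
                return dev.device_id
        raise RuntimeError("no device with sufficient free memory")
```

The `mappings` list stands in for the logical-to-physical trace that the observability pipeline in the responsibilities above would consume.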
Minimum Qualifications:
- Bachelor's degree + 15 years of related experience, or Master's degree + 12 years of related experience, or PhD + 8 years of related experience
- Experience building scalable, production-grade infrastructure components or control planes using Go, Python, and C++.
- Experience with virtualization, containerization, and orchestration frameworks such as Kubernetes, Docker, or KubeVirt
- Experience designing or implementing logical resource abstractions for compute, storage, or networking, with a focus on multi-tenant environments.
- Experience integrating with AI/ML platforms or pipelines (e.g., PyTorch, TensorFlow, Triton Inference Server, MLflow).
Preferred Qualifications:
- Experience with GPU sharing, scheduling, or isolation techniques (e.g., MPS, MIG, time-slicing, device plugin frameworks, or vGPU technologies).
- Solid grasp of resource management concepts including quotas, fairness, prioritization, and elasticity.
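As an illustration of the fairness concepts named above, the sketch below implements max-min fair sharing, one standard way to allocate a divisible resource (e.g., GPU-hours) among competing users; the function name and shape are illustrative, not a reference to any particular scheduler:

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation of a divisible resource.

    Repeatedly splits the remaining capacity equally among unsatisfied
    users; anyone demanding less than the equal share gets exactly their
    demand, and the freed capacity is redistributed to the rest.
    """
    alloc = {user: 0.0 for user in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)
    while unsatisfied:
        share = remaining / len(unsatisfied)
        done = {u for u in unsatisfied if demands[u] <= share}
        if not done:
            # No one can be fully satisfied: everyone gets the equal share.
            for u in unsatisfied:
                alloc[u] = share
            break
        for u in done:
            alloc[u] = float(demands[u])
            remaining -= demands[u]
        unsatisfied -= done
    return alloc
```

With 10 units of capacity and demands of 2, 8, and 10, the small request is fully satisfied and the remaining 8 units are split evenly between the other two; with ample capacity, every user simply receives its demand.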