What you’ll be doing:
Building systems to support the maintenance, scaling, and operation of diverse, global compute platforms spanning multiple cloud providers.
Driving continuous cost optimization for compute resources, focusing on efficiency and expenditure management.
Designing and implementing flexible solutions to ensure adequate compute capacity and resource availability, support diverse workload requirements and new compute initiatives, and meet fluctuating demands.
Building, maintaining, and optimizing orchestration functions by mapping workload requirements to cloud provider capabilities, implementing workers, and refining job queue and scaling systems.
Managing and maintaining artifacts to establish a consistent baseline compute capability across all supported cloud providers and regions.
What we need to see:
Bachelor’s degree in Computer Science, a related technical field, or equivalent experience.
8+ years of DevOps experience optimizing, deploying, and running heterogeneous containerized applications (Docker, Kubernetes) across trust boundaries, on AWS, Azure, and GCP, including hands-on work with EKS, AKS, and GKE.
Practical experience building scalable, reliable services and distributed system integration topologies.
Hands-on experience maintaining AWS security groups and IAM policies, roles, and role delegation.
Proficiency in Python and Linux shell scripting for automation, application development, system administration, and problem resolution.
Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.
Demonstrated ability as a meticulous problem-solver with strong analytical skills, capable of rapidly diagnosing and resolving complex technical challenges.
Excellent communication, teamwork, and collaboration skills, with an ability to articulate technical concepts clearly to diverse audiences and lead technical responses during incidents.
Ways to stand out from the crowd:
Proven experience with event-driven architectures using pub/sub patterns (e.g., AWS SNS/SQS, Google Pub/Sub, Azure Service Bus).
Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as RAG and vector databases.
Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT); production experience with NVIDIA NIM is a strong plus.
Experienced in building and running CI/CD pipelines (Jenkins, GitLab CI) and applying SRE principles to automate, enhance reliability, and improve performance.
Familiarity with Python-based Learning Management Systems (LMS) such as Open edX as well as practical experience with highly heterogeneous compute deployments.
You will also be eligible for equity and benefits.