*Visa sponsorship provided
As a Senior Technical Account Manager (Sr. TAM) specializing in GPU Optimization in AWS Enterprise Support, you will play a crucial role in two key missions: guiding customers' GPU acceleration initiatives across AWS's comprehensive compute portfolio, and spearheading the development of optimization strategies that revolutionize customer workload performance.
Key Job Responsibilities
- Design and optimize GPU resource usage on EC2/EKS/SageMaker or equivalent cloud compute, container, and ML services; implement node pool tiering, Karpenter/Cluster Autoscaler tuning, auto scaling, and cost governance (Savings Plans/RI/Spot/ODCR or equivalent).
- Drive GPU partitioning and multi-tenant resource sharing strategies to reduce idle resources and increase overall cluster utilization.
- Build GPU observability and monitoring systems (nvidia-smi, CloudWatch or equivalent monitoring tools, profilers, distributed communication metrics) to align capacity planning with SLOs.
- Ensure compatibility across GPU drivers, CUDA, container runtimes, and frameworks; standardize change management and rollback processes.
Diverse Experiences
AWS values diverse experiences. Even if you do not meet all of the preferred qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.
Mentorship & Career Growth
We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional.
Work/Life Balance
- 5+ years in cloud technical support, solutions architecture, or customer success management, with at least 3 years of hands-on experience in GPU/accelerated computing platforms.
- In-depth understanding of GPU instance families (e.g., AWS G/P/H series) or similar offerings from other cloud providers, AMI/driver/CUDA/container compatibility management, and cloud storage/network performance tuning (e.g., S3 I/O, EBS/Instance Store equivalents, preprocessing pipelines).
- Proficient in scheduling GPU workloads with EKS or equivalent Kubernetes-based orchestration services, including node pool tiering, resource quotas, elastic scaling, and auto-recovery strategies.
- Experienced in multi-GPU/multi-node distributed computing (NCCL, topology awareness, tensor parallelism, pipeline parallelism) with expertise in communication optimization for large-scale AI training and inference.
- Skilled in PyTorch/TensorFlow performance analysis and optimization, including DataLoader tuning, mixed precision, operator fusion, and inference acceleration toolchains (ONNX, TensorRT, CUDA Graphs).
- Experienced in cost and capacity governance, familiar with Savings Plans, RI, ODCR, Spot, Capacity Blocks, and right-sizing strategies or their equivalents in other cloud platforms.
- Demonstrated cross-functional communication and influence skills, capable of driving technical solutions with data and business objectives.
- AWS Solutions Architect Professional, Machine Learning Specialty, or DevOps Professional certification or equivalent credentials from other cloud providers.
- Hands-on experience with NVIDIA ecosystem software and toolchains (CUDA/cuDNN/NCCL, TensorRT, CUDA Graphs) and proven ability to maintain performance consistency across versions and platforms.
- Delivered quantifiable performance improvements (GPU throughput, latency reduction, cost savings) with demonstrated benchmarking and regression testing methodology.
- Proven repeatable optimization results in LLM inference, batch AI training, real-time video processing, or high-performance computing (HPC).
- Contributions to open source projects (Run:ai, Ray, vLLM, DeepSpeed, Kubeflow, etc.) or published technical articles, whitepapers, or performance benchmarking.
- Experience with Infrastructure as Code (Terraform, AWS CDK, or equivalent cloud development frameworks), Helm Charts, baseline container image management, and DevOps automation.
- Able to present performance-business tradeoffs and results to senior stakeholders using PR/FAQ documents, architecture diagrams, and capacity/cost reports.