Expoint - all jobs in one place

NVIDIA Senior Solution Architect HPC AI - NVIS
United States, Texas
475924318

01.12.2024

What You’ll Be Doing:

  • Build and enable robust AI/HPC infrastructure for customers

  • Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, training stability, real-time monitoring, logging, and alerting

  • Engage in and improve services from inception and design through deployment, operation, and optimization

  • Co-design telemetry of AI workloads to help engineering build solutions for more robust workloads at scale

  • Communicate across internal teams to support the continuous improvement of NVIDIA's offerings and software designs

What We Need to See:

  • Strong foundational expertise, from a BS, MS, or Ph.D. degree in Engineering, Mathematics, Physics, Computer Science, Data Science, or similar (or equivalent experience).

  • 8+ years of experience with neural networks, including a good understanding of transformer architectures, and experience designing large-scale AI workloads with SLURM and/or Kubernetes

  • Proficiency in Python, C++, Rust, or other popular programming languages

  • Excellent verbal and written communication and technical presentation skills in English

  • You are motivated to work with multiple levels and teams across organizations

  • Strong analytical and problem-solving skills

  • Strong time-management and organization skills for coordinating multiple initiatives and priorities, and for integrating new technologies and products into highly sophisticated projects

  • You are a curious self-starter with a desire for continuous learning and sharing knowledge across the team

Ways to Stand Out from the Crowd:

  • Experience orchestrating distributed deep learning training with SLURM

  • Proficiency in DevOps, including hands-on experience with Ansible, Terraform or similar tools. Equivalent experience will be accepted as well.

  • 8+ years designing solutions with one or more Tier-1 Clouds (AWS, Azure, GCP or OCI) and cloud-native architectures and software

  • Technical leadership with a strong understanding of NVIDIA technologies, and success in working with customers

  • Expertise with parallel file systems (e.g., Lustre, GPFS, BeeGFS, WekaIO) and high-speed interconnects (InfiniBand, Omni-Path, and GigE)

  • Experience with integration and deployment of software products in production enterprise environments, and microservices software architecture

You will also be eligible for equity.