Expoint - all jobs in one place

Nvidia Senior AI HPC Clusters Lead 
United States, Texas 
927447728

18.08.2024

What you'll be doing:

  • Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.

  • Supporting our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows, and improving reliability and overall researcher productivity.

  • Architect and implement brand new strategies to optimize the utilization of our AI computing clusters, driving operational efficiency and resource maximization.

  • Pioneer innovative solutions to streamline support processes, enabling our team to manage an unprecedented scale of GPU resources (10,000+ GPUs per support person).

  • Lead the charge in building a future-proof AI computing infrastructure, ensuring seamless scalability and resilience to power groundbreaking AI models and applications.

  • Collaborate with multi-functional teams to identify bottlenecks and opportunities for optimization, continuously improving the performance and cost-effectiveness of our AI computing operations.

  • Empower your team with the tools, processes, and standard methodologies necessary to thrive in a dynamic, high-intensity environment, fostering a culture of operational excellence and continuous improvement.

  • Partner closely with AI researchers to understand their needs and devise strategies and plans to address their pain points.

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • 12+ years of relevant experience, including a minimum of 6 years leading AI/ML and software development teams.

  • Consistent track record of leading high-performance teams in delivering innovative solutions to complex computational challenges, with a demonstrated ability to drive operational excellence and continuous improvement.

  • Exceptional problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions that can accommodate the ever-increasing demands of AI computing.

  • Demonstrated leadership capabilities, with the ability to inspire and motivate multi-functional teams, fostering a culture of collaboration, innovation, and steadfast pursuit of operational excellence.

  • Strong communication and collaboration skills, enabling you to effectively articulate technical concepts to diverse audiences and align priorities across the organization.

  • A passion for pushing the boundaries of what's possible in AI computing, with an aim to continuously explore and implement emerging technologies and standard processes to maintain our competitive edge.

  • Strong team leadership and team-building skills: the ability to coach and grow talent, cultivate a healthy engineering culture, attract and retain talent, and lead a diverse, broad, and impactful team.

Ways to stand out from the crowd:

  • Experience with Machine Learning and Deep Learning concepts, algorithms and models

  • Familiarity with InfiniBand with IBOP and RDMA

  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

You will also be eligible for equity.