
Nvidia - Engineering Manager, Internal GPU HPC Computing Clusters
India, Karnataka, Bengaluru 
410071968

31.08.2025
India, Bengaluru
Time type: Full time
Posted on: 4 days ago

What you'll be doing:

  • Build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.

  • Support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows, and improve reliability and overall researcher productivity.

  • Architect and implement brand new strategies to optimize the utilization of our AI computing clusters, driving operational efficiency and resource maximization.

  • Pioneer innovative solutions to streamline support processes, enabling our team to manage an unprecedented scale of GPU resources (10,000+ GPUs per support personnel).

  • Lead the charge in building a future-proof AI computing infrastructure, ensuring seamless scalability and resilience to power groundbreaking AI models and applications.

  • Collaborate with multi-functional teams to identify bottlenecks and opportunities for optimization, continuously improving the performance and cost-effectiveness of our AI computing operations.

  • Empower your team with the tools, processes, and standard methodologies necessary to thrive in a dynamic, high-intensity environment, fostering a culture of operational excellence and continuous improvement.

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • Minimum of 4 years of experience leading AI/ML and software development teams as a people manager, with 10+ years of relevant experience overall.

  • Consistent track record of leading high-performance teams in delivering innovative solutions to complex computational challenges, with a demonstrated ability to drive operational excellence and continuous improvement.

  • Exceptional problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions that can accommodate the ever-increasing demands of AI computing.

  • Demonstrated leadership capabilities, with the ability to inspire and motivate multi-functional teams, fostering a culture of collaboration, innovation, and steadfast pursuit of operational excellence.

  • Strong communication and collaborator management skills, enabling you to effectively articulate technical concepts to diverse audiences and align priorities across the organization.

  • A passion for pushing the boundaries of what's possible in AI computing, with an aim to continuously explore and implement emerging technologies and standard processes to maintain our competitive edge.

  • Strong people management and team-building skills: able to coach and grow talent, cultivate a healthy engineering culture, attract and retain talent, and build a diverse, broad, and impactful team.

Ways to stand out from the crowd:

  • Experience with Machine Learning and Deep Learning concepts, algorithms and models

  • Familiarity with InfiniBand, including IPoIB and RDMA

  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow