Expoint - all jobs in one place

NVIDIA Senior Manager, Professional Services, HPC Deployment
United States, Texas 
756165080

24.06.2024

What you will be doing:

  • Direct and supervise the HPC service engineering functions in designing, developing, installing, and validating hardware and software for customer AI high-performance computing (HPC) systems.

  • Lead, manage, mentor, and build a high-performing HPC service engineering team to deliver innovative advances in AI high-performance computing systems.

  • Lead the planning, implementation, and performance of our HPC projects. Improve the integrity of system bring-up and related services by applying advanced technical and operational knowledge to configure and maintain HPC AI network and server platforms.

  • Drive the HPC team's hardware and software deployment; plan, develop, and deploy procedures for system validation.

  • Lead team activities and drive test plans, custom scripts, and testing procedures for customers' HPC AI system implementations to ensure operational reliability.

  • Support the HPC engineering team, working with other internal collaborators to develop and run a comprehensive strategy for service quality and continuous service improvement. Support governance for software engineering by implementing standards and quality measures.

  • Lead team member development, helping team members set and achieve goals for their career growth. Develop an inclusive environment that values team member differences and creates a sense of belonging and appreciation. Contribute to a culture of trust and clarity.

  • Build strong relationships with NVIDIA leaders, customers, partners, and collaborators. Work closely with them to identify, implement, and support NVIDIA's leading AI solutions engineering, staying current with industry standards and innovations. Provide input on process optimization, department budgeting, and the monitoring and management of resources.

  • Be the domain authority for customers from planning calls through implementation.

What we need to see:

  • 8+ years of overall experience in IT, high-performance computing, or another related field; 3+ years of experience in a management or leadership role.

  • Demonstrated expertise in HPC system design, configuration, and planning.

  • Proficiency with low-latency/high-bandwidth interconnect infrastructure (InfiniBand and Ethernet).

  • Expertise with HPC system software, including cluster management/provisioning tools and job schedulers (Slurm, Salt, xCAT).

  • Proficiency with shared and distributed memory parallelism (OpenMP, MPI, NCCL and HPL) and accelerators (GPUs).

  • Strong scripting ability (Bash, Perl, Python, etc.) and experience with programming fundamentals.

  • Expertise in administering, supervising, and maintaining secure Linux/Unix operating systems (CentOS, Solaris).

  • Experience establishing processes for maintaining system performance, managing best-in-class standards, and familiarity with cloud computing and container technologies.

  • Ability to understand and work with large, complex systems, identify and resolve problems, manage performance, and troubleshoot infrastructure-related network issues.

  • Expertise with multi-vendor hardware/software management, security, and network/Internet protocols.

  • Strong communication and social skills, with the ability to provide detailed information and high-level summaries to management-level individuals and groups, present the business side of technical topics to non-technical audiences, and develop positive working relationships and strong rapport with team members.

  • Bachelor's degree in computer science, information systems, or a related field, or equivalent experience.

  • Solid knowledge of HPC storage

  • Exemplary communication and interpersonal skills, with the ability to present the business side of technical topics to non-technical audiences and to manage relationships persuasively and effectively with various stakeholders and diverse individuals and groups.

Ways to stand out from the crowd:

  • InfiniBand experience.

  • Experience with GPU-focused hardware/software.

  • Experience with MPI.

  • Automation tooling background (Ansible, Salt, Puppet, etc.).

  • Experience with Ethernet and with storage technologies such as Lustre or GPFS.

You will also be eligible for equity and .