Expoint - all jobs in one place

Nvidia Solutions Architect - InfiniBand HPC 
United States, Texas 
559357454

01.09.2024

What you will be doing:

  • Primary responsibilities will include deploying, managing, and validating AI/HPC infrastructure in Linux-based environments for new and existing customers.

  • Be the domain expert for customers, from planning calls through implementation.

  • Create and hand over related documentation and perform the knowledge transfers required to support customers as they roll out some of the most sophisticated systems in the world!

  • Provide feedback to internal teams such as opening bugs, documenting workarounds, and suggesting improvements.

What we need to see:

  • 5+ years providing in-depth support and deployment services; solving problems for hardware and software products.

  • Knowledge and experience with Linux system administration/DevOps, process management, package management, task scheduling, kernel management, boot procedures, troubleshooting, performance reporting/optimization/logging, and network routing/advanced networking (tuning and monitoring).

  • Experience in configuring, testing, validating, and resolving issues with LAN and InfiniBand networking, including use of validation tools for InfiniBand health and performance (ibdiag, etc.) and UFM (Unified Fabric Manager).

  • Experience with benchmarking tools such as HPL, NCCL tests, and MLPerf.

  • Scripting proficiency (Bash, Python, Ansible, etc.) and automation tooling background (Ansible, Puppet, etc.).

  • Familiarity with schedulers such as SLURM, LSF, UGE, etc.

  • Kubernetes experience.

  • Excellent interpersonal communication skills and the ability to deliver resolutions for customer issues as they arise. Strong self-organizational skills and the ability to prioritize/multi-task easily with limited supervision.

  • A willingness to travel to customer sites within the United States.

  • Minimum of a four-year degree from an accredited university or college in Computer Science, Electrical or Computer Engineering, or equivalent experience.

Ways to stand out from the crowd:

  • Knowledge of cluster management technologies (bonus credit for BCM, Base Command Manager).

  • Experience with GPU (Graphics Processing Unit) focused hardware/software.

  • Experience with MPI (Message Passing Interface).

  • Storage technologies such as Lustre or GPFS.

  • Familiarity with Dell and Supermicro GPU platforms.

You will also be eligible for equity and benefits.