Expoint - all jobs in one place

Nvidia Senior HPC Solutions Architect AI 
Taiwan, Taiwan Province, Hsinchu 
39790515

02.05.2024

What you will be doing:

  • Primary responsibilities will include managing and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.

  • Support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting.

  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.

  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

  • Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • 8+ years providing in-depth support and deployment services, solving problems for hardware and software products.

  • Knowledge and experience with Linux system administration: process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, and network routing/advanced networking (tuning and monitoring).

  • Experience with HPC/AI cluster management technologies, e.g., Bright Cluster Manager.

  • Minimum of a four-year degree from an accredited university or college or equivalent experience in Computer Science, or Electrical or Computer Engineering.

  • Scripting proficiency (Bash, Ansible, etc.).

  • Good interpersonal skills with the ability to maintain and deliver resolutions for customer-blocking issues as they arise.

  • Strong organizational skills and the ability to prioritize and multi-task easily with limited supervision.

  • Experience with HPC/AI Schedulers, primarily Kubernetes, with consideration for Slurm, LSF, etc.

Ways to stand out from the crowd:

  • InfiniBand experience.

  • Experience with GPU-focused hardware/software.

  • Experience with MPI.

  • Automation tooling background (Ansible, Salt, Puppet, etc.).

  • Experience with Ethernet and parallel storage technologies.