Expoint - all jobs in one place

Nvidia Senior Compute Cluster Deployment Engineer 
Israel, North District 
729080118

31.07.2024

What you will be doing:

  • Primary responsibilities will include managing and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.

  • Support operational and reliability aspects of large-scale AI clusters, with a focus on performance at scale, real-time monitoring, logging and alerting.

  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

  • Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • 5+ years providing in-depth support and deployment services, solving problems for hardware and software products.

  • Knowledge and experience with Linux System Administration, process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network routing/advanced networking (tuning and monitoring).

  • Cluster management technologies, e.g. Bright Cluster Manager.

  • Scripting proficiency.

  • Good social skills with the ability to maintain and deliver resolutions for customer blocking issues as they arise.

  • Superb communication and presentation/oral skills.

  • Excellent verbal and written English skills.

  • Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.

  • Candidates should have a minimum of a four-year degree from an accredited university or college in Computer Science, or Electrical or Computer Engineering.

  • Industry-standard Linux certifications.

Ways to stand out from the crowd:

  • InfiniBand experience.

  • Experience with GPU focused hardware/software.

  • Experience with MPI.

  • Automation tooling background (Ansible, Salt, Puppet etc.).

  • Ethernet and Storage technologies.