About The Job:
In this role, you will assist with the architectural design, deployment, and support of HPC clusters as we bring them from infancy to the enterprise. You will help identify our development requirements, uncover and recommend solutions, and plan and drive those solutions to production.
Responsibilities for this role will include:
- Design, implement, and support high-performance compute clusters.
- Work with engineering teams to identify the hardware systems that will support different GPU cards.
- Implement parallel file systems.
- Apply attention to detail to generate hardware bills of materials (HW BOMs) for HPC clusters, provide vendor management, and coordinate HW release activities.
- Use strong Linux skills to configure appropriate operating systems for the HPC system.
- Understand and assemble the project specifications and performance requirements at the subsystem and system levels. Adhere to project timelines to ensure program milestones are completed on time.
- Support design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts and documentation to the manufacturing team and customer support team.
- Handle end-of-life (EOL) parts requalification for long-term system deployments.
- Support critical issues both in-house and in the field.
Required Qualifications:
- Proven, in-depth, distribution-agnostic knowledge of Linux systems (SUSE, Red Hat, Rocky, Ubuntu).
- Solid understanding of, and implementation experience with, parallel file systems such as Lustre, GPFS, and BeeGFS.
- Strong HPC hardware knowledge, particularly in servers, GPUs, networking (InfiniBand), storage, BIOS, and BMC.
- Experience with systemd, network boot/PXE, and Linux HA.
- Strong understanding of TCP/IP fundamentals and knowledge of protocols such as DNS, DHCP, HTTP, LDAP, and SMTP.
- Ability to write shell and Python scripts.
- Experience with one or more configuration management tools (Salt, Chef, Puppet, etc.).
Preferred Qualifications:
- Strong DevOps focus: knowledge of setting up a continuous development pipeline (Jenkins), Git-based repository software, Singularity and Docker containers, Kickstart installs, content manager, and Azure Repos and Pipelines.
- Experience with Prometheus and Grafana.
- Knowledge of Apache/Nginx.
- BS or MS degree plus 3 to 5 years of proven experience.
- Degree in Computer Science, Computer Engineering, or a related field.
Skills and Abilities:
- Team Orientation & Interpersonal – Highly motivated team player able to develop and maintain collaborative relationships at all levels, both within and outside the organization.
- Self-Motivation – Able to start tasks independently.
- Organization & Time Management – Able to plan, schedule, organize, and follow up on tasks related to the job to achieve goals within or ahead of established time frames.
- Multitasking – Able to organize, coordinate, manage, prioritize, and perform multiple tasks simultaneously, and to swiftly assess a situation, determine a logical course of action, and apply the appropriate response.
- Adaptability to Change – Able to be flexible and supportive, and to assimilate change positively and proactively in a rapid-growth environment.
- Outstanding team player with excellent written and verbal communication skills.
Minimum Qualifications:
Master's degree and 3 years of related work experience, or Bachelor's degree and 5 years of related work experience.