Day-to-day activities:
Attention to detail is required
Explore & coordinate modern HPC software adoption into KLA’s tools.
Documentation – Ability to produce detailed documentation such as proposals, architecture, build, test, and assembly docs, and Visio drawings
Strong fundamental understanding and knowledge of servers, GPUs, HW/SW-based networking, and processing infrastructures.
Align and coordinate with appropriate vendors on hardware roadmaps, equipment purchasing, shipping, and RMAs
Schedule and report status on HPC system program development, which includes requirements, concept and detailed design, implementation, evaluation and validation, regulatory and packaging testing, and release activities.
Design, test, install, and validate hardware (servers, GPUs, networking, packaging) for early-stage evaluation and prototyping.
Own CI/CD pipelines that ensure stability of system-level HPC software deployments
Support and drive improvements to in-house and customer solutions to reach the highest level of system uptime. Travel (10%) to root-cause issues and understand them firsthand.
Your Encouraged Background
Proven experience designing and developing high performance computing (HPC) hardware/software infrastructure for AI applications.
Deep understanding of server and networking hardware, operating systems, and high-performance applications for high-speed data I/O and storage.
10+ years of previous experience generating custom OS images for Linux servers
Strong scripting skills (Bash, Python)
Proven understanding of Kubernetes and Docker or other container-based systems
Experience with system management and monitoring tools such as Prometheus, Grafana, and collectd (and its plugins)
Ability to work well with developers & test engineers
Passion for maintaining a stable, high-quality cluster against all odds!
Minimum Qualifications
Bachelor's Level Degree and related work experience of 12 years, or Master's Level Degree and related work experience of 8 years