What You'll Be Doing:
A large part of the day-to-day job is collaborating with partners to develop and drive programs around storage, networking, and compute in our growing fleet of data centers.
Lead, cultivate, and mentor a multi-national team of sysadmins and DevOps engineers in support of the chip design teams
Ensure the highest reliability of HPC clusters. Develop critical metrics and program schedules to measure program health, predictability, and achievements
Identify failures, lead retrospective analysis, and help develop improvement action plans. Build standard methodologies that cut through complexity, can be used across NVIDIA, and influence other partners toward continuous improvement
Evaluate the latest technologies (hardware and cloud computing) and recommend how the infrastructure should evolve. Plan deployments and refreshes of hardware (compute, storage, network equipment) and the associated software stack (e.g. OS)
Work multi-functionally with hardware engineering leaders to support their future chip design needs, understand their workflow characteristics, and engineer an efficient HPC environment. Work with IT and engineering infrastructure teams on the different subsystems that comprise the computing environment.
Lead all aspects of the HPC scheduler (LSF), set/adjust policy, ensure delivery of forecasted compute demand to each hardware division, and drive high utilization.
Track software licensing servers and drive efficient license utilization
Develop and manage program schedules, milestones and deliverables. Adjust in the face of a highly fluid customer product roadmap.
Regularly communicate program status and key issues to senior management at NVIDIA’s headquarters. Accurately represent the importance of issues and escalate them appropriately. Be the evangelist of data-driven project management
What We Need to See:
B.S. or M.S. in Computer Science, Computer Engineering, Information Science (or equivalent experience)
15+ years of overall experience
5+ years managing IT infrastructure teams of 10+ people
10+ years of experience running Linux servers, NFS storage, and Ethernet networks
Knowledge of HPC schedulers (IBM LSF preferred)
Knowledge of hardware design workflows (EDA tools and methodology)
Experience using project management and capacity planning software
Experience with data center operations (rack and stack, maintenance)
Ways to stand out from the crowd:
HPC storage (e.g. NetApp, Pure Storage, Lustre, ZFS, Isilon)
InfiniBand (operations, debugging, performance tuning)
Software development, especially in a DevOps context
Knowledge of relational databases, data lakes, and metrics/visualization/analytics platforms
Deploying and maintaining FlexLM-based software license servers
Established relationships with enterprise-level equipment suppliers
You will also be eligible for equity.