Western Digital’s High-Performance Computing environments are key to bringing new storage solutions to market. As a Senior High-Performance Computing (HPC) Engineer in the IT Infrastructure team, you will be at the heart of Western Digital’s engineering and product development process, delivering the IT HPC infrastructure and services that empower engineering teams to develop new storage technologies and deliver high-quality products to market quickly.
What you’ll be doing:
- Support multi-site, high-performance compute infrastructure and services for the global engineering product development organizations
- Design, create, deliver, and support the deployment of Ansible automation within HPC and Unix environments
- Identify and propose solutions and new services for the distributed ASIC and GPU computing clusters
- Perform troubleshooting and root cause analysis of HPC clusters and file system related issues
- Develop and maintain documentation for all aspects of the HPC infrastructure
- Improve root cause analysis and corrective action for problems large and small – identify patterns and propose how we can automate repetitive tasks
- Recommend and implement solutions to improve the performance of workloads
- Support diverse Engineering Design Automation environments
Tooling
- GitHub
- CI/CD (Jenkins, Terraform, Ansible)
- Splunk, Grafana, Prometheus
Infrastructure
- Kubernetes/OpenShift
- Cloud Computing (AWS, Google Cloud, Azure)
- Cloud Storage Systems (S3, FSx, CVO)
- OS: Red Hat and related distributions
- Containers (Singularity/Docker)