Western Digital Sr HPC Engineer - LSF/Slurm RHEL/CentOS/SUSE Ansible 
United States, Georgia 
453927721

24.07.2024

Company Description

Today’s exceptional challenges require your unique skills. It’s You & Western Digital. Together, we’re the next BIG thing in data.

Job Description

Western Digital’s High-Performance Computing environments are key to bringing new storage solutions to market. As a Senior High-Performance Computing (HPC) engineer in the IT Infrastructure team, you will be at the heart of Western Digital’s engineering and product development process, delivering the IT HPC infrastructure and services that empower engineering teams to develop new storage technologies and deliver high-quality products to market quickly.

What you’ll be doing:

  • Support multi-site, high-performance compute infrastructure and services for the global engineering product development organizations
  • Design, create, deliver, and support the deployment of Ansible automation within HPC and Unix environments
  • Identify and propose solutions and new services for the distributed ASIC and GPU computing clusters
  • Perform troubleshooting and root cause analysis of HPC clusters and file system related issues
  • Develop and maintain documentation for all aspects of the HPC infrastructure
  • Improve root cause analysis and corrective action for problems large and small – identify patterns and propose how we can automate repetitive tasks
  • Recommend and implement solutions to improve the performance of workloads
  • Support a diverse Engineering Design Automation environment

Tooling

  • GitHub
  • Terraform, Ansible
  • Splunk, Grafana, Prometheus

Infrastructure

  • OS: Red Hat and any related distribution
  • Monitoring tools such as Nagios/Cacti or any equivalent
  • PXE/Kickstart configuration
  • NFS storage management & automounter
  • Installation and support of EDA tools such as Cadence and Synopsys
  • Open-source tool installation and support
  • Unix/Linux authentication with AD
  • Infrastructure automation with scripting knowledge

Qualifications

  • Bachelor’s degree in computer science or equivalent experience
  • 10+ years of Linux systems administration experience, specifically in managing or supporting Red Hat and/or CentOS Linux in production environments
  • Experience with configuration management tools: Ansible, Puppet, Chef
  • Experience with automation
  • Ability to technically lead a project through the lifecycle
  • Scripting skills: highly skilled in at least two typical scripting languages (shell/Bash, Python, Ruby)
  • Excellent problem-solving, multitasking, troubleshooting skills, and attention to detail are required to work in this challenging and dynamic environment
  • Very strong interpersonal, customer service, and team-building skills, with a results-oriented approach