Your Role and Responsibilities
As a Compute Operations Site Reliability Engineer, you will perform the following tasks:
- Remotely administer Power Server hardware environments across numerous datacenter locations around the world (currently 18 datacenters and growing).
- Develop automation to reduce manual toil (repetitive tasks that can be automated) using shell scripts (Bash, etc.), Python, Ansible, and related tools and languages.
- Perform code stack updates on infrastructure systems (VIOS, firmware, PowerVC, HMC, Novalink, NIM servers) as well as cloud supporting systems (jump servers, sobox, network nodes, gateways, TSM servers).
- Upload/maintain stock images.
- Remotely administer AIX and Linux servers.
- Maintain user IDs (add/delete) and passwords.
- Monitor daily/weekly backups to ensure they are working.
- Manage and maintain the Nagios monitoring environment, and troubleshoot scripts and plug-ins when issues arise.
- Perform periodic LPMs, inactive migrations, or remote restarts of customer VMs to carry out system maintenance, balance workloads, or free up resources.
- Monitor and report the capacity utilized in each datacenter.
- Attend meetings scheduled by customers for cutover and maintenance windows.
- Verify capacity requirements when customers encounter provisioning failures.
- Work with customers to resolve any RSCT issues so that LPM activities can be performed without impacting customer workloads.
Required Technical and Professional Expertise
- In-depth knowledge of Power Server hardware.
- Significant scripting/coding experience for automating all aspects of IBM Power systems administration.
- Automation using Python, shell scripting (Bash, etc.), Ansible, and related tools and languages.
- Experience with AIX and Linux administration, commands, and networking; the role requires experience at the OS level.
- Strong experience in one or more of the following: VIO, Novalink, and PowerVC (to include installation, configuration, and administration). Familiarity with at least one more.
- In-depth knowledge of PowerVM including installation/configuration and administration.
- High-level knowledge of the operating systems supported on Power Systems (AIX and IBM i).
- In-depth knowledge of how storage is connected and allocated to Power systems via NPIV connections.
- Good understanding of Power Systems network configuration at the system level.
Preferred Technical and Professional Expertise
- Experience with configuring and tuning PowerVC.
- Experience training new personnel on tooling and processes.
- Storage & Power RTS, MVS Network for Cisco, Juniper; general support skills.