In this Site Reliability and Automation Engineer role, you will work closely with the Data Center, the entire Cloud development organization and IBM vendors to support, maintain and operationally improve the cloud infrastructure. Your focus will be the following key responsibilities:
• Automate health monitoring of the production and test systems
• Automate return-to-service procedures for Cloud Platform components
• Support the compliance and security integrity of the environment through your work
• Partner with other teams, functional managers and program managers to deliver mission-critical services to the market
• Support development of new and existing capabilities for our compute, storage and network services
• Integrate automation with operational requirements
• Work with Engineering to:
o Define operational requirements
o Automate operational requirements
o Participate in the full deployment pipeline
• Work with Support and Development to:
o Identify and resolve issues
o Discuss and plan integration requirements
• Extensive hands-on experience administering large production system environments, including virtual platforms
• Experience in establishing, following, and improving operational procedures within a mission-critical environment
• Experience in data center infrastructure or relevant work experience
• Experience in large-scale infrastructure design, engineering, and support
• Experience in IT change, incident, problem and asset management
• 5+ years of infrastructure engineering with a proven record of delivering high-quality, large-scale solutions, including experience designing architectures for scale and performance
• Must be proficient in writing, debugging and maintaining scripts (Bash and Python)
• Must be extremely comfortable using and navigating within a Linux environment
• Ability to do low-level debugging and problem analysis by examining logs and running Unix commands
• 2-3 years of extensive experience with open-source products
• 3-5 years of experience with configuration management systems (Ansible / Chef)
• Hands-on knowledge of Splunk or ELK
• Working knowledge of network and storage technologies
• Working knowledge of ServiceNow, JIRA, Confluence, and GitHub
• Excellent written and verbal communication skills
• Comfortable operating in a fast-paced environment
Preferred Technical and Professional Expertise