Who you are:
As a Site Reliability Engineer (SRE) and DevOps Engineer in Storage, you will ensure that the designed solution meets non-functional requirements such as reliability, availability, performance, security, and maintainability. You will work closely with the development team and the related Release and L2 teams.
- You will bring a strong engineering focus to operations, putting your energy into preventing incidents and improving observability, automation frameworks, self-service infrastructure, logging and metrics, and operational reporting.
- You will be expected to use tools for logging, monitoring, event management, notification, Runbook Automation, ChatOps, and Root Cause Analysis.
- You will work with Automation Engineers and QA Engineers to ensure seamless delivery of our service offerings.
- You will build sufficient expertise in the IBM Cloud control plane (IMS) to create automated monitoring processes.
Responsibilities:
- Keeping your assigned site or service up and running, or getting it back up and running quickly when a failure occurs
- Working closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
- Writing, updating, and using documentation, including runbooks/playbooks
- Automating work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
- Debugging complex problems across an entire stack and creating solid solutions
- Developing CI/CD processes to improve release cadence
- Persistently testing application and infrastructure resiliency under a variety of error conditions
- Partnering with security engineers to develop plans and automation that respond aggressively and safely to new risks and vulnerabilities
- Developing, communicating, and monitoring standard processes to promote the long-term health and sustainability of operational development tasks
- Standing up and maintaining pre-production and developer environments to support the entire development organization and improve overall team velocity
- Using metrics and analytics to identify reliability issues and remove them through automation and tooling
- Being an advocate for our customers, providing them with self-diagnostic tools to resolve common issues that arise in the field
Required Professional and Technical Expertise:
- 4+ years of total experience
- A solid understanding of cloud infrastructure and operations is a must
- Knows their way around a Unix/Linux shell, can write shell scripts, and understands Linux internals
- Experience debugging complex problems
- Experience designing, building, and operating large-scale production systems
- Expertise in Ansible, Bash, and core Python development
- Strong familiarity with one of C, C++, Go, Python, or Java
- Experience with DevOps engineering or SRE
- Experience with container technologies such as Docker and Kubernetes
- Experience with industry-standard monitoring and observability tools such as Prometheus and Grafana
- Experience automating infrastructure, configuration management, testing, and deployments using tools like Ansible and Chef, and the ability to explain the Infrastructure as Code paradigm
- A strong understanding of diverse infrastructure platforms and infrastructure concepts
- Has hands-on experience using source control and feature branching strategies
- Understands networking and messaging, especially between services
- Solid experience in infrastructure operations automation and IT Service Management, with hands-on exposure to data center administration, configuration, incident management, and support
- Strong communication skills