IBM SRE - Backup & Recovery Storage DevOps 
Costa Rica 
800874627

27.03.2025

Our Culture:

As a Site Reliability Engineer (SRE) in the IBM Cloud Infrastructure organization, you will be responsible for ensuring the reliability, scalability, and operational efficiency of IBM Cloud's storage services. You will work closely with development teams, SRE peers, and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. This role involves writing code, building automation, troubleshooting production issues, and improving overall service reliability.

Your role and responsibilities

Reliability & Scalability

· Design, build, and maintain highly available, distributed services with a focus on scalability, security, and performance.

· Develop and implement Kubernetes and OpenShift-based solutions to manage containerized applications at scale.

· Create auto-scaling, load balancing, and failover strategies to ensure seamless service availability.

Monitoring & Observability

· Design, implement, and manage monitoring solutions to gain insights into system health and performance.

· Create and maintain intuitive dashboards that provide real-time visibility into critical metrics.

· Set up proactive alerting mechanisms to detect and resolve issues before they impact end users.

Automation & Infrastructure as Code (IaC)

· Develop robust automation scripts using tools such as Terraform and Ansible to simplify infrastructure management.

· Automate repetitive operational tasks to improve system reliability and reduce manual effort.

· Implement CI/CD pipelines for deploying applications on Kubernetes and OpenShift environments.

Incident Management & Troubleshooting

· Respond to alerts, incidents, and outages with a focus on minimizing downtime and restoring services efficiently.

· Conduct thorough Root Cause Analysis (RCA) for critical issues and implement long-term solutions to prevent recurrence.

Disaster Recovery & High Availability

· Design and implement backup and recovery strategies.

· Perform BCDR (Business Continuity and Disaster Recovery) simulations.

· Ensure data redundancy, failover strategies, and failback mechanisms.

Security & Compliance

· Ensure compliance with security best practices and regulatory requirements.

· Implement secret management, encryption, and access control for sensitive infrastructure components.

· Participate in security audits, vulnerability assessments, and compliance automation efforts.

· Work closely with development, operations, and security teams to design and implement resilient architectures.

· Advocate for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews.

Required education
Bachelor's Degree
Preferred education
Bachelor's Degree
Required technical and professional expertise

Technical Skills

· Programming Languages: Go, Python, Bash, or other scripting languages

· Cloud & Infrastructure: Kubernetes, Docker, Terraform, IBM Cloud, AWS, or other cloud providers

· CI/CD & Automation: GitHub Actions, Jenkins, Ansible

· Monitoring & Logging: IBM Cloud Monitoring tools, Prometheus, Grafana

Required Experience:

· Experience in SRE, DevOps, or Software Engineering roles.

· An understanding of Cloud infrastructure/operations is a must.

· Proficiency in Kubernetes (certifications such as CKA or CKS are a plus).

· Strong experience with OpenShift for managing containerized applications.

· Proficiency in Go, Python, or Bash for automation and tool development.

· Deep understanding of Linux internals, system administration, and troubleshooting.

· Experience in building and managing infrastructure with Terraform, Ansible, or similar IaC tools.

· Expertise in CI/CD tools such as Jenkins and GitHub Actions.

· Expertise in logging and monitoring tools to ensure system observability and performance.

· Strong knowledge of networking concepts, firewalls, and security best practices in Kubernetes.

Required Education

· Bachelor’s degree in computer science engineering or information technology

Required Experience

· 3-4 years

Being an IBMer means you’ll be able to learn and develop yourself and your career. You’ll be encouraged to be courageous and experiment every day, all whilst having continuous trust and support in an environment where everyone can thrive, whatever their personal or professional background.

OTHER RELEVANT JOB DETAILS

For additional information about location requirements, please discuss with the recruiter following submission of your application.