Our Culture:
As a Site Reliability Engineer (SRE) in the IBM Cloud Infrastructure organization, you will be responsible for ensuring the reliability, scalability, and operational efficiency of IBM Cloud's storage services. You will work closely with development teams, SRE peers, and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. This role involves writing code, building automation, troubleshooting production issues, and improving overall service reliability.
Reliability & Scalability
· Design, build, and maintain highly available, distributed services with a focus on scalability, security, and performance.
· Develop and implement Kubernetes and OpenShift-based solutions to manage containerized applications at scale.
· Create auto-scaling, load balancing, and failover strategies to ensure seamless service availability.
Monitoring & Observability
· Design, implement, and manage monitoring solutions to gain insights into system health and performance.
· Create and maintain intuitive dashboards that provide real-time visibility into critical metrics.
· Set up proactive alerting mechanisms to detect and resolve issues before they impact end users.
Automation & Infrastructure as Code (IaC)
· Develop robust automation scripts using tools such as Terraform and Ansible to simplify infrastructure management.
· Automate repetitive operational tasks to improve system reliability and reduce manual effort.
· Implement CI/CD pipelines for deploying applications on Kubernetes and OpenShift environments.
Incident Management & Troubleshooting
· Respond to alerts, incidents, and outages with a focus on minimizing downtime and restoring services efficiently.
· Conduct thorough Root Cause Analysis (RCA) for critical issues and implement long-term solutions to prevent recurrence.
Disaster Recovery & High Availability
· Design and implement backup and recovery strategies.
· Perform BCDR (Business Continuity and Disaster Recovery) simulations.
· Ensure data redundancy, failover strategies, and failback mechanisms.
Security & Compliance
· Ensure compliance with security best practices and regulatory requirements.
· Implement secret management, encryption, and access control for sensitive infrastructure components.
· Participate in security audits, vulnerability assessments, and compliance automation efforts.
· Work closely with development, operations, and security teams to design and implement resilient architectures.
· Advocate for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews.
Technical Skills
· Programming Languages: Go, Python, Bash, or other scripting languages
· Cloud & Infrastructure: Kubernetes, Docker, Terraform, IBM Cloud, AWS, or other cloud providers
· CI/CD & Automation: GitHub Actions, Jenkins, Ansible
· Monitoring & Logging: IBM Cloud Monitoring tools, Prometheus, Grafana
Required Experience:
· Experience in SRE, DevOps, or Software Engineering roles.
· A solid understanding of cloud infrastructure and operations is required.
· Proficiency in Kubernetes (certifications such as CKA or CKS are a plus).
· Strong experience with OpenShift for managing containerized applications.
· Proficiency in Go, Python, or Bash for automation and tool development.
· Deep understanding of Linux internals, system administration, and troubleshooting.
· Experience in building and managing infrastructure with Terraform, Ansible, or similar IaC tools.
· Expertise in CI/CD tools such as Jenkins and GitHub Actions.
· Expertise in logging and monitoring tools to ensure system observability and performance.
· Strong knowledge of networking concepts, firewalls, and security best practices in Kubernetes.
Required Education
· Bachelor’s degree in Computer Science Engineering or Information Technology
Required Experience
· 3-4 years