Our Culture:
Who you are:
As a Site Reliability Engineer (SRE) in the IBM Cloud Infrastructure organization, you will be responsible for ensuring the reliability, scalability, and operational efficiency of IBM Cloud's storage services. You will work closely with development teams, SRE peers, and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. This role involves writing code, building automation, troubleshooting production issues, and improving overall service reliability.
How we’ll help you grow:
You’ll have access to all of our technical and management training courses, so you can become the expert you want to be.
You’ll learn directly from senior team members and leaders in this field.
Key Responsibilities:
Reliability & Scalability
· Design, build, and maintain highly available, distributed storage services with a focus on reliability, scalability, and security.
· Implement auto-scaling, load balancing, and failover strategies to ensure seamless service availability.
· Analyze performance bottlenecks, optimize system efficiency, and contribute to capacity planning efforts.
Automation & Infrastructure as Code
· Develop infrastructure automation using PHP, Go, Kubernetes, and other cloud-native technologies.
· Implement self-healing mechanisms and automated remediation processes to minimize manual intervention.
Incident Management & Monitoring
· Respond to production incidents, participate in root cause analyses (RCA), and implement long-term fixes to improve system resilience.
· Collaborate on observability solutions, including monitoring, logging, and alerting, using tools like Prometheus, Grafana, Splunk, and IBM Cloud Monitoring.
Security & Compliance
· Ensure compliance with security best practices and regulatory requirements.
· Implement secret management, encryption, and access control for sensitive infrastructure components.
· Participate in security audits, vulnerability assessments, and compliance automation efforts.
· Work closely with development, operations, and security teams to design and implement resilient architectures.
· Advocate for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews.
Technical Skills
· Programming Languages: PHP, Go, Python, Bash, or other scripting languages
· Cloud & Infrastructure: Kubernetes, Docker, Terraform, IBM Cloud, AWS, or other cloud providers
· Storage Technologies: NetApp, Ceph, GlusterFS, NFS, or other cloud storage solutions
· CI/CD & Automation: GitHub Actions, Jenkins, Ansible, ArgoCD
· Monitoring & Logging: Prometheus, Grafana, ELK stack, Splunk, Datadog
· 2+ years of experience in SRE, DevOps, or Software Engineering roles.
· A solid understanding of cloud infrastructure and operations is a must.
· Comfort working in a Unix/Linux shell, the ability to write shell scripts, and an understanding of Linux internals.
· Experience with the software development life cycle, test-driven development, continuous integration, and continuous delivery.
· Experience with container technologies such as Docker, Kubernetes, and OpenShift.
· Familiarity with Linux systems administration, networking, and distributed systems.
· Experience with troubleshooting production incidents and implementing permanent fixes.
· Ability to write clean, maintainable, and efficient automation code.
· Familiarity with Ansible, Bash, core Python development, and deployments in production environments.
· Familiarity with one of C, C++, Go, Python, or Java.
· PHP and Perl development experience.
· Experience with monitoring tools such as Grafana, the ELK stack, Prometheus, Nagios, and Sysdig.
· Familiarity with cloud deployment tooling