Your Role and Responsibilities
- Monitoring the health of the IKS control plane and ensuring reliable operations
- Responding promptly to production issues and alerts
- Executing changes in the production environment through advanced automation
- Partnering with other SRE teams and program managers to deliver mission-critical services
- Supporting the development and enhancement of Platform-as-a-Service offerings
- Implementing and automating solutions that support IBM Cloud products
- Ensuring compliance and security integrity of the environment
- Collaborating with Engineering to troubleshoot and resolve production issues
- Providing technical escalation support for other Infrastructure Operations teams
Required Technical and Professional Expertise
- Expertise in Kubernetes architecture, including the latest features and security aspects
- Strong debugging skills in Kubernetes environments
- Strong experience in programming with Python or Go, with demonstrated ability to develop and maintain complex codebases
- Proficiency in network configuration and advanced monitoring solutions such as Prometheus, Sysdig, and Grafana
- Experience in hands-on administration of cloud infrastructure, particularly Kubernetes-based platforms
- Skills in performance tuning and optimization of Kubernetes clusters, including resource quota management, scaling, and efficient use of underlying infrastructure
- Understanding of network protocols (TCP/IP, HTTP, etc.) and network configuration tools (e.g., CNI) specific to Kubernetes environments
- Deep understanding of Kubernetes security practices, including network policies, security contexts, role-based access control (RBAC), and the secure handling of secrets
- Knowledge of automation and configuration management tools: Ansible, Salt, Chef, Terraform
- Strong Linux skills for managing services across a microservices platform
- Ability to implement robust incident management strategies and frameworks
- Understanding of disaster recovery planning and high availability setups in Kubernetes environments
- Excellent written and verbal communication skills, with a willingness to take on call-out responsibilities
- Experience establishing and improving procedures within a mission-critical environment
Preferred Technical and Professional Expertise
- Hands-on experience with at least one cloud platform (IKS, AWS, Azure, GCP) and with integrating cloud services for storage, security, and databases
- Knowledge of Slack bot automation for infrastructure and cloud maintenance and for other SRE tasks
- Active participation in Kubernetes communities and forums
- Vendor management skills to ensure optimal service levels and cost control
- Ability to mentor and train teams on Kubernetes best practices and operational strategies