Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier

IBM Site Reliability Engineer 
India, Telangana, Hyderabad 
72253929

Today

Your Role and Responsibilities
We are looking for a candidate with 3+ years of experience in the following areas.

  • Monitoring the health of the IKS control plane and ensuring reliable operations
  • Responding promptly to production issues and alerts
  • Executing changes in the production environment through advanced automation
  • Partnering with other SRE teams and program managers to deliver mission-critical services
  • Supporting the development and enhancement of Platform-as-a-Service services
  • Implementing and automating solutions that support IBM Cloud products
  • Ensuring compliance and security integrity of the environment
  • Collaborating with Engineering to troubleshoot and resolve production issues
  • Providing technical escalation support for other Infrastructure Operations teams


Required Technical and Professional Expertise

  • Expertise in Kubernetes architecture, including the latest features and security aspects
  • Strong debugging skills in Kubernetes environments
  • Strong programming experience in Python or Go, with a demonstrated ability to develop and maintain complex codebases
  • Proficiency in network configuration and advanced monitoring solutions such as Prometheus, Sysdig, and Grafana
  • Hands-on experience administering cloud infrastructure, particularly Kubernetes-based platforms
  • Skills in performance tuning and optimization of Kubernetes clusters, including resource quota management, scaling, and efficient use of underlying infrastructure
  • Understanding of network protocols (TCP/IP, HTTP, etc.) and network configuration tools (e.g., CNI) specific to Kubernetes environments
  • Deep understanding of Kubernetes security practices, including network policies, security contexts, role-based access control (RBAC), and the secure handling of secrets
  • Knowledge of automation and configuration management tools such as Ansible, Salt, Chef, and Terraform
  • Strong Linux skills for managing services across a microservices platform
  • Ability to implement robust incident management strategies and frameworks
  • Experience in performance optimization of Kubernetes clusters
  • Understanding of disaster recovery planning and high availability setups in Kubernetes environments
  • Excellent written and verbal communication skills, with a willingness to take on call-out responsibilities
  • Experience establishing and improving procedures within a mission-critical environment


Preferred Technical and Professional Expertise

  • Hands-on experience with at least one cloud infrastructure platform (IKS, AWS, Azure, GCP) and with integrating cloud services for storage, security, and databases
  • Knowledge of Slack bot automations for infra/cloud maintenance and SRE-based automations
  • Active participation in Kubernetes communities and forums
  • Vendor management skills to ensure optimal service levels and cost control
  • Ability to mentor and train teams on Kubernetes best practices and operational strategies