Expoint - all jobs in one place

IBM Site Reliability Engineer 
India, Karnataka, Bengaluru 
917243935

24.06.2024

Your Role and Responsibilities
  • Automation: Develop and maintain automation tools and scripts to streamline deployment, monitoring, and management of the infrastructure and applications.
  • Monitoring and Alerting: Set up and maintain monitoring and alerting systems to proactively identify and resolve issues before they impact customers or services. This includes participating in on-call rotations to respond promptly to high-priority incidents.
  • Performance Optimization: Identify opportunities for performance optimization and work with development teams to implement improvements.
  • Documentation: Maintain up-to-date documentation for the infrastructure, processes, and procedures.
  • Collaboration: Work closely with development teams, product managers, and other stakeholders to understand requirements and ensure the reliability of the platform.
  • Continuous Improvement: Participate in post-incident reviews, retrospectives, and other forums to identify areas for improvement and drive continuous improvement initiatives.


Required Technical and Professional Expertise

  • Strong Linux systems engineering background with CentOS/RHEL or Debian, including experience building, maintaining, and troubleshooting these systems.
  • Automation and Scripting: Strong scripting skills (e.g., Bash, Python) and experience with configuration management tools (e.g., Ansible, Chef, Puppet) to automate deployment and management tasks.
  • Excellent Git skills (merging, branching, forking).
  • Experience with Cloud Platforms: Strong experience with cloud platforms such as IBM Cloud, AWS, Azure, or Google Cloud Platform, including expertise in:
    • Deploying and managing services in these environments.
    • Managing and troubleshooting containerized applications.
  • Troubleshooting and Problem Solving: Strong troubleshooting skills and the ability to quickly identify and resolve complex issues in a production environment, including experience with incident response and post-incident analysis.


Preferred Technical and Professional Expertise

  • DevOps Culture: Experience working in a DevOps culture and mindset, including a strong understanding of the collaboration between development and operations teams to achieve business goals.
  • Container Orchestration: Proficiency in container orchestration tools such as Nomad or Kubernetes, including experience with HashiCorp Consul/Vault or equivalents.
  • Monitoring and Logging: Experience with monitoring and logging tools (e.g., ELK stack, Grafana, Prometheus) to monitor the health and performance of infrastructure and applications, including experience building and maintaining these tools.
  • Security: Knowledge of implementing security best practices and maintaining compliance with standards such as the Center for Internet Security (CIS) Benchmarks and FedRAMP.
  • Security: Ability to patch software or adjust configurations to mitigate Common Vulnerabilities and Exposures (CVE) in a timely fashion.
  • Experience with clustered time-series database technologies such as InfluxDB, as well as experience with distributed event streaming using Kafka and Telegraf.
  • CI/CD: Experience with application deployment using CI/CD tools such as Jenkins and Tekton.
  • Working knowledge of GitHub, Jira, Confluence, and ServiceNow.