IBM SRE Architect 
India, Karnataka, Bengaluru 
833389198

Your Role and Responsibilities
As an Architect for Site Reliability Engineering, your focus is to ensure that the designed solution meets non-functional requirements such as reliability, availability, performance, security, and maintainability. You will work closely with the development team and the related release and extended support teams.
  • You will bring a strong engineering focus to operations, using your leadership to identify methods for preventing incidents and for improving observability, automation frameworks, self-service infrastructure, logging and metrics, and operational reports.
  • You will be expected to use tools for logging, monitoring, event management, notification, runbook automation, ChatOps, and root cause analysis.
  • You will work with automation engineers, QA engineers, and the development team to ensure seamless delivery of our service offerings.
  • You will build sufficient expertise in the IBM Cloud control plane to create automated monitoring processes.

In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes.

Your primary responsibilities include:

  • 24×7 Observability
  • Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
  • Deployment and Configuration: Leverage continuous integration and continuous delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
  • Security and Compliance Implementation: Implement security measures that meet or exceed industry standards and regulations such as GDPR, SOC2, ISO 27001, PCI, HIPAA, and FBA.
  • Maintenance and Support
  • Keeping your assigned site or service up and running or getting it back up and running quickly when failure occurs
  • Working closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
  • Writing, updating, and using documentation, including runbooks/playbooks
  • Automating work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
  • Debugging complex problems across an entire stack and creating solid solutions
  • Developing CI/CD processes to improve cadence
  • Persistent testing of application and infrastructure resiliency over a variety of error conditions.
  • Partnering with security engineers and developing plans and automation to aggressively and safely respond to new risks and vulnerabilities.
  • Develop, communicate, and monitor standard processes to promote the long-term health and sustainability of operational development tasks.
  • Stand up and maintain pre-production and developer environments to support the entire development organization and improve overall team velocity
  • Use metrics and analytics to determine reliability issues and remove them through automation and tooling
  • Be an advocate for our customers, providing them self-diagnosing tools to resolve common issues that arise in the field


Required Technical and Professional Expertise

  • 10+ years of SRE/Level 3 support experience
  • A solid understanding of Cloud infrastructure/operations
  • Expertise in Linux internals
  • Experience debugging complex problems
  • Experience designing, building, and operating large-scale production systems
  • Expertise in Ansible, Bash, and core Python development
  • Strong familiarity with one of C, C++, Go, Python, or Java
  • Experience with container technologies such as Docker and Kubernetes
  • Experience with standard industry tools for monitoring and observability
  • Experience automating infrastructure, configuration management, testing, and deployments using tools like Ansible and Chef, and the ability to explain the Infrastructure as Code paradigm
  • A strong understanding of diverse infrastructure platforms and infrastructure concepts
  • Hands-on experience using source control and feature branching strategies
  • An understanding of networking and messaging, especially between services
  • Solid experience in infrastructure operations automation and IT service management, with hands-on exposure to data center administration, configuration, incident management, and support
  • Strong communication skills


Preferred Technical and Professional Expertise

  • IBM Cloud API knowledge
  • Behavior Driven Development
  • Experience in Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery
  • Familiarity with cloud deployment tooling such as Razee and LaunchDarkly