Resilience & Reliability are fundamental to ensuring that modern architectures are available, performant and fault-aware.
A Resilience & Reliability Consultant helps design the roadmap to achieve resilience for Enterprise IT.
A Reliability Consultant acts as a technical advisor who shapes the transformation roadmap to modernize IT delivery with SRE principles, frameworks and levers; in a nutshell, setting up SRE in Enterprise IT.
They will also implement the Reliability / SRE roadmap and govern SRE solutions across the enterprise and its lines of business.
They will be able to assess the Resilience & Reliability maturity of an IT organization and provide a strategy and roadmap to achieve higher maturity levels.
Responsibilities
Defining SLA/SLO/SLI for a product / service
Engineering resilient design and implementation practices into solutions as they go through the product life cycle
Designing & implementing Observability Solutions to track, report, and measure SLA adherence
Engineering out manual effort (Toil) through the development of automated processes and services (e.g., Automated Management of Systems, CI/CD improvements)
Optimizing the cost of IT infrastructure and operations (FinOps)
Reviewing, analyzing and improving deployed products with respect to product architecture and inter-service dependencies (simplification)
Typical Skills and Background
15+ years of experience in software product engineering principles, processes and systems
Hands-on experience in Java / J2EE, one web server (Apache Tomcat or IBM HTTP Server), one application server (Tomcat/WebSphere), and a major RDBMS such as Oracle
Hands-on experience in at least one CI/CD tool (Azure DevOps, GitLab CI/CD, Jenkins) and at least one IaC tool (Terraform, AWS CloudFormation, Ansible etc.)
Experience in at least one cloud platform (AWS, Azure, GCP etc.) and container platform (Docker, Pivotal, Kubernetes, OpenShift etc.), and in its reliability tools (Azure Application Insights, CloudWatch, Azure Monitor etc.)
Experience in Observability: APM tools (Dynatrace, AppDynamics etc.) and metrics / log consolidation (Splunk, ELK Stack)
Experience in defining NFRs and SLA/SLO/SLI agreements for a product / platform / service
Knowledge of queuing models, thread pools, request servicing processes etc.
Experience in Linux (RHEL) performance monitoring: the operating system parameters, their interpretation, and the commands used for monitoring
Experience in Web Services, SOA, ESB (DataPower) and RESTful services
Knowledge of application design patterns, J2EE application architectures, Microservices, Spring Boot and cloud-native architectures