Consistently models and champions site reliability culture and practices and exerts technical influence throughout your team
Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels
Drives collaboration with your team to identify comprehensive service level indicators, and partners with stakeholders to establish reasonable service level objectives and error budgets with your customers
Offers a high level of technical expertise within one or more technical domains and proactively identifies and solves for technology-related bottlenecks in your areas of expertise
Serves as the main point of contact during major incidents for your application and has the skills to identify and solve the issue quickly to avoid financial loss to the business
Documents and shares knowledge within your organization via internal forums and communities of practice
Required qualifications, capabilities, and skills
Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
Extensive experience setting up infrastructure on a cloud platform (AWS) using Terraform
Fluent in at least one programming language such as Python, Java/Spring Boot, or .NET
Advanced knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
Proficient knowledge and experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
Proficient with containers and container orchestration (Docker, ECS, Kubernetes)
Experience with troubleshooting common networking technologies and issues
Experience identifying and solving complex data structures and algorithms-related problems
Actively self-educates, evaluates new technology, and recommends suitable ones
Possesses 7+ years of experience, ideally working with data/Python applications in a production environment
Experience with automation tools and solutions such as Ansible, AutoSys, and Control-M