Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 5+ years of applied experience
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with the ability to implement these practices within an application or platform
- Fluency in at least one programming language (e.g., Python, Java Spring Boot, .NET)
- Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
- Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
- Experience with troubleshooting common networking technologies and issues
- Ability to identify and solve problems related to complex data structures and algorithms
- Drive to self-educate and evaluate new technology
- Ability to teach new programming languages to team members
Preferred qualifications, capabilities, and skills
- Expertise in observability tooling, including Dynatrace, Splunk, Grafana, Datadog, and CloudWatch
- Exposure to chaos testing and chaos experiments using Gremlin
- Expertise in AWS services sufficient to design for reliability and to identify improvement opportunities in existing architectures
- Working experience in application development using Java, Python, or another programming language
- Certifications in AWS, Splunk, Dynatrace, Terraform, Python, or similar
- Basic to intermediate understanding of mainframe applications (COBOL, JCL, DB2, IMS, CICS, VSAM)