Job summary
Job responsibilities
- Provides end-to-end application or infrastructure service delivery to enable successful business operations of the firm
- Supports the day-to-day maintenance of the firm’s systems to ensure operational stability and availability
- Assists in monitoring production environments for anomalies and addresses issues using standard observability tools
- Identifies issues for escalation and communication, and provides solutions to business and technology stakeholders
- Analyzes complex situations and trends to anticipate and resolve incident, problem, and change management issues in support of full stack technology systems, applications, or infrastructure
- Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers
- Supports the adoption of site reliability engineering best practices within the team
Required qualifications, capabilities, and skills
- Bachelor's Degree in Computer Science or equivalent
- Formal training or certification in computer science concepts and 2+ years of applied experience
- Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform
- Proficiency in at least one programming language, such as Python, Java/Spring Boot, or .NET
- Proficiency in software applications and technical processes within a given technical discipline (e.g., cloud, artificial intelligence, Android)
- Experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
- Familiarity with troubleshooting common networking technologies and issues
- Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision
- Ability to proactively recognize roadblocks, and a demonstrated interest in learning technologies that facilitate innovation
- Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
- Ability to initiate and implement ideas to solve business problems
Preferred qualifications, capabilities, and skills
- Working understanding of public cloud; AWS certification is a plus
- Experience working in SRE-model teams and with IT best practices such as the ITIL framework, Agile, and SDLC
- Experience with continuous integration and continuous delivery tools such as Jenkins, GitLab, or Terraform
- Familiarity with containers and container orchestration technologies such as Docker, Kubernetes, and ECS
Note: This is a shift-based role that includes regular weekend rotations, with compensatory days off provided.