Job responsibilities
- Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team
- Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels
- Collaborates with team members to identify comprehensive service level indicators, and with stakeholders and customers to establish reasonable service level objectives and error budgets
- Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise
- Manages and automates CI/CD to achieve one-touch deployment across all application tiers
- Writes specifications and documentation for application release management
- Defines application performance KPIs and creates and manages the capacity framework
- Builds and manages infrastructure, integrates it with core services, and maintains infrastructure hygiene
- Sets up, migrates, and maintains applications in private and public cloud (AWS, Azure)
- Works with Docker containers, automating container image creation, build, and deployment in container environments
- Defines application availability KPIs, sets up monitoring frameworks, and publishes uptime and SLAs
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 10+ years of applied experience.
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Strong Linux/Unix fundamentals and a good understanding of subsystems such as memory, storage, and networking.
- Experience with continuous integration technologies such as Jules, Maven, Ant, Selenium, Cucumber, Mocks, JMeter, and JUnit is expected.
- Ability to understand the business services and map them to reliability engineering design and review.
- Ability to support the technology and business services of the entire technology platform from a scaling and performance perspective.
- Ability to manage the uptime of each microservice by building and implementing the right monitoring and alerts.
- Good understanding of object-oriented programming, relational databases, NoSQL, caching systems, etc.
- Strong problem management abilities, shown by automating repeatable jobs and working with stakeholders to ensure incidents do not recur.
Preferred qualifications, capabilities, and skills
- Self-starter and team player, able to work effectively among and across Tech, Business, and Ops teams.
- Excellent verbal and written communication skills.
- Deep understanding of architectural concepts, issues, and trends.
- Ability to work independently and in a team, and proficiency at researching innovative solutions for challenging technical problems.
- Willingness to pick up and learn new technologies, frameworks and tools as directed.
- Brings a lot of positive energy.