Job responsibilities
- Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels
- Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers
- Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise
- Ensures the reliability, availability, and performance of PAM systems leveraging vendor products including CyberArk, BeyondTrust, Microsoft, and HashiCorp Vault
- Develops and maintains automation scripts and tools to streamline PAM operations and reduce manual intervention
- Responds to and resolves incidents related to PAM systems in a timely manner
- Conducts regular security assessments and audits of PAM systems
- Works closely with Technology Teams to ensure seamless integration and operation of PAM systems
- Provides weekend coverage on Saturday and Sunday from 8 AM to 5 PM SGP, when needed
- Participates in an on-call rotation for after-hours support as needed
Required qualifications, capabilities, and skills
- Bachelor’s Degree in Computer Science, Engineering, Mathematics or other related disciplines
- 5+ years or equivalent expertise in site reliability engineering
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Fluency in at least one programming language (e.g., Python, Java Spring Boot, .NET)
- Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
- Proficiency and experience in observability such as white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
- Experience with troubleshooting common networking technologies and issues
- Ability to identify and solve problems related to complex data structures and algorithms
- Ability to expand and collaborate across different levels and stakeholder groups