We are seeking a highly motivated and experienced Senior Platform Reliability Engineer (PRE) to join our growing team. In this critical role, you will be responsible for ensuring the reliability, scalability, and performance of our core platform and services. You will apply Site Reliability Engineering (SRE) principles to automate operations, improve system resilience, and drive a culture of continuous improvement across our engineering organization.
Reliability & Performance: Design, implement, and maintain systems and processes to ensure the high availability, performance, and scalability of our production platform.
Automation: Develop and implement automation for infrastructure provisioning, deployment, monitoring, and incident response, reducing manual toil and improving operational efficiency.
Observability: Implement and enhance comprehensive monitoring, logging, and alerting solutions to provide deep insights into system health and performance.
Incident Management: Lead incident response efforts, conduct root cause analyses, and implement preventative measures to minimize future occurrences.
Capacity Planning: Collaborate with development teams to forecast resource needs and ensure the platform can handle anticipated growth and traffic spikes.
System Design & Architecture: Provide input on system architecture and design, advocating for reliability, scalability, and operational best practices from the outset.
Tooling & Infrastructure: Evaluate, select, and implement new tools and technologies to improve our platform's reliability, security, and operational capabilities.
Collaboration & Mentorship: Work closely with development, QA, and security teams to embed reliability practices throughout the software development lifecycle. Mentor junior engineers on SRE principles and best practices.
Documentation:
Experience: 5+ years of experience in a DevOps, SRE, or similar role focused on platform reliability and operations.
Cloud Platforms: Strong hands-on experience with at least one major cloud provider (e.g., AWS, Azure, GCP).
Containerization & Orchestration: Expertise with Docker and Kubernetes for deploying and managing microservices.
Infrastructure as Code: Proficiency with IaC tools such as Terraform, CloudFormation, or Ansible.
Scripting & Programming: Strong scripting skills (e.g., Python, Bash) and experience with at least one general-purpose programming language or runtime (e.g., Go, Java, Node.js) for automation and tool development.
Monitoring & Alerting: Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic) and logging systems (e.g., ELK Stack, Splunk).
CI/CD: Solid understanding of, and hands-on experience with, CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
AI Code Generation: Familiarity with foundational AI concepts and practical experience with AI-powered code generation tools (e.g., OpenAI Codex, GitHub Copilot, Anthropic Claude, Cursor, Windsurf), or an understanding of transformer-based code generation, will be a significant asset.
Networking: Fundamental understanding of networking concepts (TCP/IP, DNS, Load Balancing, Firewalls).
Databases: Familiarity with database operations, performance tuning, and backup/recovery strategies (SQL and NoSQL).
Problem-Solving: Exceptional analytical and troubleshooting skills, with a methodical approach to identifying and resolving complex system issues.
Communication: Excellent verbal and written communication skills, capable of effectively communicating technical concepts to diverse audiences.
Education: Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.