Reliability & Performance: Design, implement, and maintain systems and processes to ensure the high availability, performance, and scalability of our production platform.
Automation: Develop and implement automation for infrastructure provisioning, deployment, monitoring, and incident response, reducing manual toil and improving operational efficiency.
Observability: Implement and enhance comprehensive monitoring, logging, and alerting solutions to provide deep insights into system health and performance.
Incident Management: Lead incident response efforts, conduct root cause analyses, and implement preventative measures to minimize future occurrences.
Capacity Planning: Collaborate with development teams to forecast resource needs and ensure the platform can handle anticipated growth and traffic spikes.
System Design & Architecture: Provide input on system architecture and design, advocating for reliability, scalability, and operational best practices from the outset.
Tooling & Infrastructure: Evaluate, select, and implement new tools and technologies to improve our platform's reliability, security, and operational capabilities.
Collaboration & Mentorship: Work closely with development, QA, and security teams to embed reliability practices throughout the software development lifecycle. Mentor junior engineers on SRE principles and best practices.
Documentation:
Experience: 5+ years of experience in a DevOps, SRE, or similar role focused on platform reliability and operations.
Cloud Platforms: Strong hands-on experience with at least one major cloud provider (e.g., AWS, Azure, GCP).
Containerization & Orchestration: Expertise with Docker and Kubernetes for deploying and managing microservices.
Infrastructure as Code: Proficiency with IaC tools such as Terraform, CloudFormation, or Ansible.
Scripting & Programming: Strong scripting skills (e.g., Python, Bash) and experience with at least one general-purpose language or runtime (e.g., Go, Java, Node.js) for automation and tool development.
Monitoring & Alerting: Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic) and logging systems (e.g., ELK Stack, Splunk).
CI/CD: Solid understanding of, and hands-on experience with, CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
AI Code Generation: Familiarity with foundational AI concepts and practical experience with AI-powered code generation tools (e.g., OpenAI Codex, GitHub Copilot, Anthropic Claude, Cursor, Windsurf), or an understanding of transformer-based code generation, is a significant asset.
Networking: Fundamental understanding of networking concepts (TCP/IP, DNS, Load Balancing, Firewalls).
Databases: Familiarity with database operations, performance tuning, and backup/recovery strategies (SQL and NoSQL).
Problem-Solving: Exceptional analytical and troubleshooting skills, with a methodical approach to identifying and resolving complex system issues.
Communication: Excellent verbal and written communication skills, capable of effectively communicating technical concepts to diverse audiences.