Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier


JFrog Site Reliability Engineering Manager 
Dominican Republic, Santo Domingo 
77985927

27.03.2025
As a Site Reliability Engineering Manager at JFrog you will…
  • Lead, mentor, and develop a high-performing SRE Israel team, fostering collaboration, innovation, and accountability
  • Ensure SaaS reliability, performance, and availability, meeting or exceeding service-level objectives
  • Drive SRE best practices, including capacity planning, incident management, chaos engineering, and disaster recovery
  • Implement proactive monitoring, alerting, and anomaly detection aligned with SaaS standards
  • Collaborate with P&E and Cloud engineering teams to embed reliability into the SDLC
  • Oversee incident management, ensuring swift identification, escalation, and resolution
  • Maintain comprehensive SRE documentation, including processes, incident reports, and system architecture
  • Evaluate and adopt tools, technologies, and methodologies to enhance uptime and reliability
Requirements:
  • 3+ years of management experience leading an SRE, DevOps, or similar team in a SaaS environment
  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
  • Strong expertise in cloud platforms (AWS, GCP, or Azure), containers (Kubernetes, Docker), and configuration management (Terraform, Ansible)
  • Proficiency in Python or Go for automation and system optimization, as well as GitOps experience with SCM tools (e.g., Git, Bitbucket)
  • Strong leadership, communication, and collaboration skills, with experience working across globally distributed teams
  • Familiarity with Agile methodologies, CI/CD pipelines, and orchestration tools (Jenkins, ArgoCD, StackStorm)
  • Familiarity with Chaos Engineering (e.g., Gremlin, Litmus, Chaos Toolkit)
  • Hands-on experience with alerting and observability tools (e.g., PagerDuty, OpsGenie, New Relic, Coralogix)
  • Strong understanding of scalability, high availability, and security best practices in cloud & Kubernetes environments