What you will do
Manage, deploy, and operate cloud solutions at scale using the principles of Site Reliability Engineering
Participate in the design, implementation, and reliability of ML pipelines.
Participate in the design and development of new features to enable Data-as-a-Service.
Design and write automation software to provision, upgrade, monitor, and heal the Data-as-a-Service offering.
Identify single points of failure and other high-risk architecture issues; propose and implement more resilient solutions.
Define Service Level Objectives (SLOs) and implement them, along with the corresponding runbooks.
Participate in product release cycles, deploying code to integration, staging and production environments, integrating with CI/CD tooling, monitoring and change management
Interact with automated monitoring and healing infrastructure to ensure healthy environments
Help peers grow through knowledge sharing, mentoring, and collaboration
Create and maintain standard operating procedures (SOPs) for performing maintenance tasks, applying configuration changes and remediating problems in our environment
Participate in a follow-the-sun on-call rotation
Contribute software tests and participate in peer review to increase the quality of our codebase
What you will bring
Mandatory: 4+ years of software engineering experience in one or more programming languages such as Golang, Python, Java, or Ruby. You should be able to write code hands-on every day in any of these languages or an equivalent one.
Experience developing and managing Infrastructure as Code (IaC) automation platforms, such as Terraform or equivalent.
Experience in troubleshooting as-a-service offerings (SaaS, PaaS, etc.)
Experience with at least one public cloud service.
Mandatory: hands-on experience with Kubernetes/OpenShift.
Experience writing or maintaining Kubernetes Operators is nice to have.
Prior experience in building ML Pipelines is a huge plus.
Knowledge of or prior experience with GitLab Pipelines, TektonCD/ArgoCD, or Kubeflow is a huge plus.
Experience developing and using monitoring and observability tools and stacks.
Experience maintaining SLOs for the services you are responsible for.
Being customer (internal or external) focused is a must.
Good communication skills, and experience working within a team and collaborating with other teams.
Ability to quickly learn new technologies.