What you'll be doing:
Design, deploy, and manage IaaS platforms with a focus on high availability and performance.
Automate infrastructure operations using tools like Terraform, Ansible, and Python.
Improve efficiency by automating repetitive workflows.
Develop monitoring and observability tooling to detect and prevent outages using Prometheus, Grafana, ELK, etc.
Deploy and troubleshoot non-disruptive cloud operations with an emphasis on secure production infrastructure.
Manage deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and other orchestration tools.
Provide day-to-day support for engineering activities with CI/CD tools like Git and Jenkins.
Implement and enforce best practices around infrastructure security, access control, and operational efficiency.
What we need to see:
BS degree in Computer Science, Software Engineering, or a related field (or equivalent experience).
3–5+ years of experience in a Site Reliability, DevOps, or Systems Engineering role.
Strong automation and scripting skills with Ansible, Python, and shell.
Experience in IaaS environments, including deploying, configuring, and administering Linux-based bare metal servers.
Deep experience in infrastructure engineering, focused on managing and monitoring a highly available production infrastructure.
Skilled in observability practices, using Prometheus, Grafana, ELK/EFK, and integrated alerting systems.
Solid grasp of Linux internals and core networking concepts including NAT, DNS, DHCP, routing, and firewall configuration with iptables or nftables.
Experience with modern deployment architecture for non-disruptive cloud operations, including blue-green and canary rollouts.
Proficiency in Kubernetes, Docker, QEMU, and Libvirt.
Ways to stand out from the crowd:
Hands-on expertise with AWS, including deploying complex, load-balanced, and highly available workloads.
Proficiency in debugging network issues in both infrastructure and SDN.
Experience with performance tuning and benchmarking across storage, compute, or networking.
Experience implementing robust metrics collection and alerting infrastructure.
Familiarity with compliance standards such as FedRAMP, HIPAA, and SOC 2.
You will also be eligible for equity.