

You will be required to deeply understand technology landscapes and evaluate the use of new technologies. You will be influential within your team and work with peers and senior leaders to define and revise the standards for operational excellence across systems. You will consistently tackle abstract issues that span multiple functional areas and drive your team to push for improvements that can scale across other teams, services, and platforms.
Key job responsibilities
Identify performance bottlenecks in compute infrastructure and propose solutions to address them.
Provide support for cluster and node management, ensuring smooth operation of GenAI infrastructure.
Participate in design and code reviews and identify bottlenecks.
Troubleshoot and research root causes thoroughly and fix defects.
Continuously improve and automate our cluster/capacity/maintenance upgrades.
Experience setting up and managing CI/CD pipelines using tools such as AWS CodePipeline, GitHub Actions, or similar platforms.
Familiarity with Infrastructure as Code (IaC) tools such as AWS CloudFormation, Terraform, or the AWS CDK is a valuable asset. An understanding of networking concepts such as VPCs, subnets, and security groups, along with experience configuring load balancers and Route 53, is also desirable.
You should have hands-on experience with Kubernetes.
- 3+ years of administrative experience in networking, storage systems, operating systems and hands-on systems engineering experience
- Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, Rust
- Experience with Linux/Unix
- Experience with CI/CD pipelines and build processes
- Experience with distributed systems at scale