This role requires not only excellent engineering skills but also strategic vision, thought leadership, and the ability to mentor and influence others across multiple teams.
What You Will Do:
Lead the quality strategy and implementation for Kubernetes-native components in Model Serving, including Custom Resources, Controllers, and Operators.
Own and evolve automated test architecture with a focus on PyTest, CI/CD, integration testing, and end-to-end testing in Kubernetes environments.
Partner with engineering, product, and community teams to define testability requirements, ensure early validation, and prevent regressions.
Design tests that validate system-level properties including scalability, autoscaling, observability, and reliability for AI workloads.
Participate in and influence upstream communities (KServe, Kubeflow, ModelMesh, etc.), raising quality standards and sharing best practices.
Drive efforts to mock, simulate, and validate model serving use cases in hybrid cloud and on-prem environments.
Serve as a technical mentor and go-to expert for Python-based testing frameworks and Kubernetes-native validation strategies.
Take a lead role in debugging complex system-level issues, especially in multi-tenant, distributed AI systems.
Champion shift-left testing and early validation practices across the Red Hat OpenShift AI (RHOAI) stack.
What You Will Bring:
Proven expertise with Kubernetes API development and testing (CRs, Operators, Controllers). Experience working directly with Custom Resources and reconciliation logic is essential.
Strong programming and testing experience in Python, especially with PyTest in large, scalable codebases. Golang knowledge is a plus.
Deep understanding of Kubernetes internals, networking, and lifecycle hooks. Experience with OpenShift is a plus.
Extensive knowledge of CI/CD pipelines, especially in containerized or cloud-native ecosystems (e.g., GitHub Actions, Tekton, Jenkins).
Strong knowledge of test strategy for ML model serving systems, including considerations for runtime performance, isolation, and failure recovery.
Experience troubleshooting distributed systems and validating observability with tools such as Prometheus, Grafana, and OpenTelemetry.
A proven ability to lead technical projects and mentor others across teams and time zones.
Excellent communication skills and comfort presenting to engineers, managers, and external stakeholders.
Preferred (Nice-to-Have):
Hands-on experience with KServe, ModelMesh, Ray, vLLM, or other model serving frameworks.
Familiarity with Red Hat Service Mesh, Istio, Knative, or similar serverless/K8s-native middleware stacks.
Experience with performance/load testing frameworks and chaos testing in Kubernetes.
Contribution history in open-source projects or technical leadership in community forums.
The salary range for this position is $116,270.00 - $191,840.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave