What you will do
Design and build observability and optimization tools for large-scale GenAI workloads running on Kubernetes
Develop systems to collect and analyze model performance metrics, logs, and resource usage in real time
Innovate in the MLOps and AI observability domain by contributing to upstream communities
Collaborate with product, engineering, and research teams to improve model trust and performance
Write unit and integration tests and work with quality engineers to ensure product quality
Use CI/CD best practices to deliver solutions into Red Hat OpenShift AI (RHOAI) as part of our productization efforts
Contribute to a culture of continuous improvement by sharing technical knowledge and insights
Communicate effectively with stakeholders and team members to ensure visibility of ML performance
Represent RHOAI in external engagements including open source communities and customer meetings
Mentor and guide junior engineers and contribute to team growth
What you will bring
Experience in machine learning engineering, with a focus on production-grade systems
Proficiency in Python with a focus on AI/ML infrastructure or tooling
Experience working with Kubernetes, OpenShift, or other cloud-native platforms
Familiarity with ML observability tools (e.g., Prometheus, OpenTelemetry, Grafana)
Hands-on experience with source control tools such as Git
Passion for open-source technology and collaborative development
Strong troubleshooting skills and system-level thinking
Ability to work autonomously and thrive in a fast-paced environment
Excellent written and verbal communication skills
The following will be considered a plus:
Master’s degree or higher in computer science, machine learning, or related discipline
Contributions to open-source projects, especially in the MLOps or ML observability domain
Experience with public cloud services (AWS, GCP, Azure)
Background in developing or deploying MLOps platforms or AI monitoring tools