- Drive large-scale training initiatives to support our most complex models.
- Operationalize large-scale ML workloads on Kubernetes.
- Enhance distributed cloud training techniques for foundation models.
- Design and integrate end-to-end lifecycles for distributed ML systems.
- Develop tools and services to optimize ML systems beyond model selection.
- Architect a robust MLOps platform to support seamless ML operations.
- Collaborate with cross-functional engineers to solve large-scale ML training challenges.
- Research and implement new patterns and technologies to improve system performance, maintainability, and design.
- Lead complex technical projects, defining requirements and tracking progress with team members.
- Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
- Cultivate a team centered on collaboration, technical excellence, and innovation.