Expoint - all jobs in one place

KLA Sr Platform Engineer - GenAI
United States, Michigan, Ann Arbor 
841074477

12.03.2025

Responsibilities

  • Identify and resolve infrastructure gaps to ensure reliable, efficient, and scalable solutions
  • Develop advanced AI/ML infrastructure solutions that enhance the efficiency of our skilled ML teams
  • Design and implement solutions for critical areas, including distributed storage systems, scheduling systems, high availability capabilities, and core reliability issues within our large-scale GPU clusters
  • Monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization
  • Develop and deploy automation tools, monitoring solutions, and operational strategies to streamline infrastructure management and reduce manual tasks
  • Work with various teams, including ML developers, data engineers, and DevOps professionals, to create a cohesive and integrated AI/ML infrastructure ecosystem
  • Implement and manage GPU infrastructure within Kubernetes clusters to support high-performance computing and AI/ML tasks
  • Deploy and manage open-source GenAI components, such as vector databases and various AI/ML models, ensuring seamless integration and optimal performance
  • Evaluate and integrate new open-source GenAI tools and technologies to enhance the platform’s capabilities
  • Collaborate with the research and development teams to implement and optimize innovative AI/ML models and algorithms
  • Ensure the security and compliance of open-source GenAI components within the infrastructure
  • Leverage High-Performance Computing (HPC) experience to optimize and manage large-scale AI/ML workloads
  • Design, implement, and manage on-premises, cloud, and hybrid-based ML platforms to support diverse AI/ML workloads and ensure flexibility and scalability

Minimum Qualifications

  • Bachelor's degree or equivalent training/certifications in Computer Science or a related IT field
  • Eight (8) years of experience implementing and maintaining AI/ML infrastructure in an on-premises environment
  • Strong experience with AI/ML infrastructure and tools, including GPU clusters and Kubernetes
  • Proficiency in deploying and managing open-source GenAI components and vector databases
  • Hands-on experience with high-performance computing (HPC) environments
  • Expertise in designing and managing on-premises, cloud, and hybrid-based ML platforms
  • Solid understanding of distributed storage systems, scheduling systems, and high availability capabilities