Expoint - all jobs in one place

Nvidia: Senior AI/ML Infra Engineer, Research Clusters
United States, Texas 
601344643

24.06.2024

What this role offers:

  • The chance to contribute to advanced AI/ML infrastructure solutions that directly affect the efficiency of our highly skilled research teams.

  • A dynamic and collaborative environment that values innovation, creativity, and continuous improvement.

  • Competitive compensation and a comprehensive benefits package.

  • Opportunities for professional growth and career advancement within the AI/ML infrastructure domain.

What you will be doing:

  • Work closely with our research teams to comprehend their infrastructure requirements and challenges, translating those observations into actionable enhancements.

  • Design and implement solutions for critical areas such as storage management for datasets and logs, error attribution, and core reliability issues within our large-scale GPU clusters.

  • Continuously monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization.

  • Create and deploy automation tools, monitoring solutions, and effective operational strategies to simplify infrastructure management and minimize manual tasks.

  • Help define and enhance important measures of AI researcher productivity, ensuring that our actions are in line with measurable results.

  • Collaborate with diverse teams, including researchers, data engineers, and DevOps professionals, to create a seamless and integrated AI/ML infrastructure ecosystem.

  • Keep abreast of the latest advancements in AI/ML infrastructure technologies, frameworks, and effective strategies, and promote their implementation within the company.

What we need to see:

  • BS (MS preferred) in Computer Science or a related field, or equivalent experience, with 12+ years of relevant experience.

  • Strong background in software engineering, with experience in building and maintaining large-scale distributed systems, preferably in the context of AI/ML infrastructure.

  • Proficiency in programming languages such as Python, Go, or C++, as well as familiarity with cloud computing platforms (e.g., AWS, GCP, Azure).

  • Hands-on experience with containerization technologies (e.g., Docker, Kubernetes), automation tools (e.g., Ansible, Terraform), and monitoring solutions (e.g., Prometheus, Grafana).

  • Understanding of AI/ML workflows, including data processing, model training, and inference pipelines.

  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.

  • Excellent communication and collaboration skills, with the ability to work effectively with diverse teams and individuals.

  • Enthusiasm for continual learning and keeping abreast of emerging technologies and effective approaches in the AI/ML infrastructure field.

You will also be eligible for equity.