Expoint - all jobs in one place

NVIDIA Principal Engineer, Distributed Machine Learning
United States, California 
335430263

01.09.2024

What you’ll be doing:

  • Design and develop new user-friendly APIs and libraries to optimally use existing DL/ML frameworks in GPU-enabled Spark clusters for distributed DL/ML training and inference at scale.

  • Design and develop GPU accelerated ML libraries for distributed training and inference on Spark clusters, e.g., to improve our existing open source library.

  • Demonstrate superior performance of developed solutions on industry standard benchmarks and datasets.

  • Make technical contributions to enhance capabilities of open source projects such as RAPIDS, XGBoost, and Apache Spark.

  • Work with NVIDIA partners and customers on deploying distributed ML algorithms in the cloud or on-premises.

  • Keep up with published advances in distributed ML systems and algorithms.

  • Provide technical mentorship to a team of engineers.

What we need to see:

  • BS, MS, or PhD in Computer Science, Computer Engineering, or a closely related field (or equivalent experience).

  • 12+ years of work or research experience in software development.

  • 5+ years of experience as a technical lead in distributed machine learning and/or deep learning.

  • 3+ years of open source development experience.

  • 3+ years of hands-on experience with Spark MLlib, XGBoost, and/or PyTorch.

  • Knowledge of internals of Apache Spark MLlib.

  • Experience with Kubernetes, YARN, Spark, and/or Ray for distributed ML orchestration.

  • Proven technical skills in designing, implementing and delivering high-quality distributed systems.

  • Excellent programming skills in C++, Scala, and Python.

  • Familiarity with agile software development practices.

Ways to stand out from the crowd:

  • Familiarity with NVIDIA libraries is a plus.

  • Familiarity with NVIDIA GPUs and CUDA is also a strong plus.

  • Familiarity with Horovod, Petastorm and other existing/past distributed learning libraries is desirable.

  • Experience working with multi-functional teams across organizational boundaries and geographies.

You will also be eligible for equity and benefits.