What you’ll be doing:
Build internal profiling and analysis tools for AI workloads at large scale
Build debugging tools for commonly encountered problems, such as memory or networking issues
Create benchmarking and simulation technologies for AI systems or GPU clusters
Partner with HW architects to propose new features or improve existing ones based on real-world use cases
What we need to see:
BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development
Strong software skills in design, coding (C++ and Python), analysis, and debugging
Good understanding of deep learning frameworks like PyTorch and TensorFlow, and of distributed training and inference
Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
Experience with NVIDIA GPUs, CUDA programming, and NCCL
Motivated self-starter with strong problem-solving skills and customer-facing communication skills
Passion for continuous learning. Ability to work concurrently with multiple global groups
Ways to stand out from the crowd:
Proven experience with continuous profiling and analysis tools/platforms at GPU cluster scale
Solid experience in performance analysis of large AI jobs for training/inference workloads
Knowledge of Linux device drivers and/or compiler implementation
Knowledge of GPU and/or CPU architecture and general computer architecture principles
You will also be eligible for equity.