What you will be doing:
Design and prototype scalable software systems that optimize distributed AI training and inference—focusing on throughput, latency, and memory efficiency.
Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.
Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.
Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.
Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.
What we need to see:
Ph.D. or equivalent industry experience in computer science, computer engineering, or a closely related field.
2+ years of experience in systems programming, parallel or distributed computing, or high-performance data movement.
Strong programming background in C++ and Python, ideally with CUDA or other GPU programming models.
Practical experience with AI frameworks (e.g., PyTorch, TensorFlow) and familiarity with how they use communication libraries under the hood.
Experience in designing or optimizing software for high-throughput, low-latency systems.
Strong collaboration skills in a multinational, interdisciplinary environment.
Ways to stand out from the crowd:
Expertise with NCCL, Gloo, UCX, or similar libraries used in distributed AI workloads.
Background in networking and communication protocols, RDMA, collective communications, or accelerator-aware networking.
Deep understanding of large model training, inference serving at scale, and associated communication bottlenecks.
Knowledge of quantization, tensor/activation fusion, or memory optimization for inference.
Familiarity with infrastructure for deploying LLMs or transformer-based models, including sharding, pipelining, or hybrid parallelism.