Nvidia HPC AI Software Architect

Locations: Switzerland (Vaud); Switzerland (Zurich); Switzerland (Remote); Poland (Remote); UK (Remote); Germany (Remote)
Time type: Full time
Posted: 4 days ago
Job requisition ID: 325198767

What you will be doing:

  • Design and prototype scalable software systems that optimize distributed AI training and inference—focusing on throughput, latency, and memory efficiency.

  • Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads (a minimal all-reduce sketch follows this list).

  • Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.

  • Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.

  • Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.
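
To make the first two bullets concrete: the collective primitive at the heart of this work is all-reduce. Below is a minimal sketch of an NCCL-backed all-reduce through PyTorch's public torch.distributed API; the launch command, script name, and tensor shapes are illustrative assumptions, not part of the role description.

    # Minimal sketch: the all-reduce collective that underlies distributed
    # data-parallel training. Assumes a multi-GPU host and a launch such as
    #   torchrun --nproc_per_node=<N> allreduce_sketch.py   (hypothetical name)
    import torch
    import torch.distributed as dist

    def main():
        # NCCL is the standard backend for GPU-to-GPU collectives.
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank % torch.cuda.device_count())

        # Each rank holds a local gradient; all-reduce sums them in place,
        # so every rank ends up with the same global gradient.
        grad = torch.full((1024,), float(rank), device="cuda")
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)

        if rank == 0:
            print(f"world_size={dist.get_world_size()}, grad[0]={grad[0].item()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()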

What we need to see:

  • Ph.D. or equivalent industry experience in computer science, computer engineering, or a closely related field.

  • 2+ years of experience in systems programming, parallel or distributed computing, or high-performance data movement.

  • Strong programming background in C++, Python, and ideally CUDA or other GPU programming models.

  • Practical experience with AI frameworks (e.g., PyTorch, TensorFlow) and familiarity with how they use communication libraries under the hood.

  • Experience in designing or optimizing software for high-throughput, low-latency systems (see the overlap sketch after this list).

  • Strong collaboration skills in a multi-national, interdisciplinary environment.
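
One concrete flavor of high-throughput, low-latency work is hiding communication latency behind compute. The sketch below uses torch.distributed's asynchronous collectives for that overlap; the function name and tensor shapes are illustrative assumptions, and an NCCL process group (as in the earlier sketch) is assumed to be initialized.

    import torch
    import torch.distributed as dist

    # Classic latency-hiding pattern: launch a collective asynchronously,
    # overlap it with independent compute, and block only when the
    # reduced values are actually needed.
    def overlapped_step(grad: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
        # async_op=True returns a work handle instead of blocking; the
        # NCCL kernel proceeds concurrently with the compute below.
        handle = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)

        # Independent work that does not depend on the reduced gradient.
        hidden = torch.relu(activations @ activations.T)

        # Synchronize only at the point of actual use.
        handle.wait()
        return hidden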

Ways to stand out from the crowd:

  • Expertise with NCCL, Gloo, UCX, or similar libraries used in distributed AI workloads.

  • Background in networking and communication protocols, RDMA, collective communications, or accelerator-aware networking.

  • Deep understanding of large model training, inference serving at scale, and associated communication bottlenecks.

  • Knowledge of quantization, tensor/activation fusion, or memory optimization for inference (a toy quantization sketch follows this list).

  • Familiarity with infrastructure for deployment of LLMs or transformer-based models, including sharding, pipelining, or hybrid parallelism.
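
As a toy illustration of the quantization bullet, here is symmetric per-tensor INT8 quantization, one of the simplest memory optimizations for inference. Production stacks are far more elaborate; everything below (shapes, thresholds, function names) is an assumption for illustration only.

    import torch

    # Toy sketch: symmetric per-tensor INT8 quantization, trading a little
    # precision for a 4x reduction in weight memory and bandwidth.
    def quantize_int8(x: torch.Tensor):
        # Scale so the largest magnitude maps to 127.
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * scale

    weights = torch.randn(4096, 4096)        # fp32: 64 MiB
    q, scale = quantize_int8(weights)        # int8: 16 MiB
    err = (dequantize_int8(q, scale) - weights).abs().mean()
    print(f"mean abs reconstruction error: {err:.5f}")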