Expoint - all jobs in one place

NVIDIA Senior Distributed Systems Engineer, AI Infrastructure
China, Shanghai 
518306141

01.12.2024

NVIDIA is hiring a senior data and distributed systems engineer to architect, lead, and develop our exascale AI infrastructure and deep learning platform for Autonomous Vehicles. You will need strong programming skills and a deep understanding of cloud technologies, distributed storage and compute systems, and distributed systems architecture. You will also need excellent communication and planning skills. Ideally, you have experience securing distributed systems or are willing to learn. Finally, you will need technical leadership skills. Together, we will build the exascale Software 2.0 cloud platform for one of the most ambitious problems of our time: autonomous vehicles. Then we will apply it to other applications such as medical imaging, data science, genomics, and more.

What you'll be doing:

  • Architect and build scalable and distributed services that will help power the AI infrastructure for deep learning platforms.

  • Design and build infrastructure and microservices that help index, mine, transform, and compose petabyte-sized deep learning datasets.

  • Design the next generation of dataset management services for real and synthetic / simulated datasets.

  • Enable smart data selection, one of the key ingredients for successful machine learning.

  • Collaborate with multiple AI teams to understand their requirements and build a future-proof platform that improves their productivity.

  • Be a technical leader on various projects across the platform, and a major contributor to the entire platform’s architecture.

  • Support users of the platform.

What we need to see:

  • BS, MS, or PhD in Computer Architecture, Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • 5+ years of work or research experience in distributed systems development and design.

  • Strong programming background covering data structures, design patterns, OOP, and test-driven development.

  • Proven technical foundation in distributed computing and storage, including significant experience with most of the following: server systems, storage, I/O, networking, and systems software.

  • Hands-on experience in, or willingness to learn about, authentication and authorization and related technologies such as OIDC, TLS, AWS IAM, role-based access control, attribute-based access control, and Open Policy Agent.

  • Advanced programming skills to build distributed storage and compute systems, backend services, microservices, and web technologies.

  • Expert-level programming skills in Go, Java, or C/C++.

  • Ability to switch effectively between long-term strategic and near-term tactical topics.

  • Highly motivated, with strong interpersonal skills and the ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.

  • A track record of successful technical leadership and large-scale architecture that impacted critical projects.

Ways to stand out from the crowd:

  • Experience building MLOps or AI/ML solutions on-premise or in the cloud.

  • Hands-on experience in, or willingness to learn about, security topics such as secure design, secure coding, data protection, zero trust networks, and incident response management.

  • Advanced programming expertise in Scala or Python.

  • Experience with Kubernetes and Docker as well as open source contributions.

  • A proactive demeanor to investigate and understand technical requirements.