Expoint - all jobs in one place

NVIDIA Senior DevOps Engineer - AI Infrastructure
China, Shanghai 
986746840

12.08.2024
What you’ll be doing:
  • Collaborate with multiple AI product teams to understand their data and compute requirements (currently focused on autonomous vehicles)

  • Build infrastructure and tools that increase the productivity of teams developing AI-based systems (closed-loop data pipelines, labeling and training for deep learning, debugging and replay of autonomous vehicle issues, etc.)

  • Enable development teams by providing automated build and test solutions in simulation environments using cloud computing, Kubernetes, Docker, and physical deep learning machines

  • Maintain version control schemas to track development, staging, and production code using Git

  • Orchestrate the creation, deletion, and upgrade of live systems using maintenance windows, HA failover, and immutable infrastructure patterns

  • Work with multiple teams and domain experts to integrate multiple NVIDIA products into the CI workflow

  • Automate sophisticated tasks and improve the efficiency of functional automated tests

  • Be part of an on-call rotation to support production systems, respond to incidents promptly, conduct root cause analysis of outages, and implement preventive measures

What we need to see:
  • BS/MS with 4+ years of experience

  • Solid technical foundation in automation, cloud infrastructure, and orchestration, including experience with at least one orchestration system (Kubernetes, Swarm, Mesos, Marathon, Aurora, etc.)

  • Experience with microservices and ETL jobs

  • Experience with cloud automation tools (Ansible, Terraform, etc.)

  • Excellent understanding of AWS: EC2, S3, RDS, ECS, CloudFront, VPC, or equivalents in Aliyun, Tencent Cloud, etc.

  • CI/CD: Jenkins, GitHub, GitLab, etc.

  • Programming: Go, Python, Bash

  • Linux: Debian package management, Docker, systemd

  • Networking: Linux firewall, PXE, NFS, ZFS, CIFS

  • Understanding of observability instrumentation techniques and standard methodologies, including Prometheus, Grafana, OpenTelemetry, and logging systems

Ways to stand out from the crowd:
  • A phenomenal teammate who loves working in a team environment

  • Experience at tier-1 autonomous vehicle companies, automating and accelerating the closed loop of data-driven AV development

  • Fluent English