Expoint - all jobs in one place

Nvidia Senior SRE Engineer 
United States, California 
329910941

12.08.2024

What you’ll be doing:

  • Administer Kubernetes for DevOps and CI/CD: design and implement clusters, cluster segmentation, and internal/external networking for multiple clusters and environments.

  • Manage and administer multiple Jenkins instances, keeping Jenkins and its plugins healthy.

  • Monitor critical metrics and make sure the team’s SLAs are met.

  • Play a critical role in ensuring that our platform is easy to use, reliable, scalable, and resistant to disruptions.

  • Drive automation of monitoring to gain more insight into applications and system health.

  • Craft and develop tools needed for automating workflows.

  • Develop, improve, and maintain our infrastructure codebase.

  • Provide high-quality user support.

  • Craft and implement critical metrics using various analytics methods and dashboards.

  • Take part in prototyping, crafting, and developing cloud infrastructure for NVIDIA.

  • Reuse AI techniques to extract useful signals about machines and jobs from the data generated.

What we need to see:

  • Kubernetes domain expertise, with extensive experience building scalable, resilient platforms in both public and private cloud, and the ability to provide platform engineering and architecture standard methodologies (including experience architecting and implementing the overall platform, orchestration, security, and monitoring ecosystem).

  • High proficiency in administering and configuring Kubernetes.

  • Proficiency in administering and maintaining Jenkins. Experience with scaling Jenkins and maintaining plugin compliance.

  • Experience with data analytics/visualization tools like Kibana, Grafana, Splunk, etc.

  • Strong Ansible skills. Experience with other configuration tools like Chef and Puppet is also good to have.

  • Proficiency with source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce, etc.

  • Knowledge of monitoring systems such as Zabbix, Alertmanager, PagerDuty, and/or similar systems.

  • Well versed in Prometheus, writing custom exporters and PromQL.

  • 5+ years of proven experience.

  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out from the crowd:

  • Previous experience with SRE teams handling on-prem infrastructure.

  • Experience handling NVIDIA hardware like GPUs and Tegras.

  • Solid understanding of containerization and microservices architecture.

  • Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS), and Certified Kubernetes Application Developer (CKAD) certifications preferred.

  • Outstanding interpersonal skills and communication with all levels of management.

You will also be eligible for equity.