
Nvidia Senior Service Reliability Operations Administrator 
United States, Texas 
290524210

Locations: US, CA, Santa Clara; US, Remote
Time type: Full time
Posted: 2 days ago

What you will be doing:

  • The team provides its services 24/7 in a follow-the-sun model that spans continents. You will report directly to a manager in the United States.

  • Some CIS shifts require working either a Saturday or a Sunday each week. Hours may include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure that the US and India teams together provide 24/7 coverage.

  • Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.

  • Perform systems administration, network administration, and security incident monitoring tasks to drive our actions.

  • CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.

  • Help discover incidents and issues, and initiate the incident management procedure. Bring in subject matter authorities or service owners as needed to resolve issues. Your feedback will help us continually improve our service.

  • Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort.

  • May perform other tasks that will help us provide extraordinary service levels for our customers.

What we need to see:

  • 5+ years of experience administering open system servers in a production environment, and 3+ years of experience working in demanding Internet, cloud, or telecommunications environments in a systems administration, DevOps, SRE, or NOC role.

  • B.S. in relevant disciplines or equivalent experience.

  • Expertise using monitoring tools and problem ticketing systems.

  • Strong problem-solving, analytical, and troubleshooting abilities.

  • Strong server administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables. RHCE or an equivalent level of knowledge.

  • Experience scripting in Python preferred, but not required. Prior experience running virtual machines under open source or commercial hypervisors. Experience operating services running on public or private clouds.

  • Knowledge and understanding of application containers and container orchestration systems. Basic understanding of Git.

  • Experience performing system administration tasks using Ansible. Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.

  • Demonstrated ability to master and maintain complicated environments.

You will also be eligible for equity and benefits.