Expoint - all jobs in one place

Nvidia Service Reliability Operations Engineer 
India, Karnataka, Bengaluru 
688455433

01.12.2024

What you will be doing:

  • The team will provide 24/7 service under a follow-the-sun model that spans continents.

  • You will report directly to a manager in Bangalore.

  • Each team member will need to work either a Saturday or a Sunday each week. The hours worked may include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure that the combined US and India teams provide 24/7 coverage.

  • The heart of Mission Control will be monitoring and triaging a growing on-prem and CSP (Cloud Service Provider) production datacenter environment for compute and storage.

  • Every Mission Control team member will utilize alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and execute predictive support or diagnostic routines.

  • Perform Linux administration, network administration, and security incident monitoring tasks to drive your actions.

  • Mission Control team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.

  • Strong communication and interpersonal skills will help keep the team engaged through incident resolution, including initiating the incident management procedure.

What we need to see:

  • BS/BE degree in Computer Science or Electronics, or equivalent experience.

  • Minimum of 3 years’ experience administering open system servers in demanding Internet, Cloud, or Telecommunications production environments, in a Linux Systems Administration, DevOps, SRE, or NOC role.

  • Strong problem-solving, analytical, and troubleshooting abilities on Linux Clusters on public or private clouds.

  • Strong Linux administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc. RHCE or equivalent level of knowledge.

  • Experience scripting in Python and writing Ansible playbooks is preferred, but not required.

  • Knowledge and understanding of application containers, container orchestration systems, and Git workflows.

  • Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.

  • Demonstrated ability to master and maintain complicated environments.