Expoint - all jobs in one place

Nvidia Compute Cluster SRE Engineer GPU - HPC 
India, Karnataka, Bengaluru 
908074932

24.06.2024

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research. Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model, GPU deep learning. This new model - where deep neural networks are trained to recognize patterns from massive amounts of data - has proven to be deeply effective at solving some of the most complex problems in everyday life.

What you will be doing:

  • Design, implement and support large-scale infrastructure with the monitoring, logging, and alerting needed to meet promised uptime.

  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, and capacity management.

  • Provide the best possible support for user issues.

  • Maintain infrastructure and services once they are live by measuring and monitoring availability, latency, and overall system health.

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice sustainable incident response and blameless postmortems.

  • Understand complex and vast infrastructure and support it during on-call weeks.

  • Work with subject-matter experts (SMEs) to deliver quality resolutions to customers' production issues.

What we need to see:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent.

  • 4+ years of hands-on industry experience in the above-mentioned areas.

  • Must have experience with Linux system administration (Ubuntu, CentOS/Red Hat).

  • Must have experience setting up and administering HPC cluster schedulers such as Slurm and/or LSF.

  • Experience in one or more of the following: Python, Perl, Bash.

  • Good understanding of open-source IT automation tools like Ansible.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

  • Ability to debug and optimize code and automate routine tasks.

Ways to stand out from the crowd:

  • Experience with Bright Cluster Manager (BCM).

  • Understanding of InfiniBand or Ethernet concepts.

  • Experience with high-speed storage solutions such as Lustre or GPFS.

  • Experience with MPI and PyTorch.