Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier


Nvidia Senior Manager Test Engineering 
Vietnam, Binh Thuan province, Phan Thiết City 
739899994

18.03.2025

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.

  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.

  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.

  • Implement monitoring and alerting systems to proactively identify and address issues.

  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.

  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.

  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), with 5+ years of equivalent practical experience.

  • Proven experience in storage system administration and site reliability engineering.

  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.

  • Experience in one or more of the following languages: Ansible, Bash, Python, Go, YAML, Java.

  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.

  • Experience in using observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

  • Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.

  • Strong Linux and network troubleshooting skills, using a variety of commands and tools.

  • A demonstrated SRE mindset, a customer-first approach, and a passion for ensuring customer satisfaction and success.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.