Expoint - all jobs in one place

Nvidia Senior Manager - Storage Production Engineering SRE 
United States, California 
568519573

31.03.2024

As a Senior Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of large-scale production systems, with an emphasis on high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers bring specialized expertise in areas such as systems, networking, storage, coding, database management, capacity planning, and continuous delivery and deployment, along with proficiency in open-source cloud-enabling technologies such as Kubernetes, containers, and virtualization. You will oversee the implementation of reliable storage solutions, efficient data management, and the delivery of associated services to uphold the overall stability and performance of production systems.

What You Will Be Doing:

  • Leadership: Formulating and executing strategic initiatives to enhance the reliability and performance of storage systems, aligning with organizational goals.

  • Team Management: Leading and mentoring a team of Storage SRE professionals, fostering a collaborative and innovative work environment.

  • Cloud Storage Expertise: Supervising the planning, execution, and enhancement of storage solutions, encompassing file, block, and object storage, to meet the requirements of an expanding cloud infrastructure. Ensuring the efficient use of cloud-native storage services offered by platforms such as AWS S3 and Azure Blob Storage.

  • System Optimization: Collaborating with multi-functional teams to optimize storage systems, implement best practices, and ensure seamless integration with other technology stacks.

  • Incident Response: Overseeing incident response and resolution for storage-related issues, minimizing downtime, and ensuring a resilient storage environment.

  • Capacity Planning: Conducting capacity planning exercises and collaborating with team members to forecast and meet storage demands efficiently.

  • Automation and Tooling: Driving automation initiatives to streamline storage operations and developing tools for monitoring, alerting, and performance analysis.

  • Continuous Improvement: Implementing continuous improvement processes to enhance storage systems' overall reliability and efficiency.

What We Need To See:

  • Extensive experience in a senior-level role within Site Reliability Engineering, particularly in managing storage infrastructure.

  • Technical Expertise: In-depth knowledge of storage technologies, file systems, and experience with cloud-based storage solutions. Proficiency in scripting and automation tools is essential.

  • Leadership Skills: Strong leadership and people management skills, with the ability to inspire and guide a team towards achieving common objectives.

  • Problem-Solving Skills: Exceptional analytical and problem-solving skills, with the ability to address complex storage-related issues effectively.

  • Collaboration: Demonstrated ability to collaborate with multi-functional teams and communicate effectively with technical and non-technical collaborators.

  • Prior engineering experience with a hands-on coding background in storage systems.

  • Master's degree in Computer Science, Information Technology, or a related field, or equivalent experience.

  • 10+ years of overall relevant experience and 5+ years of management experience.

Ways to stand out from the crowd:

  • A demonstrated SRE mindset, with a customer-first approach and a passion for customer satisfaction and success.

  • Professional certifications in relevant technologies (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator). Experience with container orchestration platforms and software-defined storage solutions.

  • Proven track record of implementing and managing storage solutions in a large-scale, enterprise environment. You thrive in collaborative environments, enjoy working with various teams, and adapt readily to different working styles.

You will also be eligible for equity and .