Expoint - all jobs in one place

NetApp Site Reliability Engineer 
United States, North Carolina 
566220410

13.08.2024
Job Summary

As a Site Reliability Engineer, you will be operating at the intersection of development and operations. Your role will involve engaging in and enhancing the lifecycle of cloud services - from design through deployment, operation, and refinement. You will be responsible for maintaining these services by measuring and monitoring their availability, latency, and overall system health.

You will play a crucial role in sustainably scaling systems through automation and driving changes that improve reliability and velocity. As part of your responsibilities, you will administer cloud-based environments that support our SaaS/IaaS offerings, which are implemented on a microservices, container-based architecture (Kubernetes).

In addition, you will oversee a portfolio of customer-centric cloud services (SaaS/IaaS), ensuring their overall availability, performance, and security. You will work closely with both NetApp and cloud service provider teams, including those from Azure, located across the globe in regions such as RTP, Reykjavík, Bangalore, Sunnyvale, Redmond, and more.

Job Requirements
  • Incident Response and Troubleshooting: Respond to complex live production incidents and cross-platform issues involving OS, Networking, and Database in cloud-based SaaS/IaaS environments, and perform root cause analysis (RCA) on them. Implement SRE best practices for effective resolution.
  • Analysis and Infrastructure Maintenance: Continuously monitor, analyze, and measure system health, availability, and latency using tools like Prometheus, Grafana, and others. Develop strategies to enhance system and application performance, availability, and reliability. In addition, maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure.
  • Automation and Efficiency: Identify tasks and areas where automation can save time and reduce risk. Roll out these improvements using deployment automation.
  • Issue Tracking and Resolution: Use Atlassian Jira, Azure DevOps, and related Incident Management tooling to track and resolve issues based on their priority.
  • Document system knowledge as you acquire it, create runbooks, and ensure critical system information is readily accessible.
  • Team Collaboration and Influence: Work in tandem with other Cloud Infrastructure Engineers and developers to ensure maximum performance, reliability, and automation of our deployments and infrastructure. Additionally, consult and influence developers on new feature development and software architecture to ensure scalability.
  • Debugging, Troubleshooting, and Advanced Support: Undertake debugging and troubleshooting of service bottlenecks throughout the entire software stack. Additionally, provide advanced tier 2 and 3 support for NetApp's First Party Cloud solutions.
  • Security Management: Stay updated with security protocols and proactively identify, diagnose, and resolve complex security issues.
  • Directly influence the decisions and outcomes related to solution implementation: measure and monitor availability, latency, and overall system health.
  • Demonstrated experience in scripting or coding languages such as Python, PowerShell, C#, or Go.
  • Deep working knowledge of containers, Kubernetes, and serverless computing implementations.
  • Familiarity with DevOps development methodologies.
  • Experience with cloud platforms such as Azure, AWS, or Google Cloud.

Typically requires a minimum of 8 years of related experience.


If you want to help us build knowledge and solve big problems, let's talk.