Cisco Senior Site Reliability Engineer 
Portugal 
497179279

27.01.2025
Key Responsibilities
  • Identify and provide solutions to common obstacles hindering operational excellence across engineering teams.
  • Partner with application developers using cloud-native tools to address novel challenges around scale, performance, and reliability.
  • Generalize and standardize solutions and processes to enable repeated success across our microservice-based multi-region platform.
  • Play a key role in improving the reliability of the ThousandEyes platform through scale testing, additional environments, and close work with application teams.
  • Use cloud-native observability and reliability tools such as Prometheus, Istio, and ArgoCD.
  • Manage a rapidly growing infrastructure capable of handling substantial daily data volumes, emphasizing operations/infrastructure/everything as code.
What You’ll Do
  • Collaborate with software engineers to ensure architecture and services are optimized for availability, latency, and performance.
  • Design and implement scalable operations tooling to support platform growth and scaling across multiple regions.
  • Design, deploy, and maintain AWS cloud-native services that are elastic and resilient to failure.
  • Participate in and improve our 24x7 incident response and on-call rotation.
  • Use and expand our existing CNCF solutions, such as Kubernetes, service mesh, Prometheus, OpenTelemetry, and ArgoCD, to increase platform reliability.
  • Automate production operations to provide guardrails and continuous platform operation.
  • Develop automation solutions for scalable service and platform operations, including deployment, scale testing, graceful failure, and chaos testing.
  • Stay current on industry best practices for scalability and reliability, and apply them to the ThousandEyes platform.
Required Qualifications
  • Expert-level knowledge of Kubernetes and its ecosystem.
  • Proficiency in software development with languages such as Python or Go.
  • In-depth knowledge of cloud providers, preferably AWS.
  • Proven ability to build and implement scalable and well-tested solutions.
  • Strong understanding of Unix/Linux systems, including kernel, system libraries, file systems, and client-server protocols.
  • Knowledge of Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, and SLOs.
  • Excellent communication and documentation skills.
  • Strong sense of ownership, drive, and attention to detail.
Preferred Qualifications
  • Familiarity with best practices for operating a large-scale, highly available enterprise platform.
  • 5+ years of experience in a related role.