They will use observability and monitoring tools to detect and resolve issues affecting the user experience
The engineer will also be responsible for automating alerting and remediation processes to reduce mean time to resolution (MTTR) and improve system uptime
Use the Splunk query language to monitor database connection health with Splunk DB Connect health dashboards, and work with log parsing, complex Splunk searches (including external table lookups), and Splunk data flow, components, features, and product capabilities.
Observability: Implement comprehensive monitoring and alerting solutions using GCP monitoring services and external services
Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
Build efficient tooling that lowers the barrier to entry for engineering teams to plug in and benefit from observability-focused reliability practices.
Configure dashboards, alerts, and notifications to ensure timely identification and resolution of issues.
Troubleshoot issues and outages, working closely with development and operations teams to identify root causes and develop solutions
Monitor server, network infrastructure, and application performance metrics, and identify patterns and trends to improve system performance and reliability
Develop and integrate tools for logging, monitoring, and alerting to enhance visibility into system performance
Participate in strategic planning for the technology roadmap, including scalability, cost-effectiveness, and risk management considerations related to observability infrastructure
BACKGROUND REQUIREMENTS
6+ years of SRE observability engineering experience
6+ years of experience in observability best practices working with Dynatrace or similar tools (New Relic, Datadog, AppDynamics, or other APM suites), delivering solutions across all environments and integrating platforms and applications with monitoring and APM tools.
Knowledge of CI/CD tools such as Puppet, Jenkins, Terraform, and Ansible
Minimum of 4 to 5 years of working experience with OpenShift and Docker/Kubernetes
Proficiency in implementing monitoring and observability solutions using GCP monitoring services such as Cloud Monitoring, Logging, and Tracing
Deep understanding of IT infrastructure monitoring and observability best practices
Experience gathering and organizing large amounts of data for instrumentation in an enterprise monitoring solution.
Experience with recommending baseline monitoring thresholds and performance monitoring KPIs and SLAs
4+ years of experience developing Grafana dashboards and standardizing metrics and monitoring (metrics, collection, and dashboards); Grafana experience is a must
3-5 years of experience with SQL and familiarity with at least one managed Kubernetes platform (EKS, AKS, GKE)
Strong background in software engineering, with expertise in relevant programming languages (like Python, Java, Go) and cloud platforms (like AWS, GCP, Azure)
Experience with container orchestration tools like Kubernetes
COMPETENCIES AND SKILLS
Strong interpersonal and organizational skills
Strong verbal and written skills
Attention to detail
Excellent time management
Extraordinary teamwork and collaboration skills (Own Working Together)