Perform full lifecycle DevOps activities (design, develop, test, implement, maintain) for new and existing observability tools/platforms in an enterprise environment with on-premises and cloud-integrated systems/applications/infrastructure.
In support of Observability Platform proofs of concept, pilots, and production implementations, gather functional/business requirements, establish success criteria, and develop use cases.
Proactively design telemetry strategies to gain real-time insights and identify potential issues before they escalate.
Apply knowledge and experience to the following:
Telemetry data collection, analysis, and implementation to derive meaningful insights from different sources including metrics, events, logs, and traces.
Distributed systems including those with microservices and hybrid infrastructure (cloud/on-premises) to effectively design telemetry pipelines, build monitoring systems, and implement observability practices.
Identify patterns, detect anomalies, troubleshoot incidents, and build a holistic understanding of system/application/infrastructure behavior to optimize resource allocation, enhance user experience, and support compliance and security requirements.
Collaborate across observability domains, including Infrastructure, Applications (APM), and Networking, and use cross-functional skills to close gaps between them.
Assist in the development of backup and recovery methodologies for observability platforms.
Work with the Team Lead and/or Observability Project Manager to prioritize efforts and meet deliverable timelines; participate in briefing program leadership and liaise with government customers and other stakeholders.
Leverage approved systems for incident/change management (ServiceNow), work items (Jazz/Jira), documentation (Confluence), and others.
Qualifications:
Five plus (5+) years of relevant experience.
Three plus (3+) years of observability platform/tools experience.
Subject matter expertise in telemetry.
Familiarity with integration architecture methods (RESTful, RPC).
Familiarity with Java or Python programming and/or code debugging/testing.
Familiarity with software development lifecycle.
Proficiency in scripting languages (Bash).
Experience with underlying databases used by observability platforms.
Experience utilizing container technologies like Kubernetes, Docker, or similar.
Solid understanding of networking concepts, protocols, and troubleshooting techniques.
Experience using logging and metrics tools (Prometheus, Grafana, Elastic/Kibana) for debugging.
Proficiency in production cloud infrastructure (AWS, GCP, or Azure).
Desirable Expertise or Qualifications:
Expertise in one or more of the following observability platforms: Cisco AppDynamics, Datadog, Dynatrace, Splunk, or others.
Experience with database, web server, and operating system technologies such as SQL, Oracle DB, Apache/IIS, and RHEL/OL.
Bachelor’s degree in a STEM field.
Experience with Infrastructure as Code tools for provisioning infrastructure, such as Terraform, CloudFormation, or similar.