Design, develop, and maintain observability solutions and tools, including monitoring, logging, and alerting systems, to improve system visibility and performance
Create dashboards and automated alerts using tools such as Grafana, Prometheus, Splunk, and Catchpoint to strengthen monitoring frameworks and enable proactive issue detection and resolution
Analyze system metrics and logs to identify bottlenecks, optimize application performance, and ensure end-to-end reliability as systems scale
Partner with developers, DevOps engineers, and AI Infra teams to integrate observability best practices into the development and deployment lifecycle
Assist in troubleshooting and resolving production issues by leveraging observability data to identify root causes and implement preventative measures
Develop scripts or workflows to automate routine tasks and improve observability tool integrations
Create and maintain documentation for observability tools, processes, and workflows to ensure knowledge sharing and accessibility
What You’ll Bring
3+ years of experience in software engineering, DevOps, or SRE roles with a focus on observability or monitoring
Proficiency in monitoring and visualization tools (e.g., Prometheus, Grafana, Splunk, Catchpoint)
Strong analytical and troubleshooting skills with a focus on system performance and reliability
Working knowledge of high-performance computing (HPC), Slurm, GPU architecture, and networking
Working knowledge of logging systems and distributed tracing frameworks such as OpenTelemetry
Expertise in scripting languages (e.g., Python, Bash) and familiarity with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible)
Experience with containerized environments (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, Azure)
Excellent verbal and written communication skills, with the ability to collaborate effectively across teams
Bachelor’s Degree in Computer Science, Software Engineering, or a related field, or equivalent experience