What will you be doing?
- Design, implement, and improve monitoring built on Grafana, Prometheus, Loki, Promtail, and Node Exporter.
- Log parsing and management.
- Configuration of alerting, including push notifications to VictorOps (now Splunk On-Call) and email notifications.
- Architect, design, and implement Icinga 2 monitoring and alerting.
- Ability to monitor system metrics and parse logs.
- Ability to automate tasks using Bash and/or Python scripting.
- Predictive monitoring of systems and applications.
- Familiarity with JVM internals and the use of JMX and REST for monitoring.
- Familiarity with AWS infrastructure.
- Deep understanding of Java applications, TLS, Apache.
- Automated performance checks of system metrics in Grafana.
- Automated performance checks of web applications.
- Problem-solving and troubleshooting, including performing root cause analysis to design preventative activities.
- Crafting and maintaining dashboards and reports, pulling together monitoring data across multiple platforms within the same tool as well as across multiple tools.
- Assisting with writing scripts and queries that can provide environment self-healing capabilities.
Have you got what it takes?
- Experience with using monitoring tools in a production environment.
- 5+ years of production cloud operations experience.
- 5+ years of expertise with the Linux command line.
- 5+ years of using Terraform in AWS for automation. Hands-on with automation and actively seeking out opportunities to automate manual processes.
- 5+ years of strong, hands-on experience building production services in AWS.
- 4+ years of experience with scripting using Python and Bash.
- Ability to participate in an on-call rotation.
- Considerable knowledge of IT equipment and diagnostic tools.
- Considerable knowledge of principles and techniques of systems analysis, design, development and programming.
- Considerable knowledge of principles of information systems.
- Considerable knowledge of capabilities of computer technology.
- Knowledge of methods and procedures used to conduct detailed analysis and design of computer systems.
- Knowledge of practices and issues of system security and disaster recovery.
- Knowledge of computer operating systems.
Tech Manager, Engineering
Individual Contributor