Your Role and Responsibilities
- Automation: Develop and maintain automation tools and scripts to streamline deployment, monitoring, and management of the infrastructure and applications.
- Monitoring and Alerting: Set up and maintain monitoring and alerting systems to proactively identify and resolve issues before they impact customers or services. This includes participating in on-call rotations to respond promptly to high-priority incidents.
- Performance Optimization: Identify opportunities for performance optimization and work with development teams to implement improvements.
- Documentation: Maintain up-to-date documentation for the infrastructure, processes, and procedures.
- Collaboration: Work closely with development teams, product managers, and other stakeholders to understand requirements and ensure the reliability of the platform.
- Continuous Improvement: Participate in post-incident reviews, retrospectives, and other forums to identify areas for improvement and drive continuous improvement initiatives.
Required Technical and Professional Expertise
- Strong Linux systems engineering background with CentOS/RHEL or Debian, including experience building, maintaining, and troubleshooting these systems.
- Automation and Scripting: Strong scripting skills (e.g., Bash, Python) and experience with configuration management tools (e.g., Ansible, Chef, Puppet) to automate deployment and management tasks.
- Excellent Git skills (merging, branching, forking).
- Experience with Cloud Platforms: Strong experience with cloud platforms such as IBM Cloud, AWS, Azure, or Google Cloud Platform, including expertise in:
  - Deploying and managing services in these environments.
  - Managing and troubleshooting containerized applications.
- Troubleshooting and Problem Solving: Strong troubleshooting skills and the ability to quickly identify and resolve complex issues in a production environment, including experience with incident response and post-incident analysis.
Preferred Technical and Professional Expertise
- DevOps Culture: Experience working in a DevOps culture, including a strong understanding of how development and operations teams collaborate to achieve business goals.
- Container Orchestration: Proficiency in container orchestration tools such as Nomad or Kubernetes, including experience with HashiCorp Consul/Vault or equivalents.
- Monitoring and Logging: Experience with monitoring and logging tools (e.g., ELK stack, Grafana, Prometheus) to monitor the health and performance of infrastructure and applications, including experience building and maintaining these tools.
- Security: Knowledge of implementing security best practices and maintaining compliance standards (e.g., Center for Internet Security (CIS) Benchmarks, FedRAMP).
- Security: Ability to patch software or adjust configurations to mitigate Common Vulnerabilities and Exposures (CVEs) in a timely fashion.
- Experience with clustered time series databases such as InfluxDB and with distributed event streaming platforms such as Kafka, along with data collection using Telegraf.
- CI/CD: Experience with application deployment using CI/CD tools such as Jenkins and Tekton.
- Working knowledge of GitHub, JIRA, Confluence, and ServiceNow.