Automation: Develop and maintain automation tools and scripts to streamline deployment, monitoring, and management of the infrastructure and applications.
Monitoring and Alerting: Set up and maintain monitoring and alerting systems to proactively identify and resolve issues before they impact customers or services. This includes participating in on-call rotations to respond promptly to high-priority incidents.
Performance Optimization: Identify opportunities for performance optimization and work with development teams to implement improvements.
Documentation: Maintain up-to-date documentation for the infrastructure, processes, and procedures.
Collaboration: Work closely with development teams, product managers, and other stakeholders to understand requirements and ensure the reliability of the platform.
Continuous Improvement: Participate in post-incident reviews, retrospectives, and other forums to identify areas for improvement and drive continuous improvement initiatives.
· Strong Linux systems engineering background with CentOS/RHEL or Debian, including experience building, maintaining, and troubleshooting these systems.
· Automation and Scripting: Strong scripting skills (e.g., Bash, Python) and experience with configuration management tools (e.g., Ansible, Chef, Puppet) to automate deployment and management tasks.
· Excellent Git skills (merging, branching, forking).
· Experience with Cloud Platforms: Strong experience with cloud platforms such as IBM Cloud, AWS, Azure, or Google Cloud Platform, including expertise in:
o Deploying and managing services in these environments.
o Managing and troubleshooting containerized applications.
· Troubleshooting and Problem Solving: Strong troubleshooting skills and the ability to quickly identify and resolve complex issues in a production environment, including experience with incident response and post-incident analysis.
· Container Orchestration: Proficiency in container orchestration tools such as Nomad or Kubernetes, including experience with HashiCorp Consul/Vault or equivalents.
· Monitoring and Logging: Experience with monitoring and logging tools (e.g., ELK stack, Grafana, Prometheus) to monitor the health and performance of infrastructure and applications, including experience building and maintaining these tools.
· Security: Knowledge of implementing security best practices and maintaining compliance standards (Center for Internet Security (CIS) Benchmarks, FedRAMP).
· Security: Ability to patch software or adjust configurations to mitigate Common Vulnerabilities and Exposures (CVE) in a timely fashion.
· Experience with clustered time series databases such as InfluxDB, and with distributed event streaming and metrics pipelines using Kafka and Telegraf.
· CI/CD: Experience with application deployment using CI/CD tools such as Jenkins and Tekton.
· Working knowledge of GitHub, JIRA, Confluence, and ServiceNow.