THE ROLE:
We are looking for an experienced Cloud Ops/DevOps Engineer responsible for securing and monitoring our cloud products and infrastructure. The role includes:
- Troubleshooting and incident analysis.
- Proactive system monitoring and end-to-end service operations.
- Lead initiatives to automate repetitive tasks, deployments, monitoring, and other operational processes.
- Identify opportunities for cost optimization and performance improvement in cloud environments.
- Develop and enforce strategies to minimize downtime, ensuring high availability and resilience of cloud services.
- Manage incidents, ensuring rapid resolution and minimal impact on business operations.
- Conduct root cause analysis (RCA) for critical incidents and implement preventive measures.
- Manage and mentor a team of cloud engineers and operations staff, providing guidance and support.
- Monitor and analyze performance metrics, KPIs, and SLAs (service level agreements) to identify areas for improvement.
- Implement feedback loops to incorporate lessons learned from incidents and operational activities.
- Stay abreast of the latest cloud technologies, trends, and best practices.
- Consistently drive root cause analysis and identify areas for improvement.
- Actively participate in the continuous optimization of existing procedures and processes.
- Drive for automation and standardization.
- Ensure smooth operations and maximize uptime.
- Collaborate with engineering and product management as well as other service groups.
EDUCATION AND QUALIFICATIONS / SKILLS AND COMPETENCIES:
- Bachelor's degree, preferably in Computer Science or Engineering.
- Experience and knowledge in Cloud technologies and Cloud Services (Amazon Web Services, Google Cloud Platform, Microsoft Azure).
- Expertise in analyzing and troubleshooting large-scale distributed systems and databases.
- Knowledge of container solutions such as Kubernetes and Docker.
- Basic understanding of monitoring solutions such as Grafana, Kibana, and Dynatrace.
- Familiarity with the ITIL concepts of Incident Management, Change Management, and Root Cause Analysis.
- Ability to troubleshoot complex problems throughout the whole technology stack.
- Knowledge of Converged Infrastructure or SAP Converged Cloud Infrastructure.
- Datacenter/system migration experience; knowledge of HANA DB and ERP is an asset.
- Basic knowledge of CI/CD tools (especially in container-oriented environments).
- Proven experience with Kubernetes and with datacenter setup, configuration, and migration.