Design, develop, and operate highly available and scalable distributed systems.
Collaborate with development teams to implement best practices for CI/CD, infrastructure as code, automated testing, and security so that systems can meet scaling demands.
Troubleshoot and debug issues across the entire stack, including application code, networking, and infrastructure.
Build, maintain, and optimize monitoring and alerting solutions to ensure high availability and performance of services, drawing on established reliability methodologies (e.g., SLOs).
Automate repetitive tasks and processes, focusing on reliability and efficiency improvements.
Participate in on-call rotations and incident management processes to ensure rapid resolution of critical issues.
Contribute to team and organizational strategy, participating in architectural reviews and decision-making processes.
Experience: 3+ years of experience in designing, building, and operating reliable distributed systems.
Cloud Expertise: Hands-on experience with a cloud platform such as Google Cloud Platform (GCP) or Amazon Web Services (AWS).
Strong understanding of core Linux/UNIX operating system fundamentals and the TCP/IP network stack.
Experience operating Kubernetes clusters in production, with an understanding of how containers interact with network and system resources.
Monitoring & Logging: Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack, or similar) as well as how to instrument applications.
Programming Skills: Proficiency in at least one programming language (e.g., Golang, Python, Java, C/C++) and a scripting language (e.g., Bash), with a strong understanding of software development and debugging. Ability to read, understand, and contribute to source code.
Bachelor's degree in Computer Science, Electrical or Computer Engineering, or equivalent experience
Preferred Qualifications
Security: Knowledge of security best practices for cloud-based infrastructure.
DevOps Tools: Experience with deploying software to production and with implementing and managing CI/CD pipelines, infrastructure as code, and software release tooling. Familiarity with Helm and Helmfile is a plus.
Database experience: Familiarity with databases (e.g., PostgreSQL, Cassandra, Redis) is a plus.
Team Leadership: Prior experience leading a team of engineers is a plus.
Additional Requirements
A dedicated lifelong learner who is always looking for new things to learn and try.
A professional engineer who loves crafting, analysing and troubleshooting large software systems.
An excellent communicator who builds collaborative relationships with technical and non-technical stakeholders.
A tenacious problem solver with excellent analytical skills who sticks with a problem until it's resolved once and for all.
A great teammate, but you can work on your own initiative as well.
Someone who actively looks for ways to improve our services and takes personal ownership of the quality of the services we offer.
Someone who demonstrates personal accountability, owning the decisions and mistakes that you make.