So, what’s the role all about?
- Run the production environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large distributed software applications
How will you make an impact?
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives
Have you got what it takes?
- 3-6 years of work experience in a similar role, with a focus on systems engineering, automation, and reliability.
- Proficiency in at least one programming language (e.g., Python, Go, Java, C#) and experience with scripting languages (e.g., Bash, PowerShell).
- Deep understanding of cloud computing platforms (e.g., AWS), including the operational and reliability constraints of prominent services (e.g., EC2, ECS, Lambda, DynamoDB).
- Experience with infrastructure as code tools such as CloudFormation or Terraform.
- Deep understanding of CI/CD concepts and experience with CI/CD tools such as Jenkins, GitLab CI/CD, or CircleCI.
- Strong knowledge of containerization technologies (e.g., Docker, Kubernetes) and microservices architecture.
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, CloudWatch).
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
- Experience with incident management and blameless postmortems, including driving incident response during outages and other critical incidents, through to resolution and communication in a cross-functional team setting.
You will have an advantage if you also have:
- Hands-on experience working with large Kubernetes clusters. Certification is a plus.
- Working experience with the Grafana observability suite (Loki, Mimir, Tempo).
- Administration and/or development experience with standard monitoring and automation tools such as Splunk, Datadog, PagerDuty, and Rundeck.
- Familiarity with configuration management tools like Ansible, Puppet, or Chef.
- Certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or equivalent.
Personal attributes:
- Strong communication skills and the ability to collaborate effectively with cross-functional teams.
- Team player with the ability to work well in a close-knit team environment.
- Fast learner with the ability to self-educate on relevant technologies.
- Ability to multitask and prioritize work.
- Ability to remain focused and calm under pressure.
Requisition ID: 6725.
Reporting into: Director, Network Operations.
Role Type: Individual Contributor.