Expoint - all jobs in one place

Red Hat Site Reliability Engineer - AI Platforms
India, Karnataka
54614245

17.04.2025

Job Description

What will you do:

  • Work with live systems and write automation code

  • Build and manage our large-scale infrastructure and platform services, including public cloud, private cloud, and datacenter-based environments

  • Automate cloud infrastructure using technologies such as auto scaling and load balancing, scripting (Bash, Python, and Golang), and monitoring and alerting solutions (e.g. Splunk, Splunk IM, Prometheus, Grafana, Catchpoint)

  • Design, develop, and become expert in AI capabilities leveraging emerging industry standards

  • Participate in the design and development of software such as Kubernetes operators, webhooks, and CLI tools

  • Implement and maintain intelligent infrastructure and application monitoring designed to enable application engineering teams

  • Ensure the production environment is operating in accordance with established procedures and best practices

  • Provide escalation support for high severity and critical platform-impacting events

  • Provide feedback around bugs and feature improvements to the various Red Hat Product Engineering teams

  • Contribute software tests and participate in peer review to increase the quality of our codebase

  • Help and develop peers’ capabilities through knowledge sharing, mentoring, and collaboration

  • Participate in a regular on-call schedule, supporting the operational needs of our tenants

  • Practice sustainable incident response and blameless postmortems

  • Work within a small agile team to develop and improve SRE methodologies, support your peers, plan and self-improve

What will you bring:

  • 3+ years of experience using cloud providers and technologies (Google, Azure, Amazon, OpenStack, etc.)

  • 1+ years of experience administering a Kubernetes-based production environment

  • 2+ years of experience with enterprise systems monitoring

  • 2+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef

  • 2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred

  • 2+ years of experience delivering a hosted service

  • Demonstrated ability to quickly and accurately troubleshoot system issues

  • Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP

  • Demonstrated comfort with collaboration, open communication and reaching across functional boundaries

  • Passion for understanding users’ needs and delivering outstanding user experiences

  • Independent problem-solving and self-direction

  • Works well alone and as part of a global team

  • Experience working with Agile development methodologies

  • Bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required

  • Hands-on experience that demonstrates your ability and interest in Site Reliability Engineering

  • Experience programming in at least one of these languages: Python, Golang, Java, C, C++ or another object-oriented language

  • Experience working with public clouds such as AWS, GCP, or Azure

  • Ability to collaboratively troubleshoot and solve problems in a team setting

  • Experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.) and some experience working with complex distributed systems

  • Demonstrated ability to debug and optimize code and to automate routine tasks

  • Basic understanding of Unix/Linux operating systems