Expoint – all jobs in one place
Finding a high-tech job at the best companies has never been easier

Red Hat Principal SRE - AI Platforms
India, Karnataka 
403227646

Yesterday

What will you do?

  • Work with live systems and write automation code

  • Design, build, and manage our large-scale infrastructure and platform services, including public cloud, private cloud, and datacenter-based environments

  • Automate cloud infrastructure using cloud technologies (e.g. auto scaling, load balancing), scripting (Bash, Python, and Golang), and monitoring and alerting solutions (e.g. Splunk, Splunk IM, Prometheus, Grafana, Catchpoint)

  • Design, develop, and become an expert in AI capabilities leveraging emerging industry standards

  • Break down complex engineering efforts into consumable chunks while working with teams to understand deliverables

  • Design and develop software such as Kubernetes operators, webhooks, and CLI tools

  • Implement and maintain intelligent infrastructure and application monitoring designed to enable application engineering teams

  • Ensure the production environment is operating in accordance with established procedures and best practices

  • Lead escalation support for high severity and critical platform-impacting events

  • Provide feedback on bugs and feature improvements to the various Red Hat Product Engineering teams

  • Design software tests and lead peer reviews to increase the quality of our codebase

  • Help develop peers’ capabilities through knowledge sharing, mentoring, and collaboration

  • Participate in a regular on-call schedule, supporting the operational needs of our tenants

  • Drive sustainable incident response and lead blameless postmortems

  • Work within a small agile team to develop and improve SRE methodologies, support your peers, plan and self-improve

What will you bring?

  • 5+ years of experience using cloud providers and technologies (Google, Azure, Amazon, OpenStack, etc.)

  • 4+ years of experience administering a Kubernetes-based production environment

  • 5+ years of experience with enterprise systems monitoring

  • 5+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef

  • 5+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred

  • 5+ years of experience delivering a hosted service

  • Demonstrated ability to quickly and accurately troubleshoot system issues, assess risks, and support teams through resolution

  • Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP

  • Demonstrated comfort with leading collaboration, open communication and reaching across functional and organizational boundaries

  • Passion for understanding and anticipating users’ needs and delivering outstanding user experiences

  • Independent problem-solving and self-direction; motivated to develop these skills in others

  • Desire to solve problems through collaboration as part of a global team

  • Experience leading teams working with Agile development methodologies

  • Bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required

  • Hands-on experience that demonstrates your ability and interest in Site Reliability Engineering

  • Experience programming in at least one of these languages: Python, Golang, Java, C, C++ or another object-oriented language

  • Experience working with public clouds such as AWS, GCP, or Azure

  • Ability to collaboratively troubleshoot and solve problems in a team setting

  • Experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.)

  • Experience working with complex distributed systems

  • Demonstrated ability to debug and optimize code and to automate routine tasks

  • Basic understanding of Unix/Linux operating systems