Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier

McAfee Lead Site Reliability Engineer
India, Karnataka, Bengaluru 
621794115

13.12.2024
As a Site Reliability Engineer (SRE) Technical Lead, you will be instrumental in overseeing the reliability, availability, and performance of our production environments at an advanced level. You will lead initiatives in proactive monitoring and management of incidents, fostering a culture of rapid resolution and minimal service disruption. Your extensive troubleshooting, log data analysis and debugging skills will facilitate close collaboration with DevOps, Engineering, and internal support teams, allowing us to achieve the highest levels of customer satisfaction.


Key responsibilities include:

- Proficient in AWS (Amazon Web Services) cloud technology, with solid hands-on experience in major services such as ALB, NLB, Athena, VPC, EC2, RDS, and CloudWatch, including log query analysis in Athena.

- Effectively drive the APM monitoring solution POC, or bring solid hands-on experience setting up Prometheus and Grafana monitoring in microservice-based environments.

- Provide tailor-made monitoring solutions to support critical consumer-based environments by eliminating false positives, and script in PMQL or any monitoring programming language to automate monitoring capabilities.

- Lead efforts to troubleshoot, debug, and escalate issues through thorough, detailed CloudWatch log analysis, enhancing overall service availability and reliability.

- Provide detailed analysis by pulling various log metrics, and give in-depth insight into each issue by correlating it with trends and frequency patterns during web/API-based troubleshooting.

- Leverage your extensive experience in the AWS cloud computing platform, including EC2, S3, EBS, VPC, ELB, AMI, SNS, RDS, IAM, Route 53, and Auto Scaling, to drive service scalability and performance optimization.

- Independently analyze cost utilization across AWS services, propose optimizations, and see them through to proper implementation.

- Oversee the deployment of code updates across test and production environments, facilitating seamless rollouts of enhancements.

- Track and escalate all critical production issues through designated tracking applications, maintaining the integrity of service delivery.

- Manage GitHub pull requests, ensuring efficient triggering and analysis of pipelines for Kubernetes configuration changes.

- Spearhead root cause analyses for production incidents, implementing and advocating for long-term solutions to persistent challenges.

- Exhibit robust hands-on troubleshooting expertise with Kubernetes clusters, Pods, and services.

Key qualifications include:

- 8+ years of experience in the web and e-commerce domain, with a specific focus on cloud hosting (primarily AWS).

- Good hands-on experience with Prometheus/Grafana or any APM tool, including tweaking and tuning monitoring capabilities to meet requirements.

- A track record of innovative thinking and a willingness to propose and initiate significant service improvements based on data-driven analyses.

- Experience in developing automation tools or scripts that optimize processes and reduce manual intervention, and in leveraging Athena or CloudWatch Logs Insights queries.

- An adaptive mindset toward change, along with a strong interest in exploring and implementing the latest technologies.

- Self-motivated, results-oriented, and strategically adept, with an ability to drive meaningful improvements in service delivery.

We work hard to embrace diversity and inclusion and encourage everyone at McAfee to bring their authentic selves to work every day. We offer a variety of social programs, flexible work hours and family-friendly benefits to all of our employees.

  • Bonus Program
  • Pension and Retirement Plans
  • Medical, Dental and Vision Coverage
  • Paid Time Off
  • Paid Parental Leave
  • Support for Community Involvement