Role Overview
We are looking for a talented and experienced manager with strong technical capabilities. You will play a crucial role in fostering a culture of operational excellence and continuous improvement.
In this role, you will:
- Be responsible for maintaining the reliability, performance, and availability of our software systems and infrastructure.
- Build a team of SREs (future request).
- Work closely with R&D, system administrators, and system architects to:
- Design and implement scalable and robust systems.
- Resolve operational issues.
- Ensure that NI systems meet both internal and external service level objectives (SLOs).
Key Responsibilities:
- System Design and Architecture:
- Design, build, and maintain scalable and reliable infrastructure and systems.
- Develop and implement system monitoring, alerting, and incident response procedures.
- Collaborate with development teams to design and implement software with reliability, scalability, and performance in mind.
- Monitoring and Performance:
- Set up and manage monitoring and alerting systems to proactively identify and address issues.
- Analyze system performance and implement optimizations to improve reliability and efficiency.
- Create dashboards and reports to visualize system performance and reliability metrics.
- Incident Management:
- Respond to and resolve operational incidents, including diagnosing and troubleshooting system issues.
- Participate in the on-call rotation to handle production incidents and emergencies.
- Conduct post-incident reviews and implement improvements to prevent recurrence.
- Automation and Optimization:
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Develop and maintain configuration management and deployment automation tools.
- Optimize system performance and resource utilization through scripting and automation.
- Capacity Planning and Scaling:
- Perform capacity planning and scaling to ensure that systems can handle current and future workloads.
- Implement strategies for load balancing, failover, and disaster recovery.
- Documentation and Reporting:
- Document processes and incident resolutions.
- Prepare and present reports on system performance, reliability, and improvement efforts to stakeholders.
- Security and Compliance:
- Ensure systems are secure and comply with relevant regulations and standards.
- Implement and manage security measures, including access controls and data protection.
- Collaboration and Communication:
- Collaborate with cross-functional teams, including software developers, operations, and product managers, to achieve common goals.
- Communicate effectively with both technical and non-technical stakeholders regarding system status, incidents, and improvements.
Qualifications
- Education: Bachelor’s degree in Computer Science, Engineering, or a related field; advanced degree preferred.
- Experience: Minimum of [5-7] years of experience in Site Reliability Engineering or a related role, with at least [2-4] years in a managerial or leadership position.
- Technical Skills:
- Strong knowledge of distributed systems, cloud platforms, and modern infrastructure technologies.
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
- Proficiency in programming/scripting languages (e.g., Python, Go, Bash).
- Understanding of incident management processes and root cause analysis.
- Leadership Skills: Proven experience in leading and managing teams, with strong mentoring and coaching abilities.
- Communication Skills: Excellent verbal and written communication skills, with the ability to effectively present technical information to diverse audiences.
- Problem-Solving: Strong analytical and problem-solving skills, with the ability to address complex technical challenges and drive solutions.
Preferred Qualifications:
- Certifications: Relevant certifications (e.g., AWS Certified Solutions Architect, Google Cloud Professional DevOps Engineer).
- Experience with CI/CD: Familiarity with continuous integration and continuous deployment (CI/CD) practices and tools.
- Understanding of DevOps methodologies and best practices.
Why Join Us?
- Impactful Work: Ensure the reliability and performance of critical systems and services.
- Innovative Culture: Work in an environment that values continuous improvement, innovation, and collaboration.
- Career Growth: Access professional development opportunities and career advancement within a growing organization.