Cognyte Site Reliability Engineering Manager 
Romania, Bucharest 
447668947

30.08.2024

Role Overview

We are looking for a talented and experienced manager with strong technical capabilities. You will play a crucial role in fostering a culture of operational excellence and continuous improvement.

In this role, you will:

  1. Be responsible for maintaining the reliability, performance, and availability of our software systems and infrastructure.
  2. Build a team of SREs (future request).
  3. Work closely with R&D, system administrators, and system architects to design and implement scalable and robust systems, resolve operational issues, and ensure that the NI system meets both internal and external service level objectives (SLOs).

Key Responsibilities:

  1. System Design and Architecture:
  • Design, build, and maintain scalable and reliable infrastructure and systems.
  • Develop and implement system monitoring, alerting, and incident response procedures.
  • Collaborate with development teams to design and implement software with reliability, scalability, and performance in mind.
  2. Monitoring and Performance:
  • Set up and manage monitoring and alerting systems to proactively identify and address issues.
  • Analyze system performance and implement optimizations to improve reliability and efficiency.
  • Create dashboards and reports to visualize system performance and reliability metrics.
  3. Incident Management:
  • Respond to and resolve operational incidents, including diagnosing and troubleshooting system issues.
  • Participate in on-call rotation to handle production incidents and emergencies.
  • Conduct post-incident reviews and implement improvements to prevent recurrence.
  4. Automation and Optimization:
  • Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
  • Develop and maintain configuration management and deployment automation tools.
  • Optimize system performance and resource utilization through scripting and automation.
  5. Capacity Planning and Scaling:
  • Perform capacity planning and scaling to ensure that systems can handle current and future workloads.
  • Implement strategies for load balancing, failover, and disaster recovery.
  6. Documentation and Reporting:
  • Document processes and incident resolutions.
  • Prepare and present reports on system performance, reliability, and improvement efforts to stakeholders.
  7. Security and Compliance:
  • Ensure systems are secure and comply with relevant regulations and standards.
  • Implement and manage security measures, including access controls and data protection.
  8. Collaboration and Communication:
  • Collaborate with cross-functional teams, including software developers, operations, and product managers, to achieve common goals.
  • Communicate effectively with both technical and non-technical stakeholders regarding system status, incidents, and improvements.

Qualifications

  • Education: Bachelor’s degree in Computer Science, Engineering, or a related field; advanced degree preferred.
  • Experience: Minimum of 5-7 years of experience in Site Reliability Engineering or a related role, with at least 2-4 years in a managerial or leadership position.
  • Technical Skills:
      • Strong knowledge of distributed systems, cloud platforms, and modern infrastructure technologies.
      • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
      • Proficiency in programming/scripting languages (e.g., Python, Go, Bash).
      • Understanding of incident management processes and root cause analysis.
  • Leadership Skills: Proven experience in leading and managing teams, with strong mentoring and coaching abilities.
  • Communication Skills: Excellent verbal and written communication skills, with the ability to effectively present technical information to diverse audiences.
  • Problem-Solving: Strong analytical and problem-solving skills, with the ability to address complex technical challenges and drive solutions.

Preferred Qualifications:

  • Certifications: Relevant certifications (e.g., AWS Certified Solutions Architect, Google Cloud Professional DevOps Engineer).
  • Experience with CI/CD: Familiarity with continuous integration and continuous deployment (CI/CD) practices and tools.
  • Understanding of DevOps methodologies and best practices.


Why Join Us?

  • Impactful Work: Ensure the reliability and performance of critical systems and services.
  • Innovative Culture: Work in an environment that values continuous improvement, innovation, and collaboration.
  • Career Growth: Access professional development opportunities and career advancement within a growing organization.
