Expoint - all jobs in one place


Microsoft Senior Site Reliability Engineer Manager 
Romania, Bucharest 
232909418

25.06.2024


Required Qualifications:

  • Technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND technical experience in software engineering, network engineering, or systems administration

Other Requirements:

  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • Technical experience in software engineering, network engineering, or systems administration
    • OR Doctorate Degree in Computer Science, Information Technology, or related field
  • Technical experience working with large-scale cloud or distributed systems
  • People management experience

Responsibilities:

  • Demonstrates end-to-end expertise in distributed systems design, interactions between cloud technology layers and components, functions of physical network devices, and dependencies at scale. Drives efforts within an organization to identify and recommend optimal configurations of cloud technology solutions and develops or modifies the code base that defines infrastructures to improve the reliability and operability of supported products.
  • Develops end-to-end technical expertise in the architecture, code, features, and operations of specific products as required to implement improvements in product availability, reliability, efficiency, observability, and/or performance. Drives code/design reviews with the engineering teams that develop and/or manage those products and shares learnings and recommendations across engineering teams working on related products within their organization.
  • Researches and maintains deep knowledge of industry trends and advances in large-scale distributed systems and cloud technologies; manages efforts to research, develop, implement, and optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve the availability, reliability, efficiency, observability, and/or performance of their team's supported products. Monitors the implementation of new tools, technologies, and processes as well as their impact on reliability, efficiency, observability, and/or performance to make recommendations for broader adoption within an organization.
  • Manages partnerships between Site Reliability Engineering (SRE) and product engineering teams to identify and implement changes to the code base to improve availability, reliability, efficiency, observability, and performance of related sets of products within an organization. Reviews and provides feedback on recommendations provided by SREs and ensures they have the technical expertise and data to justify and gain buy-in for their recommendations from product teams and owners.
  • Drives, and contributes to, the development of automation tools to reliably automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale within an organization; reviews existing and newly developed automation tools to evaluate and provide feedback on reusability, extendibility, and scalability. Ensures automation tools and systems developed within an organization are tested and the impact of their deployments is monitored.
  • Oversees a team of Site Reliability Engineers (SREs) using existing tools and/or models to identify contributing factors and points of failure affecting availability, reliability, performance, and/or efficiency of systems, platforms, and/or products; provides guidance, recommendations, and feedback to SREs to help them troubleshoot problems and to identify and test scalable solutions that can prevent the occurrence of similar issues in related products within their organization.
  • Participates in on-call rotations and manages teams of Site Reliability Engineers (SREs) responding to incidents during regular on-call rotations to identify the level of impact, troubleshoot issues, and deploy appropriate fixes to resolve root cause(s) and prevent recurrence across related products. Ensures that SREs within an organization have the technical knowledge and resources required to respond to incidents, that relevant engineering teams, stakeholders, and leaders are alerted to customer-impacting issues, that major issues are escalated to other teams as needed, and that key details related to incidents and their resolution are shared through post-mortem reports and during regular review meetings.