Expoint - all jobs in one place

Microsoft Principal Site Reliability Engineer 
United States 
163840951

25.06.2024


We are looking for a self-driven Principal Site Reliability Engineer (SRE) who likes taking a data-driven, systems-based approach to solving Service Reliability problems. You will be responsible for building and optimizing solutions that analyze massive amounts of telemetry and other Service Health indicators in near real time, perform automated root cause analysis, and apply the mitigations needed to restore SLOs.

Required/Minimum Qualifications

  • 8+ years technical experience in software engineering, network engineering, or systems administration.
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration.
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration.
    • OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.
  • 6+ years of experience running large scale cloud services.
  • 3+ years of operational experience in improving Service Reliability, Availability and Performance.
  • 5+ years of hands-on experience in Python/Java/C#.

Other Requirements

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check:

  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred/Additional Qualifications

  • Understanding of Observability and MELT implementation patterns for large-scale services.
  • Experience in Logic Apps and authoring Jupyter Notebooks.
  • Experience in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of curiosity.
  • Ability to deal with the ambiguity associated with working in a fast-paced environment.
  • Influencing the product architecture and roadmap to make sure the customer-experienced supportability is always a key consideration when evolving the product.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until June 27th, 2024.

Responsibilities
  • Collaborating closely with engineering teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLOs, and averting incidents altogether when possible.
  • Collaborating with customers to understand their pain points around Supportability and SLO attainment, and formulating strategies for addressing recurring issues in a sustainable way.
  • Communicating on a deeply technical level and serving as the single point of contact for large enterprise customers when handling service escalations and driving issues to resolution.
  • Designing and implementing changes to service telemetry for the automation to consume when the required data is not already available.
  • Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
  • Analyzing data and providing operational insights into customer experience to Design and Product teams, so that features are designed with Supportability in mind.
  • Embody our and