In-depth technical experience in software engineering, network engineering, or systems administration
Operational experience in improving Service Reliability, Availability and Performance
Ability to deal with the ambiguity associated with working in a fast-paced environment
Systematic problem-solving approach, coupled with effective communication skills and a sense of curiosity
Expertise in analysing and troubleshooting incidents impacting large-scale distributed systems, and in automating their root cause analysis and mitigation
Ability to travel regularly to a customer site in the South East of the UK
PREFERRED QUALIFICATIONS
Prior HPC knowledge
Experience influencing product architecture and roadmap so that customer-experienced supportability remains a key consideration as the product evolves
Other Requirements
The ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
UK Baseline Personnel Security Standard
UK Security Clearance
Responsibilities
Collaborating closely with the existing SRE teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLOs and averting incidents altogether when possible.
Collaborating with customers to understand their pain points around Supportability and SLO attainment, and formulating strategies for addressing recurring issues in a sustainable way.
Communicating on a deeply technical level and acting as the single point of contact for a large enterprise customer, handling service escalations and driving issues to resolution.
Designing and implementing changes to service telemetry where the data the automation needs to consume is not already available.
Enhancing the customer-facing experience through proactive alerting based on utilisation, trends, resource health, etc.
Analysing data and providing operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.