
Microsoft Site Reliability Engineer II - CTJ Top Secret 
Taiwan, Taoyuan City 
180998207

17.07.2025


The Site Reliability Engineering (SRE) team provides leadership, direction and accountability for application architecture, system design, and end-to-end implementation. As a Site Reliability Engineer, you will identify and deliver software improvements using your expertise in software development, complexity analysis, and scalable system design. Strong collaboration skills will be required to work closely with other engineering teams to ensure services/systems are highly stable and performant, meeting the expectations of our government customers and users. The right candidate for this job is excited about making better software and continuously improving the development, integration, and deployment processes.

Required/Minimum Qualifications:

  • Master's Degree in Computer Science, Information Technology, or related field
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 1+ years of technical experience in software engineering, network engineering, or systems administration
    • OR 4+ years of technical experience in software engineering, network engineering, or systems administration

Other Requirements:

  • Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Candidates must have an active Top Secret clearance and be willing to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance. The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination.
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
    • Clearance Verification: This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.
  • Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.

Preferred/Additional Qualifications:

  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ years of technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years of technical experience in software engineering, network engineering, or systems administration
    • OR 5+ years of technical experience in software engineering, network engineering, or systems administration

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Technical Knowledge and Domain-Specific Expertise

  • Expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the reliability and operability of supported products with minimal guidance from other engineers.
  • Develops an understanding of the code, features, and operations of specific products at scale as required to contribute to incremental improvements in product availability, reliability, efficiency, observability, and/or performance; participates in on-boarding, code/design reviews, and regular meetings with the engineering teams that develop and/or manage those products.
  • Researches and maintains an awareness of industry trends, advances in distributed systems and cloud technologies, new tools, and/or processes for maintaining and improving product availability, reliability, efficiency, observability, and/or performance. Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems.

Contributions to Development and Design

  • Leverages technical expertise in large scale distributed systems and specific products, as well as objective insights drawn from analyses of production telemetry data to suggest changes or add-ons to product features or code to improve the availability, reliability, efficiency, observability, and performance of product components or features supported by their team.
  • Develops and tests basic changes to optimize code and improve the observability, reliability, and operability of a defined range of platform, system, or product components or features with direction from other engineers.
  • Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles; leverages technical expertise on underlying systems/platforms and insights drawn from engagements with product engineering teams and telemetry analyses to propose potential improvements in code base and designs across components and features of one or more products.

Driving Operational Excellence

  • Independently develops code or scripts that automate the performance of repetitive and easily scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
  • Leverages technical expertise and telemetry analysis across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
  • Identifies opportunities to leverage existing tools and automation to enable product engineering teams to increase the velocity with which they can reliably and safely implement changes in production; monitors the effects of changes across multiple components or features within a single platform or system.
  • Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations; monitors the impact of changes on operations metrics (e.g., Time-to-X).
  • Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, reliability, performance, and/or efficiency of components and features; proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.
  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams and owners to major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.
  • Develops alerts and instrumentation across components and features to monitor product capacity and resource demands and analyze telemetry data using existing capacity planning models; draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters.
  • Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required; models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions.
  • Shares insights and best practices that can be applied to improve development and operations of system, platform, or product components and features by participating


Additional Duties

  • Design, develop, and deliver the required software engineering to serve and protect O365 government clouds.
  • Own deployment, availability, reliability, performance and customer escalation targets for sovereign environments.
  • Proactively identify and reduce issues through design, testing, and implementation of software-based solutions.
  • Collaborate with Engineering and Program Management partners to translate customer, business, and technical requirements into architectural designs and feature releases.
  • Drive efficiencies through software improvement and root cause analysis resulting in service delivery, maturity, and scalability.
  • Work within a highly skilled team of engineers to deliver revolutionary improvements to the cloud and scale them.
  • Embody our