Expoint - all jobs in one place

Microsoft Senior Site Reliability Engineer - CTJ Top Secret 
United States, Washington 
361015303

16.07.2024

As a Senior Site Reliability Engineer, you will be instrumental in defining operating models for deploying and managing systems within sovereign and air-gapped environments. This role offers the unique opportunity to collaborate with engineers dedicated to enabling a wide range of Azure services for both internal and external customers in highly secured and regulated industries. The systems, processes, and frameworks you develop will be essential in meeting the stringent security policy and assurance requirements of our diverse customer base in the public and private sectors.

Minimum/Required Qualifications:

  • 6+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Other Requirements:

Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:

  • The successful candidate must have an active U.S. Government Top Secret Security Clearance. The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. Failure to maintain or obtain the appropriate clearance and/or meet customer screening requirements may result in employment action up to and including termination.
  • Clearance Verification: This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
  • Criminal Justice Information Services: This position requires passing a background check conducted through the CJIS criminal justice information system by authorized local, state, and/or federal agencies and across multiple states. This role requires candidates to maintain CJIS screening eligibility.
  • Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.

Preferred/Additional Qualifications:

  • 3+ years of experience with PowerShell, C#, or C++.
  • Experience working on large-scale distributed services with on-call responsibilities.
  • Ability to build and influence broadly towards common goals and priorities.
  • Ownership for end-to-end project lifecycle with solid project management and communication skills.
  • Experience applying Site Reliability Engineering (SRE) principles in a large production environment.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until July 23, 2024.


Responsibilities

The scale of our operations is enormous. We need people who enjoy analyzing complicated problems, coming up with creative solutions, and working in focused teams to build things no one has thought of before, all in the service of production reliability.

  • Defines and develops standardized, repeatable, scalable solutions to guarantee quality and efficient operations. Drives the design, optimization, efficiency, and reliability of service management.
  • Communicates on a deeply technical level with software engineers, project management, and operations teams to improve and optimize products, improve infrastructure, reduce manual toil, and evolve services.
  • Drives efforts to collect, classify, and analyze data on a range of metrics. Drives the refinement of products through data analytics and makes informed decisions in engineering products through data integration.
  • Drives efforts to integrate instrumentation for gathering telemetry data on system behavior such as performance, reliability, availability, and usage. Drives sustaining feedback loops from telemetry resulting in subsequent designs. Creates outputs of telemetry such as notifications or dashboards.
  • Applies debugging tools and examines logs, telemetry, and other methods to verify assumptions through writing and developing code proactively before issues occur and reactively as issues occur for products. Conducts retrospective debugging of solutions to identify root causes of problems. Reviews and writes issue postmortems and shares insights with the team.
  • Builds, enhances, reuses, contributes to, and identifies new software developer tools/processes to support other programs and applications to create, debug, and maintain code for products. Uses open source when appropriate. Begins to develop skills in other tools/topics outside areas of expertise. Identifies internal tools and/or creates tools that will be useful for creating the product, determining if methods are still applicable for the current solution. Shares best practices and teaches others about new tools and strategies.
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions. Alerts stakeholders as to status and initiates actions to restore system/product/service for simple problems and complex problems when appropriate. Responds within Service Level Agreement (SLA) timeframe. Drives efforts to reduce incident and request volumes, looking globally at incidents and providing broad resolutions. Escalates issues to appropriate owners.
  • Ability to meet on-call responsibilities periodically to support 24x7 operations.

Embody our