Microsoft Senior Site Reliability Engineer - O365 Exchange Online
United States, Georgia, Atlanta


looking for a Senior Site Reliability Engineer - O365 Exchange Online with the right mix of systems engineering, software development, on-line services

Our approach is layered and precise. By implementing proactive engineering solutions, we [...] and tackle incidents head-on, ensuring limited disruptions. Monitoring, both comprehensive and nuanced,

The Future – Artificial Intelligence (AI) & Machine Learning (ML) in Focus

[...] early stages of integrating predictive analytics to [...] issues before they manifest, allowing us to stay a step ahead. Customized ML models are being developed to intelligently sift through vast data lakes, [...] about redefining reliability, precision, and the user experience in the M365 suite.

Required Qualifications:

  • 6+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Other Qualifications:

to meet Microsoft, customer and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check:

      • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • 6+ years’ experience troubleshooting, investigating, and fixing production issues in large scale cloud and/or hosted environments.
  • 4+ years’ experience with building infrastructure using Microsoft Azure technology.
  • 5+ years’ experience writing programs leveraging a major cloud service (C++, C# or Node.js) including experience with algorithms, data structures, and software design.
  • Familiarity with core machine learning concepts, including infrastructure and open-source options (e.g., compute systems - GPU & FPGA, AI/ML frameworks - TensorFlow, MLflow, JAX & PyTorch, tools - Jupyter notebooks & VS Code, etc.).
  • Familiarity with the Microsoft Azure cloud as well as technologies such as Azure ML, Microsoft’s Cognitive Services, Azure OpenAI, or Azure Cognitive Search or similar experience with another cloud platform.
  • Familiarity using Large Language Models and Generative AI to solve real-world problems.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until May 19, 2024.

Technical Knowledge and Domain-Specific Expertise

  • Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies; identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance. Drives the adoption of new solutions across engineering teams working with related products within an organization and provides guidance and coaching to others on relevant topics.
  • Experience working with all service aspects of high throughput and multi-tenant services, ability to understand and design workflows carefully, properly handle errors, write clean and well-factored code with demonstrated testing and maintainability.

Contributions to Development and Design

  • Engages with product engineering teams by driving code/design reviews, hosting regular meetings, and participating in on-call rotations and incident responses throughout product development and operations cycles; leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention.

Driving Operational Excellence

  • Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization.
  • Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale. Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations, and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization.
  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams, owners, and leadership to issues with major customer/business impact and escalates resolution of the highly complex, ambiguous, and impactful issues to include other engineering teams and/or subject matter experts as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings.
  • Shares insights and best practices that can be applied to improve development and operations across related sets of systems, platforms, and/or products. Continues to develop their understanding of insights and best practices through interactions with more experienced Site Reliability Engineers (SREs) and members of product engineering teams. Mentors and coaches less experienced engineers to help them identify and propose relevant solutions.
  • Serves as a point of contact and trusted advisor, interacting with customers and other external stakeholders as a spokesperson for customer confidence or escalation calls and for the Support process for incident management, including quality control of Root Cause Analyses (RCAs).
  • Embody our