Expoint - all jobs in one place

Microsoft Senior Site Reliability Engineer 
United States, Washington 
723209312

10.09.2024

Microsoft is looking for a Senior Site Reliability Engineer (SRE) to support and expand Viva Engage. Viva Engage (formerly Yammer) is the industry-defining social network for the enterprise.

is responsible for keeping the services reliable as we scale and modernize our tech stack. We need an SRE who knows how to manage the conflicting priorities of keeping things running today while making sure we have the architecture we need for the future.

technology, outsized individual impact - with the advantages of working for one of the most successful software companies in the world. We believe in mission-driven work, and in this post-COVID world our platform has become more indispensable than ever as it fosters connection and a sense of belonging among remote teams.

Required Qualifications:

  • 6+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration

Other Requirements:

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:

  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
  • Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local (or applicable country) government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport.


Preferred Qualifications:

  • Experience applying SRE principles in a large production environment.
  • Proficiency in cloud computing platforms (e.g., AWS, Azure, GCP) and related services (e.g., EC2, S3, VPC, IAM, Lambda).
  • Expertise in automation tools and frameworks (e.g., Terraform, Ansible, Chef, Puppet) and scripting languages (e.g., Python, Bash).
  • Deep understanding of containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack) and incident response processes.
  • Problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
  • Effective communication and collaboration skills, with the ability to work well in a cross-functional team environment.

Certain roles may be eligible for benefits and other compensation.

Responsibilities:

  • Participate in on-call rotations and incident responses throughout product development and operations cycles. On-call will require responding to support requests after normal business hours, including weekends and/or holidays, in a designated Microsoft office.
  • Monitor system performance and proactively identify and resolve issues to ensure high availability and performance.
  • Develop and maintain automation tools and processes for deployment, monitoring, and configuration management.
  • Apply troubleshooting skills and debugging tools, and examine logs, telemetry, and other sources, to verify assumptions and customer impact. Address findings proactively and reactively with customer and/or service engineering, communicating efficiently in writing and verbally.
  • Lead blameless postmortems to identify root causes and improve production resiliency.
  • Consult with developers to design services that scale in Azure.
  • Mentor team members and contribute to the overall growth and development of the SRE team.
  • Stay current with industry trends, emerging technologies, and best practices in site reliability engineering and cloud computing.

Embody our