What you'll do...
Principal Site Reliability Engineer:
This position is responsible for the operation of a department. An individual in this position will be expected to perform additional job-related responsibilities and duties as assigned or as necessary.
Performance and Optimization: Requires knowledge of: Unix/Linux performance optimization and tuning; Java/NodeJS/Tomcat/Apache tuning and optimization; open-source chaos tools (for example, Openblade, Chaos Monkey, Pumba, Chaos Mesh, Litmus, Chaos Toolkit, ToxiProxy). To select appropriate reliability models to evaluate and estimate complex reliability parameters. Designs and develops a reliability program plan for a complex site environment. Facilitates reliability testing procedures. Ensures reliability testing procedures align with site environment changes.
Solution Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product (MVP); Non-Functional Requirements; Telemetry. To create simple, modular, extensible, and functional designs in adherence to the requirements for multiple products/solutions within a domain.
Understand customer requirements and analyze the gaps between the existing architecture and those requirements. Analyze system performance across the complete product against non-functional requirements such as reliability, operability, performance efficiency, and security.
Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Non-Functional Requirements; Security standards, frameworks, and methodologies (for example, System Security Plan (SSP) and Security Risk and Compliance Review (SRCR)).
To assist in the creation of simple, modular, extensible, and functional designs for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a system based on the business requirements. Convert the high-level design (HLD) into detailed designs for specific modules/components of a product/system. Understand the nuances of designing for disaster recovery. Undertake infrastructure coding automation.
Coding: Requires knowledge of: Coding standards and guidelines; Coding languages (e.g. JavaScript, Python, C#), frameworks (e.g. ActiveX, .NET, Cocoa, Android application framework), tools (e.g. Monday.com, Linx, Embold), and platforms (e.g. Microsoft Azure, AWS, Apple iOS); Quality, Safety, and Security standards (PCI, etc.); Emerging tools and technologies; Telemetry.
To create/configure minimalistic code for an entire component/application and ensure the components meet business/technical requirements, non-functional requirements, and maintainability, high-availability, and high-scalability needs.
Independently implement telemetry features as required. Ensure security policy requirements are properly applied to components/applications during code development/configuration.
Triaging and Troubleshooting: Requires knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA). To guide team members in RCA and RCCA to identify the origins of, and prevent, defects and performance gaps. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies.
Assesses immediate restoration versus root cause based on consequences and resource requirements. Analyzes the issues and plans a series of steps to enhance an application's availability and reliability, potentially including reconfiguration, integration, removal, or the addition of application components. Analyzes trends to proactively prevent incidents and provide historical summary reports.
Disaster Recovery Planning: Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To coordinate partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for assigned area of responsibility. Creates and tests plans for operating in a remote back-up environment. Coordinates the day-to-day activities of control measures used in recovery plans.
Monitoring and Alerting: Requires knowledge of: Monitoring and alerting tools; Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic.
To establish metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems/services. Sets alerting priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviewing system and application logs, and verifying the completion of scheduled jobs.
Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable. For information about PTO, see
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms. For information about benefits and eligibility, see
Bellevue, Washington US-11075: The annual salary range for this position is $132,000.00-$264,000.00
Sunnyvale, California US-04396: The annual salary range for this position is $143,000.00-$286,000.00
Additional compensation includes annual or quarterly performance bonuses.
Additional compensation for certain positions may also include:
- Stock
Minimum Qualifications...