Expoint - all jobs in one place

Apple Site Reliability Engineer iCloud 
United Kingdom, England, London 
104155586

01.06.2024
Description
We are looking for an SRE with experience building and supporting machine learning (ML) infrastructure. You will apply SRE best practices to ensure the availability, reliability, and performance of our ML systems and services. You will engage regularly with our development partners and product teams so that the ML services are well aligned with business needs.
Responsibilities will include:
  • Support and maintain ML services by measuring and monitoring availability, latency, and overall system health.
  • Deploy and support existing and new ML models and infrastructure.
  • Provide insights to partner stakeholders through log and telemetry analysis.
  • Maintain documentation and automate manual processes where possible.
Key Qualifications
  • Experience with large scale distributed systems. Experience with ML infrastructure services, including LLMs, Generative AI, and transformers desired.
  • In-depth knowledge of one or more of core operating system principles, networking fundamentals, and systems management.
  • Demonstrable advanced experience in at least one of Java, Python, Swift, Rust, or Go, and in building distributed services/applications.
  • Awareness of key security principles including encryption and keys (types and exchange protocols).
  • Thorough understanding of SRE principles including monitoring, alerting, error budgets, fault analysis, and automation.
  • Strong sense of ownership with a desire to communicate and collaborate with other engineers and teams.
  • Experience in mentoring and developing more junior engineers.
Education & Experience
BS degree in Computer Science or equivalent field.