Your Role and Responsibilities
container orchestration (Kubernetes), distributed ML workloads, network services, storage
layers, and petabyte-scale AWS storage and Kafka stream stack.
Responsibilities:
- Develop and maintain scalable distributed systems in AWS
- Develop and maintain high-performance Kubernetes clusters across multiple regions
- Develop and maintain telemetry infrastructure and service instrumentation (Python) for metrics, distributed tracing, and logging
- Support infrastructure for a petabyte-scale data platform and stream analysis services
- Work with Audio and Speech AI Engineers to accelerate development and deployment of heterogeneous analysis and distributed training pipelines
- Participate in defining and managing SLIs, SLOs, and error budgets for infrastructure and production services
- Design and implement infrastructure-as-code pipelines
Required Technical and Professional Expertise
- AWS experience designing, implementing, and supporting cloud-based infrastructure
- Experience architecting, deploying, and supporting Kubernetes in cloud environments
- Experience designing and supporting distributed systems
- Experience writing production code in one or more languages such as Python (preferred), Java, or Go in a microservices environment
- Experience configuring, supporting, and optimizing Linux systems
Preferred Technical and Professional Expertise
- Familiarity with running distributed ML workloads in cluster-orchestrated environments
- Experience building and supporting telemetry and related infrastructure (OpenTelemetry, Jaeger, Grafana, Prometheus)
- Experience designing and implementing infrastructure-as-code pipelines
- Pub/sub experience (Kafka, SQS, SNS, MQTT)
- Experience designing and implementing traffic routing strategies in edge and microservices environments