Develop and deploy solutions to scale our infrastructure effectively in response to rapidly growing demands
Drive implementation of best practices and monitoring systems to proactively detect and address issues in our production environment
Work across the stack on tools and infrastructure empowering the machine learning team to be effective. This ranges from developing/running model training and evaluation code to back-end infrastructure to occasional front-end work
Coordinate required resources with the team managing the cluster hardware to maintain high availability
Work closely with the research team to understand requirements and priorities.
What You’ll Bring
Expertise in designing scalable and durable distributed systems
Strong knowledge of Python/Go and Linux
Experience working with diverse backend infrastructure components (SQL / NoSQL databases, caching, message brokers, event streams, monitoring etc)
Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes) and setting up CI/CD flows
Knowledge of front-end development in React / strong product sense
Knowledge of machine learning, computer vision, or neural networks