* Design, implement, and maintain distributed systems to build world-class ML platforms and products at scale
* Experiment with, deploy, and manage LLMs in a production context
* Benchmark and optimize inference deployments for different workloads, e.g. online, batch, and streaming
* Diagnose, fix, and automate the resolution of complex issues across the entire stack to ensure maximum uptime and performance
* Design and extend services to improve the functionality and reliability of the platform
* Monitor system performance, optimize for cost and efficiency, and resolve any issues that arise
* Build relationships with stakeholders across the organization to better understand internal customer needs and improve our product for end users