Responsibilities:
- Work on a cutting-edge ML inference framework project and optimize code for efficient, scalable ML inference using distributed compute strategies such as data, tensor, pipeline, and expert parallelism.
- Develop kernel- and compiler-level optimizations and perform in-depth analysis to ensure the best possible performance across server hardware families.
- Apply advanced model optimization techniques, including speculation, quantization, and compression, to maximize throughput and minimize latency.
- Analyze and improve performance metrics such as end-to-end latency, TTFT (time to first token), TBOT (time between output tokens), memory footprint, and compute efficiency.
- Implement features of the Metal device backend for ML training acceleration technologies.