* Optimize LLMs for Inference: Implement and enhance large language models for real-time and batch inference, balancing performance and resource efficiency.
* Advanced Inference Optimization: Apply techniques such as quantization and speculative decoding to reduce model size and accelerate inference with minimal accuracy loss. Leverage quantization-aware training (QAT) and post-training quantization (PTQ) to deploy models on resource-constrained hardware (see the sketches after this list).
* Cross-Functional Collaboration: Partner with data scientists, ML researchers, and infrastructure engineering teams to understand model requirements, provide feedback, and ensure smooth deployment of models into production.
* Monitoring & Resource Management: Implement monitoring tools to profile and track the performance of models running on GPUs, including real-time tracking of GPU utilization, memory usage, and inference throughput (a monitoring sketch follows the list). Manage and optimize resource allocation to ensure high availability and minimal downtime.
* Continuous Improvement & R&D: Stay current with the latest research in LLM inference techniques, GPU optimizations, and distributed systems, and bring innovative improvements to the overall system.
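As a concrete illustration of the PTQ path mentioned above, here is a minimal sketch of post-training dynamic quantization in PyTorch. The stand-in model, layer sizes, and qint8 target are illustrative assumptions, not a prescribed configuration.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a loaded LLM checkpoint.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Dynamic PTQ: Linear weights are stored as int8 and activations are
# quantized on the fly at inference time; no calibration data is needed.
quantized = torch.quantization.quantize_dynamic(
    model,              # model to quantize
    {nn.Linear},        # module types to replace with quantized versions
    dtype=torch.qint8,  # target weight dtype
)

# int8 weights cut Linear storage roughly 4x versus fp32.
with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 768])
```

Dynamic quantization is the lowest-friction PTQ variant; static PTQ (with calibration data) or QAT would be the next steps when activation quantization error matters.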
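Speculative decoding can likewise be sketched in a simplified greedy form: a cheap draft model proposes k tokens, the target model verifies them, and the longest matching prefix is accepted. The `Model` callables below are hypothetical stand-ins; a real system would verify all k draft positions in a single batched target forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token sequence to its greedy
# next token.
Model = Callable[[List[int]], int]

def speculative_decode_greedy(
    draft: Model, target: Model, prompt: List[int], k: int, max_new: int
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. Target model checks each proposed position.
        accepted = 0
        for i in range(k):
            expected = target(tokens + proposed[:i])
            if expected == proposed[i]:
                accepted += 1
            else:
                # 3. On first mismatch, keep the target's own token instead.
                tokens.extend(proposed[:accepted])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposed)  # all k draft tokens accepted
    return tokens[: len(prompt) + max_new]

# Toy demo: both "models" emit (last token + 1) mod 100, so every draft
# token is accepted and the output matches plain greedy decoding.
toy = lambda seq: (seq[-1] + 1) % 100
print(speculative_decode_greedy(toy, toy, [0], k=4, max_new=8))
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The greedy variant shown here reproduces the target model's output exactly; the sampling variant uses probabilistic accept/reject tests instead of exact token matches.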
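For the monitoring bullet, a minimal sketch using NVIDIA's NVML Python bindings (pynvml), assuming an NVIDIA GPU and the pynvml package are available; the device index and 1-second polling interval are illustrative. Note that inference throughput (e.g., tokens/s) is typically measured at the serving layer rather than via NVML.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust as needed

try:
    for _ in range(5):  # illustrative: poll a few times at 1 s intervals
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        print(
            f"gpu={util.gpu}% mem_bw={util.memory}% "
            f"mem_used={mem.used / 2**20:.0f}MiB / {mem.total / 2**20:.0f}MiB"
        )
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```

In production these samples would usually be exported to a metrics system (e.g., Prometheus) rather than printed, so that utilization and memory trends can drive alerting and autoscaling.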