Basic Qualifications:
Master’s degree or above (or equivalent experience) in Computer Science, Engineering, Mathematics, Physics, or a related field.
Strong programming skills with hands-on experience in managing large-scale data and machine learning pipelines.
Deep understanding of open-source ML frameworks such as PyTorch, vLLM, and TensorRT-LLM (TRT-LLM).
Solid knowledge of model optimization techniques, including quantization, pruning, and efficient inference.
Preferred Qualifications:
1+ years of experience optimizing LLM inference using frameworks like vLLM or TRT-LLM.
Practical experience in model compression and deployment within production systems.
Experience designing agentic AI systems, such as multi-agent orchestration, tool usage, planning, and reasoning.
Model Optimization & Deployment:
Design and implement efficient workflows for training, distilling, and fine-tuning Small Language Models (SLMs) and Large Language Models (LLMs), leveraging techniques such as LoRA, QLoRA, and instruction tuning.
Apply model compression strategies—including quantization (e.g., GPTQ, AWQ) and pruning—to reduce inference costs and improve latency.
Optimize LLM inference performance using frameworks like vLLM and TensorRT-LLM (TRT-LLM) to enable scalable, low-latency deployment.
Build robust and scalable inference systems tailored to heterogeneous production environments, with a strong focus on performance, cost-efficiency, and stability.
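To make the quantization duties above concrete: the core idea behind int8 weight quantization (which schemes like GPTQ and AWQ build on) is mapping float weights onto a small integer grid with a per-tensor scale. The following is a minimal, framework-agnostic sketch of symmetric (absmax) int8 quantization, not the implementation used by any of the named libraries:

```python
import numpy as np

def absmax_quantize_int8(w):
    # Symmetric (absmax) quantization: the scale maps max |w| to 127,
    # so every weight lands in the int8 range [-127, 127].
    scale = np.max(np.abs(w)) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from int8 codes.
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9, -0.07], dtype=np.float32)
q, scale = absmax_quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding keeps the per-weight reconstruction error within half a step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Production schemes refine this with per-channel or per-group scales and calibration data to minimize the accuracy loss, which is where the cost and latency reductions mentioned above come from.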