KEY RESPONSIBILITIES:
- Design and build scalable infrastructure for fine-tuning and deploying large language models.
- Develop and optimize inference pipelines using popular frameworks and engines (e.g., TensorRT, vLLM, Triton Inference Server).
- Implement observability solutions for model performance, latency, throughput, GPU/TPU utilization, and memory efficiency.
- Own the end-to-end lifecycle of LLMs in production, from experimentation through continuous integration and continuous deployment (CI/CD).
- Automate and harden model deployment workflows using Python, Kubernetes, containers, and orchestration tools such as Argo Workflows and GitOps.
- Design reproducible model packaging, versioning, and rollback strategies for large-scale serving.
- Stay current with advances in LLM inference acceleration, quantization, distillation, and model compilation techniques (e.g., GGUF, AWQ, FP8).