Expoint - all jobs in one place


Apple Senior Software Engineer ML Inference 
United States, California, Cupertino 
670296332

Description
  • Optimize LLMs for Inference: Implement and enhance large language models for real-time and batch inference, balancing performance and resource efficiency.
  • Advanced Inference Optimization: Apply techniques such as quantization and speculative decoding to reduce model size and accelerate inference without sacrificing accuracy. Leverage quantization-aware training (QAT) and post-training quantization (PTQ) to deploy models on resource-constrained hardware.
  • Cross-Functional Collaboration: Partner with data scientists, ML researchers, and infrastructure engineering teams to understand model requirements, provide feedback, and ensure smooth deployment of models into production.
  • Monitoring & Resource Management: Implement monitoring tools to profile and track the performance of models running on GPUs, including real-time monitoring of GPU utilization, memory usage, and inference throughput. Manage and optimize resource allocation to ensure high availability and minimal downtime.
  • Continuous Improvement & R&D: Stay on top of the latest research in LLM inference techniques, GPU optimizations, and distributed systems to bring innovative improvements to the overall system.
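To make the quantization responsibility above concrete: post-training quantization (PTQ) at its core maps floating-point weights onto a small integer range via a scale factor. The following is a minimal, framework-free sketch of symmetric per-tensor int8 quantization; the function names and the toy weight list are illustrative, not part of any particular library.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 PTQ sketch: w_q = clamp(round(w / scale)).

    scale is chosen so the largest-magnitude weight maps to +/-127.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

# Toy weights for illustration only.
w = [0.5, -1.27, 0.04, 1.0]
q, s = quantize_int8(w)       # q == [50, -127, 4, 100]
w_hat = dequantize(q, s)      # close to the original weights
```

Production systems typically use per-channel scales, calibration data, and library support (e.g., the quantization tooling in PyTorch or TensorRT-LLM) rather than hand-rolled code like this, but the scale/round/clamp structure is the same.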
Minimum Qualifications
  • Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 5+ years in software engineering focused on ML inference, GPU acceleration, and large-scale systems.
  • Expertise in deploying and optimizing LLMs for high-performance, production-scale inference.
  • Proficiency in Python, Java, or C++.
  • Experience with deep learning frameworks like PyTorch, TensorFlow, and Hugging Face Transformers.
  • Experience with model serving tools (e.g., NVIDIA Triton, TensorFlow Serving, vLLM).
  • Experience with optimization techniques like Attention Fusion, Quantization, and Speculative Decoding.
  • Skilled in GPU optimization (e.g., CUDA, TensorRT-LLM, cuDNN) to accelerate inference tasks.
  • Familiarity with cloud technologies like Docker, Kubernetes, AWS EKS for scalable deployment.
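Speculative decoding, named in the qualifications above, is easy to sketch in miniature: a cheap draft model proposes several tokens ahead, and the target model verifies them, accepting the agreed prefix and substituting its own token at the first disagreement. The toy "models" below are hypothetical lookup tables standing in for real greedy decoders; this is an illustration of the accept/reject structure, not any library's API.

```python
# Hypothetical stand-ins for a small draft model and a large target model,
# each mapping the current token to its greedy next token.
DRAFT = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}
TARGET = {"the": "cat", "cat": "sat", "sat": "by", "by": "the"}

def propose(token, k):
    """Draft model greedily proposes up to k tokens ahead."""
    out = []
    for _ in range(k):
        token = DRAFT.get(token)
        if token is None:
            break
        out.append(token)
    return out

def speculative_step(token, k=4):
    """Verify draft tokens with the target; keep the agreed prefix,
    then append the target's own token at the first mismatch."""
    accepted = []
    cur = token
    for t in propose(token, k):
        if TARGET.get(cur) == t:      # target agrees: accept draft token
            accepted.append(t)
            cur = t
        else:                         # first disagreement: take target's token
            correction = TARGET.get(cur)
            if correction is not None:
                accepted.append(correction)
            break
    return accepted

speculative_step("the")  # → ["cat", "sat", "by"]
```

In real systems (e.g., vLLM's speculative decoding) acceptance is probabilistic over token distributions rather than an exact greedy match, but the draft-then-verify loop is the same, and the speedup comes from the target model scoring several proposed tokens in one batched forward pass.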
Preferred Qualifications
  • Master’s or PhD in Computer Science, Machine Learning, or a related field.
  • Understanding of ML Ops practices, continuous integration, and deployment pipelines for machine learning models.
  • Familiarity with model distillation, low-rank approximations, and other model compression techniques for reducing memory footprint and improving inference speed.
  • Strong understanding of distributed systems, multi-GPU/multi-node parallelism, and system-level optimization for large-scale inference.
Pay & Benefits
  • At Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $175,800 and $312,200, and your base pay will depend on your skills, qualifications, experience, and location.
  • Note: Apple benefit, compensation, and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.
  • Apple is an equal opportunity employer that is committed to inclusion and diversity. We take affirmative action to ensure equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics.