BS degree and a minimum of 10 years of relevant industry experience.
Strong experience in evaluating supervised, unsupervised, and deep learning models.
Hands-on experience evaluating LLMs (e.g., GPT, Claude, PaLM) and using them as scoring/judging mechanisms.
Familiarity with multimodal models (e.g., image + text, video + audio) and related evaluation challenges.
Proficiency in Python and libraries such as NumPy, pandas, scikit-learn, PyTorch, or TensorFlow.
Solid understanding of statistical testing, sampling, confidence intervals, and metrics (e.g., precision/recall, BLEU, ROUGE, FID).
Strong documentation skills, including the ability to write technical reports and present to non-technical audiences.
Experience working with open-source evaluation tools such as OpenEval, Elo-based ranking, or LLM-as-a-Judge frameworks.
Familiarity with prompt engineering and few-shot or zero-shot evaluation techniques.
Experience evaluating generative models (e.g., text generation, image generation).
Prior contributions to ML benchmarks or public evaluations.
Strong interpersonal skills.
Note: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.