In this role you will:

- Design and analyze human evaluations of AI systems, creating reliable annotation frameworks and ensuring the validity and reliability of measurements of latent constructs (see the reliability sketch after this list)
- Develop and refine benchmarks and evaluation protocols, using statistical modeling, test theory, and task design to capture model performance across diverse contexts and user needs
- Conduct statistical analysis of evaluation data to extract meaningful insights, identify systematic issues, and inform improvements to both models and evaluation processes
- Independently run and analyze experiments that lead to tangible improvements
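
As a concrete illustration of the reliability work mentioned in the first bullet, here is a minimal sketch, assuming two hypothetical annotators assigning binary pass/fail labels to the same set of model responses: it estimates chance-corrected inter-rater agreement (Cohen's kappa) with a bootstrap confidence interval. The data, agreement rates, and sample size are invented for the example and are not part of any actual evaluation.

```python
# Illustrative sketch (hypothetical data): estimating inter-rater reliability
# for an annotation framework using Cohen's kappa with a bootstrap CI.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical: two annotators label the same 500 model responses
# as "pass" (1) or "fail" (0), each with imperfect accuracy.
n_items = 500
true_quality = rng.integers(0, 2, size=n_items)
annotator_a = np.where(rng.random(n_items) < 0.90, true_quality, 1 - true_quality)
annotator_b = np.where(rng.random(n_items) < 0.85, true_quality, 1 - true_quality)

# Point estimate of chance-corrected agreement between the two annotators.
kappa = cohen_kappa_score(annotator_a, annotator_b)

# Nonparametric bootstrap over items to quantify uncertainty in kappa.
boot = []
for _ in range(2000):
    idx = rng.integers(0, n_items, size=n_items)
    boot.append(cohen_kappa_score(annotator_a[idx], annotator_b[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"Cohen's kappa = {kappa:.3f} (95% bootstrap CI: {lo:.3f} to {hi:.3f})")
```

In practice, annotation schemes with more than two raters or with ordinal/continuous labels would call for measures such as Krippendorff's alpha, and low observed agreement would feed back into refining the annotation guidelines themselves.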