
NVIDIA Senior Prompt Benchmark Engineer, Evaluation, World Models

Date: 24.11.2025
Locations: US, CA, Santa Clara; US, CA, Remote
Time type: Full time
Posted: 5 Days Ago
Job requisition ID: 840023630
What you’ll be doing:

  • Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially world models for generation and understanding that reason about video, simulation, and physical environments.

  • Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.

  • Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.

  • Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus (see the illustrative sketch after this list).

  • Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.

  • Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.

  • Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.

  • Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.
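To make the ensemble-evaluation bullet above concrete, here is a minimal sketch: several VLM judges answer the same multiple-choice question in a structured JSON format, and a consensus is taken by majority vote. The judge names, the JSON schema, and the helper functions are illustrative assumptions, not part of any specific NVIDIA or Cosmos pipeline.

```python
# Minimal sketch: majority-vote consensus across several VLM judges, assuming
# each judge returns its answer as JSON such as {"answer": "B"}. Judge names
# and the response format are hypothetical placeholders for illustration only.
import json
from collections import Counter

def parse_choice(raw_response: str) -> str | None:
    """Extract the 'answer' field from a JSON-formatted judge response."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # unparseable responses are excluded from the vote
    return data.get("answer") if isinstance(data, dict) else None

def majority_vote(responses: list[str]) -> tuple[str | None, float]:
    """Return the consensus choice and the agreement ratio across judges."""
    choices = [c for c in (parse_choice(r) for r in responses) if c is not None]
    if not choices:
        return None, 0.0
    winner, count = Counter(choices).most_common(1)[0]
    return winner, count / len(choices)

if __name__ == "__main__":
    # Hypothetical outputs from three judge models for one benchmark question.
    judge_outputs = {
        "vlm_a": '{"answer": "B"}',
        "vlm_b": '{"answer": "B"}',
        "vlm_c": '{"answer": "C"}',
    }
    consensus, agreement = majority_vote(list(judge_outputs.values()))
    print(f"consensus={consensus}, agreement={agreement:.2f}")
```

Encoding prompts and expected answers in the same structured format is what lets this kind of cross-model comparison run automatically and at scale.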

What we need to see:

  • Demonstrated experience with prompt engineering, including crafting, refining, and optimizing prompts.

  • Strong attention to detail in designing natural language questions and formatting structured evaluations.

  • Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.

  • Experience crafting or contributing to benchmarks or evaluation datasets, especially for multimodal or agentic systems.

  • Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.

  • Excellent communication and collaboration skills—you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.

  • A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.

  • 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.

  • BS, MS, or equivalent background. Prior experience in AI evaluation, annotation workflows, or research is highly valued.

Ways to stand out from the crowd:

  • Hands-on experience with multiple LLMs or VLMs (e.g., GPT, Claude, Gemini, Flamingo, Kosmos, IDEFICS) to compare outputs and engineer task-specific prompts.

  • Prior work designing benchmarks for robotics, simulation, AV, or agentic tasks, especially in multimodal or video-based settings.

  • Experience working with human annotation teams, building clear instructions and QA processes for large-scale labeling campaigns.

  • Familiarity with using VLMs as evaluators, leveraging models for response scoring, ranking, or consensus aggregation.

  • Deep curiosity about model behavior and a drive to test, interrogate, and stretch the limits of generative systems.

You will also be eligible for equity and benefits.