LLM Evals Engineering Lead

Grafton Sciences
San Francisco, CA (Full Time)
POSTED ON 1/1/2026
AVAILABLE BEFORE 1/30/2026
About Grafton Sciences

We’re building AI systems with general physical ability — the capacity to experiment, engineer, or manufacture anything. We believe achieving this is a key step towards building superintelligence. With deep technical roots and real-world progress at scale (e.g., a $42M NIH project), we’re pushing the frontier of physical AI. Joining us means inventing from first principles, owning real systems end-to-end, and helping build a capability the world has never had before.

About The Role

We’re seeking a Senior LLM Evals Engineer to build the evaluation and verification layer for agentic LLM systems that act in complex environments and drive autonomous workflows. You’ll design eval suites, automated verifiers, and regression gates that measure real progress on long-horizon planning, agent execution, uncertainty retirement, and end-to-end build success. This role spans systems engineering, rigorous experimentation, and tight collaboration with LLM scientists, agent/toolchain engineers, and simulation teams.

Responsibilities

  • Build an eval harness for agentic LLM systems (offline, simulator-in-the-loop, and workflow-in-the-loop).
  • Design evals for long-horizon planning, specific agent-call correctness, recovery behavior, and safety/constraint adherence.
  • Contribute to verifier-driven scoring (symbolic checks, simulation/twin checks, surrogate checks) and automated self-correction of the execution pipeline.
  • Create regression gates and release criteria for model/prompt/toolchain changes; prevent capability and safety regressions.
  • Define metrics for outlier identification and for efficient question-asking that reduces uncertainty per unit time.
  • Partner with training teams to turn eval failures into data (SFT/DPO/RL signals) and continuously improve the suite.

Qualifications

  • Strong experience building evaluation systems for ML models (LLMs preferred) with high engineering rigor.
  • Excellent software engineering skills (Python, data pipelines, test harnesses, distributed execution, reproducibility).
  • Deep understanding of agentic failure modes (tool misuse, hallucinated evidence, reward hacking, brittle formatting) and how to measure them.
  • Ability to work across research and production systems in a fast-moving environment.

Compensation

  • We offer a competitive salary, meaningful equity, and benefits.

Salary.com Estimation for LLM Evals Engineering Lead in San Francisco, CA
$115,184 to $136,724


