What are the responsibilities and job description for the LLM Evals Engineering Lead position at Grafton Sciences?
About Grafton Sciences
We’re building AI systems with general physical ability — the capacity to experiment, engineer, or manufacture anything. We believe achieving this is a key step towards building superintelligence. With deep technical roots and real-world progress at scale (e.g., a $42M NIH project), we’re pushing the frontier of physical AI. Joining us means inventing from first principles, owning real systems end-to-end, and helping build a capability the world has never had before.
About The Role
We’re seeking a Senior LLM Evals Engineer to build the evaluation and verification layer for agentic LLM systems acting in complex environments to drive autonomous workflows. You’ll design eval suites, automated verifiers, and regression gates that measure real progress on long-horizon planning, agent execution, uncertainty retirement, and end-to-end build success. This role spans systems engineering, rigorous experimentation, and tight collaboration with LLM scientists, agent/toolchain engineers, and simulation teams.
Responsibilities
- Build an eval harness for agentic LLM systems (offline, simulator-in-the-loop, and workflow-in-the-loop).
- Design evals for long-horizon planning, correctness of individual agent calls, recovery behavior, and safety/constraint adherence.
- Contribute to verifier-driven scoring (symbolic checks, simulation/twin checks, surrogate checks) and to automated self-correction of the execution pipeline.
- Create regression gates and release criteria for model/prompt/toolchain changes; prevent capability and safety regressions (see the illustrative sketch after this list).
- Define metrics for outlier identification and for efficient question-asking that reduces uncertainty per unit time.
- Partner with training teams to turn eval failures into data (SFT/DPO/RL signals) and continuously improve the suite.
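For illustration only, here is a minimal sketch of the kind of eval harness and regression gate described above; the case structure, verifier signature, and threshold are hypothetical and not drawn from Grafton's actual stack:

```python
# Hypothetical sketch of an eval suite with a simple regression gate.
# Task names, verifiers, and thresholds are placeholders for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    task_id: str
    prompt: str
    verifier: Callable[[str], bool]  # returns True if the agent's output passes

def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and score it with its verifier."""
    return {c.task_id: c.verifier(agent(c.prompt)) for c in cases}

def regression_gate(candidate: dict[str, bool],
                    baseline: dict[str, bool],
                    max_new_failures: int = 0) -> bool:
    """Block a release if the candidate fails tasks the baseline passed."""
    regressions = [t for t, passed in baseline.items()
                   if passed and not candidate.get(t, False)]
    return len(regressions) <= max_new_failures

# Usage (hypothetical agents and cases):
# baseline_results = run_suite(current_agent, cases)
# candidate_results = run_suite(new_agent, cases)
# assert regression_gate(candidate_results, baseline_results), "capability regression"
```

In a real suite the verifier layer would also cover simulation/twin and safety checks, and the gate would likely track pass-rate deltas per capability area rather than a raw failure count.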
Qualifications
- Strong experience building evaluation systems for ML models (LLMs preferred) with high engineering rigor.
- Excellent software engineering skills (Python, data pipelines, test harnesses, distributed execution, reproducibility).
- Deep understanding of agentic failure modes (tool misuse, hallucinated evidence, reward hacking, brittle formatting) and how to measure them.
- Ability to work across research and production systems in a fast-moving environment.
We offer a competitive salary, meaningful equity, and benefits.