What are the responsibilities and job description for the LLM Evals Engineering Lead position at Grafton Sciences?
About Grafton Sciences
We’re building AI systems with general physical ability — the capacity to experiment, engineer, or manufacture anything. We believe achieving this is a key step towards building superintelligence. With deep technical roots and real-world progress at scale (e.g., a $42M NIH project), we’re pushing the frontier of physical AI. Joining us means inventing from first principles, owning real systems end-to-end, and helping build a capability the world has never had before.
About The Role
We’re seeking a Senior LLM Evals Engineer to build the evaluation and verification layer for agentic LLM systems acting in complex environments to drive autonomous workflows. You’ll design eval suites, automated verifiers, and regression gates that measure real progress on long-horizon planning, agent execution, uncertainty retirement, and end-to-end build success. This role spans systems engineering, rigorous experimentation, and tight collaboration with LLM scientists, agent/toolchain engineers, and simulation teams.
Responsibilities
- Build an eval harness for agentic LLM systems (offline, simulator-in-the-loop, and workflow-in-the-loop).
- Design evals for long-horizon planning, correctness of individual agent calls, recovery behavior, and safety/constraint adherence.
- Contribute to verifier-driven scoring (symbolic checks, simulation/twin checks, surrogate checks) and to automated self-correction of the execution pipeline.
- Create regression gates and release criteria for model/prompt/toolchain changes; prevent capability and safety regressions (see the illustrative sketch after this list).
- Define metrics for outlier identification and for efficient question-asking that reduces uncertainty per unit time.
- Partner with training teams to turn eval failures into data (SFT/DPO/RL signals) and continuously improve the suite.
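For illustration only, here is a minimal sketch of the kind of eval harness and regression gate described above; the case structure, verifier signature, and threshold are hypothetical and not drawn from Grafton's actual stack:

```python
# Hypothetical sketch of an eval suite with a simple regression gate.
# Task names, verifiers, and thresholds are placeholders for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    task_id: str
    prompt: str
    verifier: Callable[[str], bool]  # returns True if the agent's output passes

def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and score it with its verifier."""
    return {c.task_id: c.verifier(agent(c.prompt)) for c in cases}

def regression_gate(candidate: dict[str, bool],
                    baseline: dict[str, bool],
                    max_new_failures: int = 0) -> bool:
    """Block a release if the candidate fails tasks the baseline passed."""
    regressions = [t for t, passed in baseline.items()
                   if passed and not candidate.get(t, False)]
    return len(regressions) <= max_new_failures

# Usage (hypothetical agents and cases):
# baseline_results = run_suite(current_agent, cases)
# candidate_results = run_suite(new_agent, cases)
# assert regression_gate(candidate_results, baseline_results), "capability regression"
```

In a real suite the verifier layer would also cover simulation/twin and safety checks, and the gate would likely track pass-rate deltas per capability area rather than a raw failure count.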
Qualifications
- Strong experience building evaluation systems for ML models (LLMs preferred) with high engineering rigor.
- Excellent software engineering skills (Python, data pipelines, test harnesses, distributed execution, reproducibility).
- Deep understanding of agentic failure modes (tool misuse, hallucinated evidence, reward hacking, brittle formatting) and how to measure them.
- Ability to work across research and production systems in a fast-moving environment.
We offer a competitive salary, meaningful equity, and benefits.