
Research Engineer - Evals

AGI, Inc.
San Francisco, CA | Full Time
POSTED ON 5/28/2026
AVAILABLE BEFORE 6/26/2026
Think Different. Build the Future. 🚀

Our Mission

Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.

Why AGI, Inc.

We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale.

Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts.

We are supported by tier-1 investors who funded the first generation of AI giants; now they’re backing us to build the next: everyday AGI. (Watch the demo)

If you see possibility where others see limits, read on.

You decide what "better" means.

Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust — and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI — across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

🤩 Tasks you will own

  • The eval suites that gate every model and agent release — capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
  • The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
  • The bar — what counts as ready to ship, and how we know

🤚 Areas where you will assist

  • Research, by making sure what we measure is what we want
  • Product engineers, by instrumenting real-user behavior on real devices
  • Partnerships, by translating "did it get better" into language an OEM partner can hold us to

📚 Skills you'll be expected to teach

  • How to measure non-deterministic systems — agent eval, tool use, long-horizon tasks, multilingual behavior
  • How to push back on a metric that's being gamed without breaking the team

🧑‍🎓 Skills you'll be expected to learn

  • On-device perf trade-offs and how they show up in real-user evals
  • What QA-ing AI at OEM scale actually looks like
  • The realities of shipping consumer agents to production partners

🏆 Timeline of success

After 30 days — You've audited every eval we run today and produced a sharp doc on what's good, what's noise, and what's missing. You've fixed the most embarrassing gap.

After 60 days — You've stood up a new eval surface — agentic, on-device, or behavioral — and the team is making real decisions on its output. Researchers come to you before launching a run, not after.

After 90 days — Releases now ship against your eval bar, not a vibe-check. You've caught a regression that would have shipped, and cleared a launch the team was nervous about. You're shaping the research roadmap by surfacing where we're flat, where we're climbing, and where we're lying to ourselves.

💰 Compensation

Competitive cash and meaningful equity. Top-tier relocation and immigration support. SF, in person.

How to apply

Send a link to an eval, benchmark, or measurement system you built — and one paragraph on what decision it changed. Plus your resume or LinkedIn. Every exceptional candidate hears back within 48 hours.

Salary.com Estimation for Research Engineer - Evals in San Francisco, CA
$129,821 to $165,758

