● Reliability-first agent evaluation

Solving a support case once is easy.
Solving it every single time isn't.

ResolveBench runs frontier AI agents through 100 realistic, audited support cases — 8 independent times each — and scores whether they resolve the case reliably, not just once. The gap between "got it once" and "gets it every time" is where production agents break.

The reliability gap

Every model looks better than it is

pass¹ asks "can it ever get this right?" — pass⁸ asks "does it get this right every time?" The drop between them is the reliability tax single-run benchmarks never charge.

pass¹ = solved on at least one of 8 runs
pass⁸ = solved on all 8 runs
drop = reliability lost between one try and eight

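
As a rough illustration of how these three numbers relate, here is a minimal Python sketch. It assumes each task's results are stored as a list of 8 pass/fail booleans; the task ids, the `reliability_summary` name, and the data layout are hypothetical and not ResolveBench's actual harness.

```python
from typing import Dict, List

# Hypothetical layout: task id -> pass/fail outcome of each independent run.
Trials = Dict[str, List[bool]]


def reliability_summary(trials: Trials) -> Dict[str, float]:
    """Return pass1, pass8 and the drop between them, as percentages of tasks."""
    if not trials or any(len(runs) == 0 for runs in trials.values()):
        raise ValueError("every task needs at least one run")
    n_tasks = len(trials)
    # pass1 as defined in the legend: solved on at least one run.
    solved_once = sum(any(runs) for runs in trials.values())
    # pass8 as defined in the legend: solved on every run.
    solved_all = sum(all(runs) for runs in trials.values())
    pass_1 = 100.0 * solved_once / n_tasks
    pass_8 = 100.0 * solved_all / n_tasks
    return {"pass1": pass_1, "pass8": pass_8, "drop": pass_1 - pass_8}


# Made-up example data: four tasks, eight runs each.
example = {
    "refund-001": [True] * 8,
    "rebook-002": [True, False, True, True, False, True, True, True],
    "explain-003": [False] * 8,
    "cancel-004": [True] * 8,
}
summary = reliability_summary(example)
print(summary)  # {'pass1': 75.0, 'pass8': 50.0, 'drop': 25.0}
```

In this toy data, the task that passes 6 of 8 runs counts toward pass¹ but not pass⁸, which is exactly the gap a single-run score would hide.
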
Leaderboard

100 tasks · 8 trials each · pass^k reliability

Scored at a strict 90/100 composite bar. Safety is a per-task hard-fail dimension. Reasoning effort is labeled per model; runs labeled high use more reasoning effort.

pass¹ = solves once · pass⁸ = solves on all 8 runs · Aced = tasks passed on every trial · Safety = share of tasks with zero prohibited-tool use · spark = pass¹→pass⁸ decay.
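
To make the Aced and Safety columns concrete, the sketch below shows one way they could be derived. It assumes a trial passes when its composite score is at least 90 and no prohibited tool was used, and that the hard fail is applied at the trial level. The `Trial` record, the inclusive threshold, and the function names are illustrative assumptions, not the published scoring code.

```python
from dataclasses import dataclass
from typing import Dict, List

PASS_BAR = 90  # composite bar out of 100 (from the text); inclusive is an assumption


@dataclass
class Trial:
    composite: float            # hypothetical 0-100 composite score
    used_prohibited_tool: bool  # True if any prohibited tool was called


def trial_passes(t: Trial) -> bool:
    # Safety is a hard fail: a prohibited tool call fails the trial
    # no matter how high the composite score is.
    return (not t.used_prohibited_tool) and t.composite >= PASS_BAR


def board_columns(results: Dict[str, List[Trial]]) -> Dict[str, float]:
    n_tasks = len(results)
    # Aced: tasks passed on every trial.
    aced = sum(all(trial_passes(t) for t in ts) for ts in results.values())
    # Safety: share of tasks with zero prohibited-tool use across all trials.
    safe = sum(not any(t.used_prohibited_tool for t in ts) for ts in results.values())
    return {"aced": aced, "safety_pct": 100.0 * safe / n_tasks}


# Made-up results: two trials per task to keep the example short.
results = {
    "a": [Trial(95, False), Trial(92, False)],
    "b": [Trial(99, True), Trial(91, False)],    # high score, but hard-failed
    "c": [Trial(89.5, False), Trial(96, False)],  # just under the bar once
    "d": [Trial(90, False), Trial(100, False)],
}
columns = board_columns(results)
print(columns)  # {'aced': 2, 'safety_pct': 75.0}
```

Task "b" shows why safety is reported per task: a single prohibited call removes the task from the Safety share even though its other trial was clean.
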

What we found

Four things single-run leaderboards hide

01

Even the leaders miss most cases

The two frontier models top the board yet resolve only a fraction of cases every time: GPT-5.5 at 37% pass⁸, Claude Opus 4.8 at 34%. The rest is run-to-run luck.

02

They fail in different ways

GPT-5.5 has the higher ceiling but a steeper decay (60%→37%, a 23-point drop); Claude Opus 4.8 is the steadiest on the board, losing just 10 points across 8 runs. Ceiling and consistency are not the same axis.

03

More reasoning barely helps

Cranking effort to high lifts pass⁸ by only 5 points for each leader: GPT-5.5 32%→37%, Claude Opus 4.8 29%→34%. Reliability is a different problem from raw capability; you can't simply think your way out of it.

04

Over-action is a silent killer

77 runs across 20 tasks ran a prohibited tool — escalating, refunding, or rebooking when the right move was simply to explain. Doing too much fails the case just like doing too little — and single-run benchmarks never catch it.
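
A count like "runs across tasks" can be tallied from tool-call traces. The sketch below assumes each run is recorded as an ordered list of tool names and each task carries its own set of prohibited tools; the tool names, task ids, and helper functions are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def first_violation(trace: List[str], prohibited: Set[str]) -> Optional[int]:
    """Return the 0-based step of the first prohibited tool call, or None."""
    for step, tool in enumerate(trace):
        if tool in prohibited:
            return step
    return None


def tally_over_action(
    runs: List[Tuple[str, List[str]]],
    prohibited_by_task: Dict[str, Set[str]],
) -> Tuple[int, int]:
    """Count violating runs and the number of distinct tasks they touch."""
    bad_runs = 0
    bad_tasks: Set[str] = set()
    for task_id, trace in runs:
        prohibited = prohibited_by_task.get(task_id, set())
        if first_violation(trace, prohibited) is not None:
            bad_runs += 1
            bad_tasks.add(task_id)
    return bad_runs, len(bad_tasks)


# Made-up traces: (task id, ordered tool calls for one run).
runs = [
    ("t1", ["lookup_order", "explain_policy"]),
    ("t1", ["lookup_order", "issue_refund"]),  # refunded instead of explaining
    ("t2", ["escalate"]),
    ("t2", ["escalate", "explain_policy"]),
    ("t3", ["issue_refund"]),                  # refund is allowed on t3
]
prohibited_by_task = {"t1": {"issue_refund", "escalate"}, "t2": {"escalate"}}

print(tally_over_action(runs, prohibited_by_task))  # (3, 2)
print(first_violation(runs[1][1], prohibited_by_task["t1"]))  # 1
```

Keeping the prohibited set per task matters here: the same refund call is a violation on one case and the correct resolution on another.
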

Why ResolveBench

Built for trust, not just a number

pass^k
Reliability, not luck
Every task runs 8×, and we report the decay curve, so a flashy one-shot score can't hide run-to-run variance.

Per-task
Safety as hard-fail
Running a prohibited tool fails the task. Reported per task (not a rounded per-trial rate) and traced to the exact step.

Audited
Verified goldens
Every reference solution is replayed against the seed and checked for reachability; we confirmed zero broken goldens.

Outcome
Scored on the DB end-state
Graded on the actual database state the agent leaves behind (what really changed), not a self-declared label.
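
For readers curious what end-state grading can look like, here is a minimal sketch using Python's built-in `sqlite3`. It replays a sequence of actions against a seeded in-memory database and compares the resulting rows with those produced by a reference (golden) sequence. The schema, the `refund` action, and the exact-match comparison are assumptions for illustration; the real grader may compare state differently.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

# Hypothetical seed: a tiny orders table standing in for a support database.
SEED_SQL = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, refunded_cents INTEGER);
INSERT INTO orders VALUES (1, 'delivered', 0), (2, 'delayed', 0);
"""

Action = Callable[[sqlite3.Connection], None]
Snapshot = Dict[str, List[Tuple]]


def snapshot(conn: sqlite3.Connection) -> Snapshot:
    """Capture every table's rows in a deterministic order."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {
        name: sorted(conn.execute(f'SELECT * FROM "{name}"').fetchall(), key=repr)
        for name in names
    }


def end_state(actions: List[Action]) -> Snapshot:
    """Seed a fresh database, apply the actions, and return the final state."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SEED_SQL)
        for act in actions:
            act(conn)
        conn.commit()
        return snapshot(conn)
    finally:
        conn.close()


def refund(order_id: int, cents: int) -> Action:
    """Hypothetical tool: record a refund against one order."""
    def act(conn: sqlite3.Connection) -> None:
        conn.execute(
            "UPDATE orders SET refunded_cents = ? WHERE id = ?", (cents, order_id))
    return act


def resolved(agent_state: Snapshot, golden_state: Snapshot) -> bool:
    # Grade on what actually changed, not on what the agent says it did.
    return agent_state == golden_state


golden = end_state([refund(2, 500)])
agent_ok = end_state([refund(2, 500)])
agent_over = end_state([refund(2, 500), refund(1, 500)])  # one refund too many

print(resolved(agent_ok, golden))    # True
print(resolved(agent_over, golden))  # False
```

The same replay helper could, under these assumptions, double as a golden check: if the reference actions cannot be applied to the seed or do not reach the expected state, the task itself is broken rather than the agent.
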

Want your model — or your domain — on this board?

We build audited tasks for your support workflows, evaluate any model on them, and help you close the reliability gap.

Request custom tasks →