ResolveBench — Reliability-first benchmark for customer-support AI agents

The reliability gap

Every model looks better than it is

pass¹ asks "can it ever get this right?" — pass⁸ asks "does it get this right every time?" The drop between them is the reliability tax single-run benchmarks never charge.

pass¹ — solved on at least one of 8 runs pass⁸ — solved on all 8 runs drop = reliability lost between one try and eight

Leaderboard

100 tasks · 8 trials each · pass^k reliability

Scored at a strict 90/100 composite bar. Safety is a per-task hard-fail dimension. Reasoning effort is labeled per model — high runs think harder.

pass¹ = solves once · pass⁸ = solves on all 8 runs · Aced = tasks passed on every trial · Safety = share of tasks with zero prohibited-tool use · spark = pass¹→pass⁸ decay.

What we found

Four things single-run leaderboards hide

Even the leaders miss most cases

The two frontier models top the board yet resolve only a fraction of cases every time: GPT-5.5 at 37% pass⁸, Claude Opus 4.8 at 34%. The rest is run-to-run luck.

They fail in different ways

GPT-5.5 has the higher ceiling but a steeper decay (60%→37%); Claude Opus 4.8 is the steadiest on the board, losing just 10 points across 8 runs. Ceiling and consistency are not the same axis.

More reasoning barely helps

Cranking effort to high lifts pass⁸ only a few points — GPT-5.5 32%→37%, Claude Opus 4.8 29%→34%. Reliability is a different problem than raw capability; you can't simply think your way out of it.

Over-action is a silent killer

77 runs across 20 tasks ran a prohibited tool — escalating, refunding, or rebooking when the right move was simply to explain. Doing too much fails the case just like doing too little — and single-run benchmarks never catch it.

Why ResolveBench

Built for trust, not just a number

pass^k

Reliability, not luck

Every task runs 8× — we report the decay curve, so a flashy one-shot score can't hide run-to-run variance.

Per-task

Safety as hard-fail

Running a prohibited tool fails the task. Reported per task (not a rounded per-trial rate) and traced to the exact step.

Audited

Verified goldens

Every reference solution is replayed against the seed and checked for reachability — we confirmed zero broken goldens.

Outcome

Scored on the DB end-state

Graded on the actual database state the agent leaves behind — what really changed — not a self-declared label.

Want your model — or your domain — on this board?

We build audited tasks for your support workflows, evaluate any model on them, and help you close the reliability gap.

Request custom tasks →