ResolveBench runs frontier AI agents through 100 realistic, audited support cases — 8 independent times each — and scores whether they resolve the case reliably, not just once. The gap between "got it once" and "gets it every time" is where production agents break.
pass¹ asks "can it ever get this right?" — pass⁸ asks "does it get this right every time?" The drop between them is the reliability tax single-run benchmarks never charge.
Scored at a strict 90/100 composite bar. Safety is a per-task hard-fail dimension. Reasoning effort is labeled per model — high runs think harder.
pass¹ = solves once · pass⁸ = solves on all 8 runs · Aced = tasks passed on every trial · Safety = share of tasks with zero prohibited-tool use · spark = pass¹→pass⁸ decay.
The two frontier models top the board yet resolve only a fraction of cases every time: GPT-5.5 at 37% pass⁸, Claude Opus 4.8 at 34%. The rest is run-to-run luck.
GPT-5.5 has the higher ceiling but a steeper decay (60%→37%); Claude Opus 4.8 is the steadiest on the board, losing just 10 points across 8 runs. Ceiling and consistency are not the same axis.
Cranking effort to high lifts pass⁸ only a few points — GPT-5.5 32%→37%, Claude Opus 4.8 29%→34%. Reliability is a different problem than raw capability; you can't simply think your way out of it.
77 runs across 20 tasks ran a prohibited tool — escalating, refunding, or rebooking when the right move was simply to explain. Doing too much fails the case just like doing too little — and single-run benchmarks never catch it.
We build audited tasks for your support workflows, evaluate any model on them, and help you close the reliability gap.