Each card shows the case, the golden resolution, and how each model performed across 8 runs. Click any task to open the golden solution side by side with a model's failing trajectory.