The final bench scoreboard — 25 models, $1.92, all CAPABLE
Following yesterday's
retraction — where we found that our bench's v1 case authoring was punishing models for *not* making up REST-style parameters that the actual MCP tools didn't accept — we shipped a v3 bench with four case-design fixes, per-case allowed_tools narrowing, and (most importantly) a multi-judge consensus on the semantic axis: scores are the median of three judges from three different families (Anthropic + DeepSeek + Gemini) rather than a single Anthropic judge.
Then we ran the largest sweep we've done to date: 25 models across 5 providers, $1.92 in total API spend, three hours wall-clock. This post is the scoreboard.
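For concreteness, the consensus step is just a median over three independent judge scores per case. A minimal sketch, assuming a `judge_semantic` helper that wraps the three judge APIs; the judge identifiers and the helper are stand-ins, not the harness's actual interface:

```python
from statistics import median

# Stand-in judge identifiers, one per model family as in the v3 bench.
JUDGES = ["anthropic-judge", "deepseek-judge", "gemini-judge"]

def judge_semantic(judge: str, transcript: str) -> float:
    """Placeholder: have one judge model score a transcript in [0, 1]."""
    raise NotImplementedError  # wired to the real judge APIs in the harness

def consensus_semantic(transcript: str) -> float:
    """Per-case semantic score: median of three judges from three families.

    Unlike the mean, the median means a single outlier judge (e.g. one that
    scores its own family generously) cannot move the final score on its own.
    """
    return median(judge_semantic(j, transcript) for j in JUDGES)
```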
Every model hit CAPABLE
From qwen3:14b running locally on a single RTX 5080
to claude-opus-4-7 to gemini-3.1-pro-preview,
every single one of the 25 entries cleared all three per-axis
thresholds (≥80% structural, ≥80% functional, ≥0.70 semantic).
The bench can no longer distinguish most third-party models from the Anthropic baseline at the verdict level — the prior sweeps' classifications were essentially case-design and judge-bias noise. The real signal is now cost-per-pass and overall-pass count, not the CAPABLE verdict itself.
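For reference, the verdict itself is nothing more than three per-axis threshold checks per model. A minimal sketch with illustrative numbers; the AxisScores shape and the failure label are assumptions, not the bench's actual record format:

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    structural: float  # fraction of cases with schema-valid tool calls, 0..1
    functional: float  # fraction of cases with the right observable effect, 0..1
    semantic: float    # consensus judge score averaged over cases, 0..1

def verdict(s: AxisScores) -> str:
    """CAPABLE requires clearing all three per-axis thresholds."""
    capable = s.structural >= 0.80 and s.functional >= 0.80 and s.semantic >= 0.70
    return "CAPABLE" if capable else "BELOW THRESHOLD"  # failure label is a stand-in

# Illustrative values only -- every model in this sweep landed on the CAPABLE side.
print(verdict(AxisScores(structural=0.94, functional=0.88, semantic=0.85)))  # CAPABLE
```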
Top performers by overall-pass
| Rank | Provider / Model | Overall | Semantic | $/pass | Cost | s/case |
|---|---|---|---|---|---|---|
| 1 | ollama / deepseek-v4-pro | 14/16 | 0.91 | $0.0022 | $0.030 | 6.93 |
| 1 | ollama / glm-4.7 | 14/16 | 0.89 | $0.0021 | $0.030 | 6.51 |
| 3 | anthropic / claude-haiku-4-5 | 13/16 | 0.87 | $0.0135 | $0.176 | 7.36 |
| 3 | gemini / gemini-2.5-flash-lite | 13/16 | 0.88 | $0.0016 | $0.021 | 6.22 |
| 3 | gemini / gemini-2.5-pro | 13/16 | 0.81 | $0.0068 | $0.088 | 6.08 |
| 3 | gemini / gemini-3.1-pro-preview | 13/16 | 0.87 | $0.0017 | $0.022 | 6.44 |
| 3 | ollama / gpt-oss:120b | 13/16 | 0.81 | $0.0023 | $0.029 | 7.98 |
| 3 | ollama / minimax-m2.7 | 13/16 | 0.87 | $0.0023 | $0.030 | 6.71 |
| 3 | ollama / qwen3-coder-next | 13/16 | 0.90 | $0.0023 | $0.030 | 7.04 |
ollama/deepseek-v4-pro tops both
overall-pass (14/16) AND semantic (0.91) — at
$0.0022/pass it's 6× cheaper than haiku and roughly 37× cheaper than opus
for strictly better outcomes. gemini-2.5-pro is the
only model with 16/16 structural (perfect schema compliance) and
the fastest in the sweep at 6.08 s/case.
Cost-per-pass leaderboard
Sorted cheapest first. Every entry below cleared CAPABLE; what matters is weighing $/pass against overall-pass count.
| Rank | Model | $/pass | Overall |
|---|---|---|---|
| 1–2 | ollama-local / qwen3:14b, glm-4.7-flash:q4_K_M | $0 | 12/16 |
| 3–4 | deepseek / deepseek-v4-flash, v4-pro (direct) | $0.0015 | 12/16 |
| 5 | gemini / gemini-2.5-flash-lite | $0.0016 | 13/16 |
| 6 | gemini / gemini-3.1-pro-preview | $0.0017 | 13/16 |
| 11 | ollama / glm-4.7 | $0.0021 | 14/16 |
| 12 | ollama / deepseek-v4-pro | $0.0022 | 14/16 |
| 22 | gemini / gemini-2.5-pro | $0.0068 | 13/16 |
| 23 | anthropic / claude-haiku-4-5 | $0.0135 | 13/16 |
| 24 | anthropic / claude-sonnet-4-6 | $0.0164 | 11/16 |
| 25 | anthropic / claude-opus-4-7 | $0.0821 | 12/16 |
Anthropic's models are now the three most expensive entries in the entire 25-model field. Even sonnet at $0.0164/pass is 10× more expensive than gemini-2.5-flash-lite for an objectively lower overall-pass count. Opus is more than 50× more expensive than direct DeepSeek for the same CAPABLE verdict and a lower semantic average.
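For anyone reproducing the table, $/pass is simply each model's total sweep spend divided by its overall-pass count. A minimal sketch; the SweepResult shape is assumed for illustration, and the two rows are taken from the tables above:

```python
from dataclasses import dataclass

@dataclass
class SweepResult:
    model: str
    total_cost_usd: float  # API spend across all 16 cases
    overall_passes: int    # cases passed, out of 16

def cost_per_pass(r: SweepResult) -> float:
    # Local models with $0 spend come out to $0/pass; guard only against zero passes.
    return r.total_cost_usd / r.overall_passes if r.overall_passes else float("inf")

results = [
    SweepResult("anthropic/claude-haiku-4-5", 0.176, 13),
    SweepResult("ollama/glm-4.7", 0.030, 14),
]
for r in sorted(results, key=cost_per_pass):  # cheapest-per-pass first
    print(f"{r.model:30s} ${cost_per_pass(r):.4f}/pass  {r.overall_passes}/16")
```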
Three findings worth taking away
- Opus 4.7 is the worst $/perf in the field. Same CAPABLE verdict as the cheapest open-weights model, but at $0.0821/pass with a lower semantic average (0.86) than contenders that cost a fraction. We've removed it from our production tier policy. The only use case where opus survives is one where you specifically need its reasoning depth and the premium is acceptable; for general agentic tool-use, the math does not work.
- Multi-judge consensus changed which A/B answer was correct. Last week's bench (single Anthropic judge) said direct DeepSeek API beat Ollama-Cloud-hosted DeepSeek by 2 pass points. This week's bench (consensus of three judges) says Ollama Cloud beats direct DeepSeek by ~0.07 semantic average at a 63% cost premium. The relative ranking is the trustworthy signal; absolute scores still drift with judge choice. Treat single-judge results as suggestive, not decisive.
- Local quantized models are now production-credible. glm-4.7-flash:q4_K_M running on a single RTX 5080 at $0 cost comes within one pass of haiku on overall-pass (12/16 vs 13/16) and beats haiku on speed (6.4 vs 7.4 s/case). qwen3:14b ties it. For cost-sensitive deployments where you control the hardware, the cloud-vs-local cost gap is not 10×; it's ∞×.
What we picked
The tier-policy decisions emerging from Sweep #6, for our production routing in loomcycle (sketched as a config after the list):

- Premium tier (quality-critical agents): ollama/deepseek-v4-pro primary, ollama/qwen3-coder-next fallback. 0.91 / 0.90 semantic at roughly 1/37 the cost of opus.
- Mid tier (default): gemini/gemini-2.5-flash-lite primary, gemini/gemini-3.1-pro-preview fallback. Cheapest cloud entries that still hit 13/16 overall.
- Local / free tier: ollama-local/glm-4.7-flash:q4_K_M for offline development; ollama-local/qwen3:14b for non-time-sensitive batch work. Both $0.
- Removed: claude-opus-4-7, claude-sonnet-4-6, and claude-haiku-4-5 from the default tier candidates. Kept as available-on-request for specific workloads where the cost premium is justified.
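Here's roughly what that policy looks like as a routing config. A minimal sketch assuming a flat tier-to-model mapping; the field names and the route helper are illustrative, not loomcycle's actual schema:

```python
# Illustrative tier-policy structure -- not loomcycle's actual config schema.
TIER_POLICY = {
    "premium": {  # quality-critical agents
        "primary": "ollama/deepseek-v4-pro",
        "fallback": "ollama/qwen3-coder-next",
    },
    "mid": {      # default tier
        "primary": "gemini/gemini-2.5-flash-lite",
        "fallback": "gemini/gemini-3.1-pro-preview",
    },
    "local": {    # offline development / free batch work
        "primary": "ollama-local/glm-4.7-flash:q4_K_M",
        "fallback": "ollama-local/qwen3:14b",
    },
}

# Pulled from the default tiers; available on explicit request only.
ON_REQUEST = [
    "anthropic/claude-opus-4-7",
    "anthropic/claude-sonnet-4-6",
    "anthropic/claude-haiku-4-5",
]

def route(tier: str, prefer_fallback: bool = False) -> str:
    """Resolve one of the default tiers to a model id."""
    entry = TIER_POLICY[tier]
    return entry["fallback"] if prefer_fallback else entry["primary"]

print(route("mid"))  # gemini/gemini-2.5-flash-lite
```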
One closing note on methodology
The biggest meta-finding across this bench journey: a single judge model imports its own family's biases into the semantic scores. Anthropic's judge ran harshest on third-party models; running consensus across three families brought third-party scores up across the board (glm-4.7 moved +0.26 on semantic with consensus vs single-judge; even Sonnet itself moved +0.10). The relative ordering survives — that's the signal you can trust. Absolute semantic averages are not portable across single-judge / multi-judge regimes.
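If you want to sanity-check that yourself, compare the rank order of models under the two judge regimes rather than the raw averages. A small sketch with made-up numbers for three hypothetical models, not the sweep's actual per-model data:

```python
# Made-up semantic averages under each judge regime, for illustration only.
single_judge = {"model-a": 0.62, "model-b": 0.66, "model-c": 0.75}
consensus    = {"model-a": 0.84, "model-b": 0.87, "model-c": 0.88}

def ranking(scores: dict[str, float]) -> list[str]:
    """Models ordered best-first by semantic average."""
    return sorted(scores, key=scores.get, reverse=True)

# Absolute scores shift a lot between regimes...
print({m: round(consensus[m] - single_judge[m], 2) for m in single_judge})
# ...but the best-first ordering is what carries across judge setups.
print(ranking(single_judge) == ranking(consensus))  # True for this toy data
```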
We're done with the bench iteration for now. v3 + multi-judge is a stable foundation. The next sweep happens when there's a new model family worth admitting or a new case dimension worth adding.