The final bench scoreboard — 25 models, $1.92, all CAPABLE
Following yesterday's
retraction — where we found that our bench's v1 case authoring was punishing models for *not* making up REST-style parameters that the actual MCP tools didn't accept — we shipped a v3 bench with four case-design fixes, per-case allowed_tools narrowing, and (most importantly) a multi-judge consensus on the semantic axis: scores are the median of three judges from three different families (Anthropic + DeepSeek + Gemini) rather than a single Anthropic judge.
Then we ran the largest sweep we've done to date: 25 models across 5 providers, $1.92 in total API spend, three hours wall-clock. This post is the scoreboard.
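For concreteness, the consensus step is just a median over three independent judge scores per case. A minimal sketch, assuming a `judge_semantic` helper that wraps the three judge APIs; the judge identifiers and the helper are stand-ins, not the harness's actual interface:

```python
from statistics import median

# Stand-in judge identifiers, one per model family as in the v3 bench.
JUDGES = ["anthropic-judge", "deepseek-judge", "gemini-judge"]

def judge_semantic(judge: str, transcript: str) -> float:
    """Placeholder: have one judge model score a transcript in [0, 1]."""
    raise NotImplementedError  # wired to the real judge APIs in the harness

def consensus_semantic(transcript: str) -> float:
    """Per-case semantic score: median of three judges from three families.

    Unlike the mean, the median means a single outlier judge (e.g. one that
    scores its own family generously) cannot move the final score on its own.
    """
    return median(judge_semantic(j, transcript) for j in JUDGES)
```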
Every model hit CAPABLE
From qwen3:14b running locally on a single RTX 5080
to claude-opus-4-7 to gemini-3.1-pro-preview,
every single one of the 25 entries cleared all three per-axis
thresholds (≥80% structural, ≥80% functional, ≥0.70 semantic).
The bench can no longer distinguish most third-party models from the Anthropic baseline at the verdict level — the prior sweeps' classifications were essentially case-design and judge-bias noise. The real signal is now cost-per-pass and overall-pass count, not the CAPABLE verdict itself.
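For reference, the verdict itself is nothing more than three per-axis threshold checks per model. A minimal sketch with illustrative numbers; the AxisScores shape and the failure label are assumptions, not the bench's actual record format:

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    structural: float  # fraction of cases with schema-valid tool calls, 0..1
    functional: float  # fraction of cases with the right observable effect, 0..1
    semantic: float    # consensus judge score averaged over cases, 0..1

def verdict(s: AxisScores) -> str:
    """CAPABLE requires clearing all three per-axis thresholds."""
    capable = s.structural >= 0.80 and s.functional >= 0.80 and s.semantic >= 0.70
    return "CAPABLE" if capable else "BELOW THRESHOLD"  # failure label is a stand-in

# Illustrative values only -- every model in this sweep landed on the CAPABLE side.
print(verdict(AxisScores(structural=0.94, functional=0.88, semantic=0.85)))  # CAPABLE
```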
Top performers by overall-pass
| Rank | Provider / Model | Overall | Semantic | $/pass | Cost | s/case |
|---|---|---|---|---|---|---|
| 1 | ollama / deepseek-v4-pro | 14/16 | 0.91 | $0.0022 | $0.030 | 6.93 |
| 1 | ollama / glm-4.7 | 14/16 | 0.89 | $0.0021 | $0.030 | 6.51 |
| 3 | anthropic / claude-haiku-4-5 | 13/16 | 0.87 | $0.0135 | $0.176 | 7.36 |
| 3 | gemini / gemini-2.5-flash-lite | 13/16 | 0.88 | $0.0016 | $0.021 | 6.22 |
| 3 | gemini / gemini-2.5-pro | 13/16 | 0.81 | $0.0068 | $0.088 | 6.08 |
| 3 | gemini / gemini-3.1-pro-preview | 13/16 | 0.87 | $0.0017 | $0.022 | 6.44 |
| 3 | ollama / gpt-oss:120b | 13/16 | 0.81 | $0.0023 | $0.029 | 7.98 |
| 3 | ollama / minimax-m2.7 | 13/16 | 0.87 | $0.0023 | $0.030 | 6.71 |
| 3 | ollama / qwen3-coder-next | 13/16 | 0.90 | $0.0023 | $0.030 | 7.04 |
ollama/deepseek-v4-pro tops both
overall-pass (14/16) AND semantic (0.91) — at
$0.0022/pass it's 6× cheaper than haiku and roughly 37× cheaper than opus
for strictly better outcomes. gemini-2.5-pro is the
only model with 16/16 structural (perfect schema compliance) and
the fastest in the sweep at 6.08 s/case.
Cost-per-pass leaderboard
Sorted cheapest first. Every entry below cleared CAPABLE; what matters is weighing $/pass against overall-pass count.
| Rank | Model | $/pass | Overall |
|---|---|---|---|
| 1–2 | ollama-local / qwen3:14b, glm-4.7-flash:q4_K_M | $0 | 12/16 |
| 3–4 | deepseek / deepseek-v4-flash, v4-pro (direct) | $0.0015 | 12/16 |
| 5 | gemini / gemini-2.5-flash-lite | $0.0016 | 13/16 |
| 6 | gemini / gemini-3.1-pro-preview | $0.0017 | 13/16 |
| 11 | ollama / glm-4.7 | $0.0021 | 14/16 |
| 12 | ollama / deepseek-v4-pro | $0.0022 | 14/16 |
| 22 | gemini / gemini-2.5-pro | $0.0068 | 13/16 |
| 23 | anthropic / claude-haiku-4-5 | $0.0135 | 13/16 |
| 24 | anthropic / claude-sonnet-4-6 | $0.0164 | 11/16 |
| 25 | anthropic / claude-opus-4-7 | $0.0821 | 12/16 |
Anthropic's models are now the three most expensive entries in the entire 25-model field. Even sonnet at $0.0164/pass is 10× more expensive than gemini-2.5-flash-lite for an objectively lower overall-pass count. Opus is more than 50× more expensive than direct DeepSeek for the same CAPABLE verdict and a lower semantic average.
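For anyone reproducing the table, $/pass is simply each model's total sweep spend divided by its overall-pass count. A minimal sketch; the SweepResult shape is assumed for illustration, and the two rows are taken from the tables above:

```python
from dataclasses import dataclass

@dataclass
class SweepResult:
    model: str
    total_cost_usd: float  # API spend across all 16 cases
    overall_passes: int    # cases passed, out of 16

def cost_per_pass(r: SweepResult) -> float:
    # Local models with $0 spend come out to $0/pass; guard only against zero passes.
    return r.total_cost_usd / r.overall_passes if r.overall_passes else float("inf")

results = [
    SweepResult("anthropic/claude-haiku-4-5", 0.176, 13),
    SweepResult("ollama/glm-4.7", 0.030, 14),
]
for r in sorted(results, key=cost_per_pass):  # cheapest-per-pass first
    print(f"{r.model:30s} ${cost_per_pass(r):.4f}/pass  {r.overall_passes}/16")
```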
Three findings worth taking away
- Opus 4.7 is the worst $/perf in the field. Same CAPABLE verdict as the cheapest open-weights model, but at $0.0821/pass with a lower semantic average (0.86) than contenders that cost a fraction. We've removed it from our production tier policy. The only use case where opus survives is one where you specifically need its reasoning depth and the premium is acceptable; for general agentic tool-use, the math does not work.
- Multi-judge consensus changed which A/B answer was correct. Last week's bench (single Anthropic judge) said direct DeepSeek API beat Ollama-Cloud-hosted DeepSeek by 2 pass points. This week's bench (consensus of three judges) says Ollama Cloud beats direct DeepSeek by ~0.07 semantic average at a 63% cost premium. The relative ranking is the trustworthy signal; absolute scores still drift with judge choice. Treat single-judge results as suggestive, not decisive.
- Local quantized models are now production-credible. glm-4.7-flash:q4_K_M running on a single RTX 5080 at $0 cost comes within one pass of haiku on overall-pass (12/16 vs 13/16) and beats haiku on speed (6.4 vs 7.4 s/case). qwen3:14b ties it. For cost-sensitive deployments where you control the hardware, the cloud-vs-local cost gap is not 10×; it's ∞×.
What we picked
The tier-policy decisions emerging from Sweep #6, for our production routing in loomcycle (sketched as a config after the list):

- Premium tier (quality-critical agents): ollama/deepseek-v4-pro primary, ollama/qwen3-coder-next fallback. 0.91 / 0.90 semantic at roughly 1/37 the cost of opus.
- Mid tier (default): gemini/gemini-2.5-flash-lite primary, gemini/gemini-3.1-pro-preview fallback. Cheapest cloud entries that still hit 13/16 overall.
- Local / free tier: ollama-local/glm-4.7-flash:q4_K_M for offline development; ollama-local/qwen3:14b for non-time-sensitive batch work. Both $0.
- Removed: claude-opus-4-7, claude-sonnet-4-6, and claude-haiku-4-5 from the default tier candidates. Kept as available-on-request for specific workloads where the cost premium is justified.
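Here's roughly what that policy looks like as a routing config. A minimal sketch assuming a flat tier-to-model mapping; the field names and the route helper are illustrative, not loomcycle's actual schema:

```python
# Illustrative tier-policy structure -- not loomcycle's actual config schema.
TIER_POLICY = {
    "premium": {  # quality-critical agents
        "primary": "ollama/deepseek-v4-pro",
        "fallback": "ollama/qwen3-coder-next",
    },
    "mid": {      # default tier
        "primary": "gemini/gemini-2.5-flash-lite",
        "fallback": "gemini/gemini-3.1-pro-preview",
    },
    "local": {    # offline development / free batch work
        "primary": "ollama-local/glm-4.7-flash:q4_K_M",
        "fallback": "ollama-local/qwen3:14b",
    },
}

# Pulled from the default tiers; available on explicit request only.
ON_REQUEST = [
    "anthropic/claude-opus-4-7",
    "anthropic/claude-sonnet-4-6",
    "anthropic/claude-haiku-4-5",
]

def route(tier: str, prefer_fallback: bool = False) -> str:
    """Resolve one of the default tiers to a model id."""
    entry = TIER_POLICY[tier]
    return entry["fallback"] if prefer_fallback else entry["primary"]

print(route("mid"))  # gemini/gemini-2.5-flash-lite
```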
One closing note on methodology
The biggest meta-finding across this bench journey: a single judge model imports its own family's biases into the semantic scores. Anthropic's judge ran harshest on third-party models; running consensus across three families brought third-party scores up across the board (glm-4.7 moved +0.26 on semantic with consensus vs single-judge; even Sonnet itself moved +0.10). The relative ordering survives — that's the signal you can trust. Absolute semantic averages are not portable across single-judge / multi-judge regimes.
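If you want to sanity-check that yourself, compare the rank order of models under the two judge regimes rather than the raw averages. A small sketch with made-up numbers for three hypothetical models, not the sweep's actual per-model data:

```python
# Made-up semantic averages under each judge regime, for illustration only.
single_judge = {"model-a": 0.62, "model-b": 0.66, "model-c": 0.75}
consensus    = {"model-a": 0.84, "model-b": 0.87, "model-c": 0.88}

def ranking(scores: dict[str, float]) -> list[str]:
    """Models ordered best-first by semantic average."""
    return sorted(scores, key=scores.get, reverse=True)

# Absolute scores shift a lot between regimes...
print({m: round(consensus[m] - single_judge[m], 2) for m in single_judge})
# ...but the best-first ordering is what carries across judge setups.
print(ranking(single_judge) == ranking(consensus))  # True for this toy data
```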
We're done with the bench iteration for now. v3 + multi-judge is a stable foundation. The next sweep happens when there's a new model family worth admitting or a new case dimension worth adding.