loomcycle
§ benchmarks · final sweep

The final bench scoreboard: 25 models, $1.92, all CAPABLE

Following yesterday's retraction — where we found that our bench's v1 case authoring was punishing models for *not* making up REST-style parameters that the actual MCP tools didn't accept — we shipped a v3 bench with four case-design fixes, per-case allowed_tools narrowing, and (most importantly) a multi-judge consensus on the semantic axis: scores are the median of three judges from three different families (Anthropic + DeepSeek + Gemini) rather than a single Anthropic judge.
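
Mechanically, the consensus step is just a per-case median over the three judges' semantic scores. A minimal sketch of that aggregation (the judge labels and function name are ours, not the bench's actual code):

```python
from statistics import median

# Placeholder IDs for the three judge families (Anthropic, DeepSeek, Gemini).
JUDGES = ["anthropic-judge", "deepseek-judge", "gemini-judge"]

def consensus_semantic(scores_by_judge: dict[str, float]) -> float:
    """Median of the three judges' semantic scores for a single case."""
    return median(scores_by_judge[j] for j in JUDGES)

# One harsh judge can no longer drag a case score down on its own.
print(consensus_semantic({"anthropic-judge": 0.55,
                          "deepseek-judge": 0.85,
                          "gemini-judge": 0.80}))  # -> 0.80
```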

Then we ran our largest sweep yet: 25 models across 5 providers, $1.92 in total API spend, three hours wall-clock. This post is the scoreboard.

Every model hit CAPABLE

From qwen3:14b running locally on a single RTX 5080 to claude-opus-4-7 to gemini-3.1-pro-preview, every single one of the 25 entries cleared all three per-axis thresholds (≥80% structural, ≥80% functional, ≥0.70 semantic).
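
The verdict itself is a simple gate: all three axes have to clear their thresholds at once. A minimal sketch of that check, assuming axis scores normalized to 0–1 (the function and the NOT_CAPABLE label are illustrative, not the bench's internals):

```python
def verdict(structural: float, functional: float, semantic: float) -> str:
    """CAPABLE only if every axis clears its v3 threshold from the post."""
    capable = structural >= 0.80 and functional >= 0.80 and semantic >= 0.70
    return "CAPABLE" if capable else "NOT_CAPABLE"

# Illustrative scores: 14/16 structural, 13/16 functional, 0.81 semantic.
print(verdict(14 / 16, 13 / 16, 0.81))  # -> CAPABLE
```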

The bench can no longer *distinguish* most third-party models from the Anthropic baseline at the verdict level; the prior sweeps' classifications were essentially case-design and judge-bias noise. The real signal is now cost-per-pass and overall-pass count, not the CAPABLE verdict itself.

Top performers by overall-pass

| Rank | Provider / Model | Overall | Semantic | $/pass | Total cost | s/case |
|------|------------------|---------|----------|--------|------------|--------|
| 1 | ollama / deepseek-v4-pro | 14/16 | 0.91 | $0.0022 | $0.030 | 6.93 |
| 1 | ollama / glm-4.7 | 14/16 | 0.89 | $0.0021 | $0.030 | 6.51 |
| 3 | anthropic / claude-haiku-4-5 | 13/16 | 0.87 | $0.0135 | $0.176 | 7.36 |
| 3 | gemini / gemini-2.5-flash-lite | 13/16 | 0.88 | $0.0016 | $0.021 | 6.22 |
| 3 | gemini / gemini-2.5-pro | 13/16 | 0.81 | $0.0068 | $0.088 | 6.08 |
| 3 | gemini / gemini-3.1-pro-preview | 13/16 | 0.87 | $0.0017 | $0.022 | 6.44 |
| 3 | ollama / gpt-oss:120b | 13/16 | 0.81 | $0.0023 | $0.029 | 7.98 |
| 3 | ollama / minimax-m2.7 | 13/16 | 0.87 | $0.0023 | $0.030 | 6.71 |
| 3 | ollama / qwen3-coder-next | 13/16 | 0.90 | $0.0023 | $0.030 | 7.04 |

ollama/deepseek-v4-pro tops both overall-pass (14/16) and semantic (0.91); at $0.0022/pass it's 6× cheaper than haiku and roughly 37× cheaper than opus for strictly better outcomes. gemini-2.5-pro is the only model with 16/16 structural (perfect schema compliance) and the fastest in the sweep at 6.08 s/case.
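
For reference, the $/pass column appears to be each model's total run cost divided by its overall-pass count; the table rows reproduce to rounding. A quick sketch of that arithmetic (not the bench's reporting code):

```python
def cost_per_pass(total_cost_usd: float, passes: int) -> float:
    """Total cost of the 16-case run divided by the number of passing cases."""
    return total_cost_usd / passes

# Checked against two rows of the table above.
print(round(cost_per_pass(0.176, 13), 4))  # claude-haiku-4-5 -> 0.0135
print(round(cost_per_pass(0.030, 14), 4))  # glm-4.7          -> 0.0021
```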

Cost-per-pass leaderboard

Sorted cheapest first. Every entry below cleared CAPABLE, so the comparison that matters is $/pass weighed against overall-pass count.

| Rank | Model | $/pass | Overall |
|------|-------|--------|---------|
| 1–2 | ollama-local / qwen3:14b, glm-4.7-flash:q4_K_M | $0 | 12/16 |
| 3–4 | deepseek / deepseek-v4-flash, v4-pro (direct) | $0.0015 | 12/16 |
| 5 | gemini / gemini-2.5-flash-lite | $0.0016 | 13/16 |
| 6 | gemini / gemini-3.1-pro-preview | $0.0017 | 13/16 |
| 11 | ollama / glm-4.7 | $0.0021 | 14/16 |
| 12 | ollama / deepseek-v4-pro | $0.0022 | 14/16 |
| 22 | gemini / gemini-2.5-pro | $0.0068 | 13/16 |
| 23 | anthropic / claude-haiku-4-5 | $0.0135 | 13/16 |
| 24 | anthropic / claude-sonnet-4-6 | $0.0164 | 11/16 |
| 25 | anthropic / claude-opus-4-7 | $0.0821 | 12/16 |

Anthropic's models are now the three most expensive in the entire 25-model field. Even sonnet at $0.0164/pass is 10× more expensive than gemini-2.5-flash-lite for an objectively lower overall-pass count (11/16 vs 13/16). Opus is more than 50× more expensive than direct DeepSeek for the same CAPABLE verdict and a lower semantic average.

Three findings worth taking away

  1. Opus 4.7 has the worst $/perf in the field. Same CAPABLE verdict as the cheapest open-weights model, but at $0.0821/pass with a lower semantic average (0.86) than contenders that cost a fraction of the price. We've removed it from our production tier policy. The only use case where opus survives is one where you specifically need its reasoning depth and the premium is acceptable; for general agentic tool-use, the math does not work.
  2. Multi-judge consensus flipped which side of an A/B comparison won. Last week's bench (single Anthropic judge) said the direct DeepSeek API beat Ollama-Cloud-hosted DeepSeek by 2 pass points. This week's bench (consensus of three judges) says Ollama Cloud beats direct DeepSeek by ~0.07 semantic average at a 63% cost premium. The relative ranking is the trustworthy signal; absolute scores still drift with judge choice. Treat single-judge results as suggestive, not decisive.
  3. Local quantized models are now production-credible. glm-4.7-flash:q4_K_M running on a single RTX 5080 at $0 cost comes within one pass of haiku on overall-pass (12/16 vs 13/16) and beats haiku on speed (6.4 vs 7.4 s/case). qwen3:14b ties it. For cost-sensitive deployments where you control the hardware, the cloud-vs-local cost gap is not 10×; it's ∞×.

What we picked

The tier-policy decisions emerging from Sweep #6, for our production routing in loomcycle:

One closing note on methodology

The biggest meta-finding across this bench journey: a single judge model imports its own family's biases into the semantic scores. Anthropic's judge ran harshest on third-party models; running consensus across three families brought third-party scores up across the board (glm-4.7 moved +0.26 on semantic with consensus vs single-judge; even Sonnet itself moved +0.10). The relative ordering survives — that's the signal you can trust. Absolute semantic averages are not portable across single-judge / multi-judge regimes.

We're done with the bench iteration for now. v3 + multi-judge is a stable foundation. The next sweep happens when there's a new model family worth admitting or a new case dimension worth adding.