
How we selected agent- and tool-capable models with our own benchmark

We spent a day and ~$10.60 in API credits running a benchmark sweep across five providers and roughly two dozen current flagship models to figure out which ones are actually capable of the kind of agentic tool-calling our runtime asks them to do. We came away with a clean set of findings — which were wrong. The bench itself had a bug we didn't catch until the fourth sweep, when we ran Anthropic's Sonnet against the same cases as a sanity check and watched it fail the same way every other model did.

This post is the version with the bug accounted for. It covers what we were trying to measure, how the methodology works, what we initially concluded that wasn't true, what we actually learned when we corrected the bench, and what's going into v2 so this doesn't happen again. The corrected findings genuinely changed how we think about model routing for agentic work — including a result we did not expect involving DeepSeek and Sonnet.

What we were trying to measure

Loomcycle is a multi-provider agentic runtime; each agent definition declares a tier and an effort, and the resolver picks a concrete (provider, model) at request time from the tier's candidate list. That makes which models belong in which tier a load-bearing operator decision — wrong pick and either cost balloons (over-provisioning Sonnet for a job Haiku could do) or quality degrades (a tier-2 model gets handed tool-use work it can't actually pull off).
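To make the shape concrete, here is a minimal sketch of a tier candidate list and resolver. The names and candidate lists are illustrative, not loomcycle's actual config or runtime code:

    # Hypothetical sketch of loomcycle-style tier resolution; names and
    # candidate lists are illustrative, not the runtime's actual config.
    TIERS: dict[str, list[tuple[str, str]]] = {
        "tier-1": [("anthropic", "claude-sonnet-4-6"), ("deepseek", "deepseek-v4-pro")],
        "tier-2": [("ollama", "glm-4.7-flash:q4_K_M"), ("gemini", "gemini-flash")],
    }

    def resolve(tier: str) -> tuple[str, str]:
        # A real resolver would weigh effort, cost, and provider health at
        # request time; this just picks the first candidate to show the shape.
        return TIERS[tier][0]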

To inform tier-policy decisions, we built a small bench harness (loomcycle/bench/) that runs a fixed set of 16 cases against any (provider, model) pair and grades the output along three independent axes: structural (does the output have the required shape), functional (did the tool calls actually work), and semantic (a judge's 0-to-1 score of the answer's content).

A model gets called CAPABLE if it clears 80% on both the structural and functional axes AND averages ≥ 0.70 on semantic. FAIL is below 50% on any axis. MARGINAL is everything in between; whether a MARGINAL model gets promoted is a per-tier operator call. The harness produces evidence; promoting a model into a real tier stays an operator decision.
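In code, the verdict rule is a couple of threshold checks. A minimal sketch, with a hypothetical function name rather than the harness's actual code, treating each axis score as a 0-to-1 fraction:

    # Minimal sketch of the verdict rule; hypothetical names, not bench/'s code.
    def verdict(structural: float, functional: float, semantic_avg: float) -> str:
        if min(structural, functional, semantic_avg) < 0.50:
            return "FAIL"      # below 50% on any axis
        if structural >= 0.80 and functional >= 0.80 and semantic_avg >= 0.70:
            return "CAPABLE"   # clears both hard axes and the semantic bar
        return "MARGINAL"      # borderline evidence; operator decides per-tier

    # e.g. Sonnet's corrected-sweep line: verdict(15/16, 16/16, 0.74) -> "CAPABLE"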

The first three sweeps and the wrong answer they produced

We ran the bench across DeepSeek (Sweep #1, both v4-flash and v4-pro), Gemini (Sweep #2, nine 2.x/3.x models), and Ollama (Sweep #3, four flagship open-weights including gpt-oss:120b, glm-4.7, kimi-k2.6, and the local qwen3:14b).

The headline finding from those three sweeps was startling and consistent: every third-party model family failed the tool-use cases the same way. Specifically, nine of our sixteen cases — the ones requiring schema-correct MCP tool args, multi-turn search-and-ingest, nested-JSON tool args, schema-error recovery, read-reason-write cycles, and self-correction after tool errors — failed across every non-Anthropic model we tried. The structural grader passed the outputs, the functional grader caught the broken tool calls, and the semantic judge saw through the post-hoc fabrications the models produced to cover for the missing tool data. Three independent axes triangulated the same failure pattern.

Our preliminary tier-policy recommendation was: third-party models are content-only candidates; anything requiring real MCP tool use stays on Anthropic. We were about to write that into the production yaml.

The moment the bench fell apart

After the long wait of filtering for models with reasoning and tool-use support and then running the bench, the results were badly disappointing. Not one of the twenty-one models we'd selected across three sweeps cleared the bar. Something was wrong. We'd seen these models stumble on real JobEmber tasks before, so a bench failure wasn't surprising on its face. But it could also mean the problem was elsewhere: in the JobEmber API or MCP surface, in the loomcycle harness, in the benchmark's agent description, or in our interpretation of the results.

Before powering everything off for the night, it seemed worth spending a few more dollars on a reference test to actually pin down the cause. What came back: Sonnet and Haiku failed on the exact same cases, despite handling real JobEmber tasks reliably in production. That narrowed the problem to our loomcycle MCP usage or to the bench itself; this time, the cause was the bench. We started fixing it from there.

What was actually wrong with the bench

Once Sonnet was failing, the root cause was easy to find: the bench cases had been authored against guessed tool-arg shapes — what the bench author thought the tools should look like — not against the actual MCP tool contracts the runtime exposes to models. The four worst offenders:

Bench case expected                  Actual MCP contract
getAgentContext({user_id: "..."})    GET /api/agent/context takes no parameters; user_id is derived from the bearer token
getApplication({app_id: "..."})      the path param is named id, not app_id
postSearchIngest({user_id, rows})    the body shape is {date, matches}; there is no user_id
getResearch({company_id: "..."})     a similar mismatch; params are named differently

Every third-party model had been failing the same nine cases for the same reason: they were calling tools with the correct contracts, and the bench was grading against the wrong contracts. The "third-party tool-use weakness" conclusion was 100% a bench bug, not a model property.

What this taught us about authoring tests against MCP contracts

Models, it turns out, are heavily trained on REST API patterns — the kind where an endpoint like /api/agent/context naturally takes a user_id or id parameter. It's a strong instinct across model families. We were trained on the same instinct.

In our case the tested model was supposed to call mcp__jobs__getAgentContext with empty parameters, because user_id is already encoded in the bearer token. So the model, faithfully following the MCP tool specification, made the call and got the right result back:

tool_calls: [{name: "mcp__jobs__getAgentContext", input: {}}]
final_text: '{"tool_called":"mcp__jobs__getAgentContext",
              "fields_returned":["profile","jobProfile","writingStyles",
                                 "cvTemplates","clTemplates","qaAnswers",
                                 "jobSites","filterRules","projects","studies"],
              "user_id_echo":"bench-user-fixture-001"}'

But our bench's grader — biased by the same REST-API instinct the models are trained on, and never checked against the actual MCP tool spec — expected the call to look like {user_id: 'bench-user-fixture-001'}, and so disqualified a correctly working model.
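Expressed in bench-case terms, the one-field fix looks like this. The case format here is a hypothetical sketch, not the harness's actual schema; only the tool name and arg shapes come from the real contract:

    # Broken vs. corrected expectation for the getAgentContext case.
    # Hypothetical case format; tool name and arg shapes follow the real contract.
    broken_case = {
        "tool": "mcp__jobs__getAgentContext",
        "expected_args": {"user_id": "bench-user-fixture-001"},  # guessed REST-style shape
    }
    corrected_case = {
        "tool": "mcp__jobs__getAgentContext",
        "expected_args": {},  # real contract: no params; user_id rides in the bearer token
    }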

The corrected findings

We rewrote the nine broken cases against the actual MCP tools/list output and re-ran. Sweep #5 covered eleven models across Anthropic, DeepSeek, Gemini, Ollama Cloud, and local Ollama. The result inverted most of our prior tier recommendations.

Headline finding: deepseek-v4-pro (called directly against the DeepSeek API, not via the Ollama Cloud wrapper) ties Sonnet on overall pass at 1/14 the cost:

                  claude-sonnet-4-6   deepseek-v4-pro (direct)   Ratio
Verdict           CAPABLE             CAPABLE                    tie
Structural        15/16               15/16                      tie
Functional        16/16               16/16                      tie
Semantic avg      0.74                0.74                       tie
Overall pass      12/16               12/16                      tie
Cost / sweep      $0.26               $0.018                     14×
Seconds / case    6.5                 6.6                        ≈ tie

Five of seven third-party models hit CAPABLE on the corrected bench. The two that came in MARGINAL — ollama/glm-4.7 and ollama/kimi-k2.6 — both ran through Ollama Cloud's hosting wrapper. Local glm-4.7-flash:q4_K_M (quantized, on a single RTX 5080) hits CAPABLE at $0 cost with semantic 0.73 — within 0.01 of Sonnet's score — and at 6.4 seconds per case is actually faster than Sonnet (6.5) and Haiku (7.8). The cost-quality landscape was the inverse of what we'd assumed.

Two surprises in the speed and cost rankings are worth flagging. First, the hosting layer itself moves scores: glm-4.7 through Ollama Cloud's wrapper graded MARGINAL while the quantized local glm-4.7-flash hit CAPABLE. Second, Haiku is the worst value on the board: at 7.8 seconds per case it's slower than the free local glm-4.7-flash (6.4) while still costing real API money.

One practical takeaway

The most practical takeaway from this whole exercise is the heavily-quantized glm-4.7-flash:q4_K_M running on our local RTX 5080. It beats its own full cloud version on quality (sem 0.73 vs 0.63), easily replaces Haiku, and on text quality and tool use comes within a whisker of Sonnet itself (sem 0.73 vs 0.74; functional 15/16 vs 16/16). At $0 — just the electricity bill.

What's going into v2 of the bench

The bench-bug story has three lessons we're baking into the next revision of the harness:

  1. Case authoring must use real tools/list output from the MCP server, not OpenAPI yaml or guessed names. The MCP tool surface is what the model actually sees; that's what the grader has to grade against. Pinning bench cases to a snapshot of tools/list and re-validating when the snapshot drifts is the right discipline.
  2. A trivial smoke case — a single tool call with no required args, returning a constant — would have caught this on Sweep #1 instead of after four. Smoke cases that exercise the bench harness itself, not the model's capability, are cheap to run and absurdly worth it (both this and the snapshot discipline are sketched after this list).
  3. Independent verification before declaring a cross-family conclusion. The fix here was running Sonnet — a trusted baseline — against the same cases the third-party models were failing. If multiple model families fail in exactly the same shape, the parsimonious explanation is that the test, not the models, is wrong. We should have run this baseline first, not fourth.
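A minimal sketch of what lessons 1 and 2 look like in harness terms. The helper names are hypothetical; the only real piece is that MCP servers expose their tool surface via tools/list:

    # Sketch of snapshot pinning plus a harness smoke case.
    # Hypothetical helpers; assumes an MCP client that can fetch tools/list.
    import hashlib
    import json

    def snapshot_digest(tools_list: list[dict]) -> str:
        # Stable digest of the server's tools/list output, pinned next to the cases.
        blob = json.dumps(tools_list, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def check_snapshot(current_tools: list[dict], pinned_digest: str) -> None:
        # Re-validate the cases whenever the tool surface drifts from the pin.
        if snapshot_digest(current_tools) != pinned_digest:
            raise RuntimeError("tools/list drifted; re-validate bench cases first")

    # Smoke case: one no-arg tool returning known fields. If a trusted baseline
    # model fails this, suspect the harness, not the model.
    SMOKE_CASE = {
        "tool": "mcp__jobs__getAgentContext",
        "expected_args": {},
        "must_return_field": "user_id_echo",  # hypothetical assertion key
    }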

The retraction cost us a day and a small amount of money. It's not the first time this kind of false confidence has cost us something — for a more expensive earlier example, see the $80 prequel about a parallel-spawn loop in an unmigrated code path that taught us to finish the loomcycle migration. The lesson here — that the bench has to be calibrated against the same ground truth the model is seeing, and that independent baselines are required infrastructure, not optional polish — was worth more than the day. Most of what made the corrected findings interesting (DeepSeek tying Sonnet, the hosting-layer-matters A/B, the Haiku-is-the-worst-value result) would have stayed hidden under the prior wrong conclusions. We'd be paying 14× for Sonnet on workloads where deepseek-v4-pro is genuinely indistinguishable.

Loomcycle is open source at github.com/denn-gubsky/loomcycle. The bench harness is in bench/; the full sweep data and the corrections trail are in our internal research log, which we plan to open up with the v1.0 release. If you've hit similar shapes — or if you're benchmarking models for agentic work and want to compare notes — we'd like to hear about it.

Update — one day later: we shipped a v3 bench with multi-judge consensus and ran one more comprehensive sweep across 25 models from 5 providers. Every model hit CAPABLE; the verdict layer stopped discriminating; cost-per-pass became the routing signal. See the final bench scoreboard for the conclusive verdicts after the multi-judge shift.
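Cost-per-pass is just sweep cost divided by passing cases. On the Sweep #5 numbers above, for instance:

    # Worked example: cost-per-pass from the Sweep #5 table above.
    sonnet = 0.26 / 12      # ~$0.0217 per passing case
    deepseek = 0.018 / 12   # ~$0.0015 per passing case
    print(f"ratio: {sonnet / deepseek:.1f}x")  # ~14.4x at identical quality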