How we selected agent- and tool-capable models with our own benchmark
We spent a day and ~$10.60 in API credits running a benchmark sweep across five providers and roughly two dozen current flagship models to figure out which ones are actually capable of the kind of agentic tool-calling our runtime asks them to do. We came away with a clean set of findings — which were wrong. The bench itself had a bug we didn't catch until the fourth sweep, when we ran Anthropic's Sonnet against the same cases as a sanity check and watched it fail the same way every other model did.
This post is the version with the bug accounted for. It covers what we were trying to measure, how the methodology works, what we initially concluded that wasn't true, what we actually learned when we corrected the bench, and what's going into v2 so this doesn't happen again. The corrected findings genuinely changed how we think about model routing for agentic work — including a result we did not expect involving DeepSeek and Sonnet.
What we were trying to measure
Loomcycle is a multi-provider agentic runtime; each agent definition declares a tier and an effort, and the resolver picks a concrete (provider, model) at request time from the tier's candidate list. That makes which models belong in which tier a load-bearing operator decision — wrong pick and either cost balloons (over-provisioning Sonnet for a job Haiku could do) or quality degrades (a tier-2 model gets handed tool-use work it can't actually pull off).
To inform tier-policy decisions, we built a small bench harness (loomcycle/bench/) that runs a fixed set of 16 cases against any (provider, model) pair and grades the output along three independent axes:
- Structural — does the output match the requested shape? If we asked for JSON, is the first character {? Does the schema validate?
- Functional — did the model call the tools we expected, with the arguments we expected, in the order we expected?
- Semantic — is the content actually any good? An Anthropic-side judge rubric grades each case 0.0–1.0 against case-specific criteria.
A model gets called CAPABLE if it clears 80% on both structural and functional axes AND averages ≥ 0.70 on semantic. MARGINAL sits between FAIL and CAPABLE — operator decides per-tier. FAIL is below 50% on any axis. The harness produces evidence; promoting a model into a real tier stays an operator decision.
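To make those thresholds concrete, here's a minimal sketch of the verdict logic in Python. The names (SweepResult, classify) are illustrative, not the actual loomcycle/bench/ API; only the cutoffs come from the description above.

```python
from dataclasses import dataclass

@dataclass
class SweepResult:
    structural_passed: int   # cases matching the requested output shape
    functional_passed: int   # cases with the expected tool calls
    semantic_scores: list    # judge scores per case, each 0.0-1.0
    total_cases: int = 16

def classify(r: SweepResult) -> str:
    structural = r.structural_passed / r.total_cases
    functional = r.functional_passed / r.total_cases
    semantic = sum(r.semantic_scores) / len(r.semantic_scores)
    # CAPABLE: >= 80% on structural and functional, semantic average >= 0.70.
    if structural >= 0.80 and functional >= 0.80 and semantic >= 0.70:
        return "CAPABLE"
    # FAIL: below 50% on any axis.
    if structural < 0.50 or functional < 0.50 or semantic < 0.50:
        return "FAIL"
    # Everything in between: an operator call, per tier.
    return "MARGINAL"
```

Under this sketch, a Sweep #5-style line of 15/16 structural, 16/16 functional, 0.74 semantic classifies as CAPABLE.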
The first three sweeps and the wrong answer they produced
We ran the bench across DeepSeek (Sweep #1, both v4-flash and v4-pro), Gemini (Sweep #2, nine 2.x/3.x models), and Ollama (Sweep #3, four flagship open-weights including gpt-oss:120b, glm-4.7, kimi-k2.6, and the local qwen3:14b).
The headline finding from those three sweeps was startling and consistent: every third-party model family failed tool-use cases the same way. Specifically, nine of our sixteen cases — the ones requiring schema-correct MCP tool args, multi-turn search-and-ingest, nested-JSON tool args, schema-error recovery, read-reason-write cycles, and self-correction after tool errors — failed across every non-Anthropic model we tried. The structural grader passed, the functional grader caught the broken tool calls, and the semantic judge saw through the post-hoc fabrications models produced to cover for the missing tool data. Three independent axes triangulated the same failure pattern.
Our preliminary tier-policy recommendation was: third-party models are content-only candidates; anything requiring real MCP tool use stays on Anthropic. We were about to write that into the production yaml.
The moment the bench fell apart
After the long wait — filtering for models with reasoning and tool-use support, then running the bench — the results disappointed us badly. Not one of the twenty-one models we'd selected across three sweeps cleared the bar. Something was wrong. We'd seen these models stumble on real JobEmber tasks before, so a bench failure wasn't surprising on its face. But it could also mean the problem was elsewhere: in the JobEmber API or MCP surface, in the loomcycle harness, in the benchmark's agent description, or in our interpretation of the results.
Before powering everything off and going to bed, it felt worth running a reference test — fine, another few dollars — to actually find the cause. And what came back? Sonnet and Haiku failed on the exact same cases, despite handling real JobEmber tasks reliably in production. That meant the issue was in our loomcycle MCP usage, or in the bench itself. This time, the cause was the bench. We kept improving it from there.
What was actually wrong with the bench
Once Sonnet was failing, the root cause was easy to find: the bench cases had been authored against guessed tool-arg shapes — what the bench author thought the tools should look like — not against the actual MCP tool contracts the runtime exposes to models. The four worst offenders:
| Bench case expected | Actual MCP contract |
|---|---|
| getAgentContext({user_id: "..."}) | GET /api/agent/context — takes no parameters; user_id derived from bearer |
| getApplication({app_id: "..."}) | path param is named id, not app_id |
| postSearchIngest({user_id, rows}) | body shape is {date, matches} — no user_id |
| getResearch({company_id: "..."}) | similar mismatch — params named differently |
Every third-party model had been failing the same nine cases for the same reason: the models were calling the tools according to the correct contracts, and the bench was grading against the wrong ones. The "third-party tool-use weakness" conclusion was entirely a bench bug, not a model property.
What this taught us about authoring tests against MCP contracts
Models, it turns out, are heavily trained on REST API patterns — the kind where an endpoint like /api/agent/context naturally takes a user_id or id parameter. It's a strong instinct across model families. We were trained on the same instinct.
In our case the tested model was supposed to call mcp__jobs__getAgentContext with empty parameters, because user_id is already encoded in the bearer token. So the model, faithfully following the MCP tool specification, made the call and got the right result back:
```
tool_calls: [{name: "mcp__jobs__getAgentContext", input: {}}]
final_text: '{"tool_called":"mcp__jobs__getAgentContext",
             "fields_returned":["profile","jobProfile","writingStyles",
                                "cvTemplates","clTemplates","qaAnswers",
                                "jobSites","filterRules","projects","studies"],
             "user_id_echo":"bench-user-fixture-001"}'
```
But our bench's grader — biased by the same REST-API instinct the models are trained on, and never having read the actual MCP tool spec — expected the call to look like {user_id: 'bench-user-fixture-001'} and, just as mistakenly, disqualified a correctly-working model.
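A stripped-down reproduction of that grading mistake, with hypothetical case data (the real harness's functional grader is more involved than this):

```python
def functional_grade(expected_calls, actual_calls):
    """Pass only if tool names and args match exactly, in order."""
    if len(expected_calls) != len(actual_calls):
        return False
    return all(e["name"] == a["name"] and e["input"] == a["input"]
               for e, a in zip(expected_calls, actual_calls))

# What the model actually did -- correct per the MCP contract,
# since user_id comes from the bearer token:
actual = [{"name": "mcp__jobs__getAgentContext", "input": {}}]

# What the buggy bench case expected (guessed REST-style args):
buggy_expected = [{"name": "mcp__jobs__getAgentContext",
                   "input": {"user_id": "bench-user-fixture-001"}}]

# The corrected case, authored from the real tools/list schema:
fixed_expected = [{"name": "mcp__jobs__getAgentContext", "input": {}}]

functional_grade(buggy_expected, actual)  # False: correct model, wrong case
functional_grade(fixed_expected, actual)  # True
```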
The corrected findings
We rewrote the nine broken cases against the actual MCP tools/list output and re-ran. Sweep #5 covered eleven models across Anthropic, DeepSeek, Gemini, Ollama Cloud, and local Ollama. The result inverted most of our prior tier recommendations.
Headline finding: deepseek-v4-pro (called directly against the DeepSeek API, not via the Ollama Cloud wrapper) ties Sonnet on overall pass at 1/14 the cost:
| | claude-sonnet-4-6 | deepseek-v4-pro (direct) | Ratio |
|---|---|---|---|
| Verdict | CAPABLE | CAPABLE | tie |
| Structural | 15/16 | 15/16 | tie |
| Functional | 16/16 | 16/16 | tie |
| Semantic avg | 0.74 | 0.74 | tie |
| Overall pass | 12/16 | 12/16 | tie |
| Cost / sweep | $0.26 | $0.018 | 14× |
| Seconds / case | 6.5 | 6.6 | ≈tie |
Five of seven third-party models hit CAPABLE on the corrected bench. The two that came in MARGINAL — ollama/glm-4.7 and ollama/kimi-k2.6 — both ran through Ollama Cloud's hosting wrapper. Local glm-4.7-flash:q4_K_M (quantized, on a single RTX 5080) hits CAPABLE at $0 cost with semantic 0.73 — within 0.01 of Sonnet's score — and at 6.4 seconds per case is actually faster than Sonnet (6.5) and Haiku (7.8). The cost-quality landscape we'd assumed was upside-down.
Two surprises in the speed and cost rankings worth flagging:
- Haiku is the worst value in our test set. Same $0.26 sweep cost as Sonnet, but slower (7.8 s/case vs 6.5) and lower overall pass (11/16 vs 12/16). The traditional "haiku = cheaper" intuition broke because Haiku ran longer per case and accumulated more output tokens. For our workloads, Sonnet is strictly the better Anthropic choice.
- The hosting layer matters as much as the model. The same deepseek-v4-pro model, called directly via the DeepSeek API versus through Ollama Cloud's wrapper, produced different results: direct came in at 12/16 overall and $0.018; via Ollama Cloud, 10/16 and $0.031. That's 73% more cost for slightly worse outcomes. Same model. Always prefer the direct provider API when you have the choice.
One practical takeaway
So the most practical finding from this whole exercise is the heavily-quantized glm-4.7-flash:q4_K_M running on our local RTX 5080. It beats its own full cloud version on quality (sem 0.73 vs 0.63), easily replaces Haiku, and on text quality and tool-use nearly matches Sonnet itself (sem 0.73 vs 0.74; functional 15/16 vs 16/16). At $0 — just the electricity bill.
What's going into v2 of the bench
The bench-bug story has three lessons we're baking into the next revision of the harness:
- Case authoring must use real tools/list output from the MCP server, not OpenAPI yaml or guessed names. The MCP tool surface is what the model actually sees; that's what the grader has to grade against. Pinning bench cases to a snapshot of tools/list and re-validating when the snapshot drifts is the right discipline.
- A trivial smoke case — a single tool call with no required args, returning a constant — would have caught this on Sweep #1 instead of after four sweeps. Smoke cases that exercise the bench harness itself, not the model's capability, are cheap to run and absurdly worth it.
- Independent verification before declaring a cross-family conclusion. The fix here was running Sonnet — a trusted baseline — against the same cases the third-party models were failing. If multiple model families fail in exactly the same shape, the parsimonious explanation is that the test, not the models, is wrong. We should have run this baseline first, not fourth.
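As a sketch of what the snapshot-pinning lesson might look like in harness code: fingerprint the tools/list response so drift is detectable, and refuse to run cases whose expected args fall outside the pinned schema. Function names and the snapshot layout here are hypothetical, not the actual v2 implementation.

```python
import hashlib
import json

def snapshot_digest(tools):
    """Fingerprint a tools/list response so drift can be detected later."""
    return hashlib.sha256(json.dumps(tools, sort_keys=True).encode()).hexdigest()

def validate_case(case, tools):
    """List mismatches between a case's expected calls and the pinned schema."""
    by_name = {t["name"]: t for t in tools}
    errors = []
    for call in case["expected_calls"]:
        tool = by_name.get(call["name"])
        if tool is None:
            errors.append(f"unknown tool: {call['name']}")
            continue
        allowed = set(tool.get("inputSchema", {}).get("properties", {}))
        extra = set(call["input"]) - allowed
        if extra:
            errors.append(f"{call['name']}: args not in schema: {sorted(extra)}")
    return errors

# Pinned snapshot: getAgentContext takes no parameters.
tools = [{"name": "mcp__jobs__getAgentContext",
          "inputSchema": {"type": "object", "properties": {}}}]

# The buggy case from Sweeps #1-#4 now fails validation at authoring time:
buggy_case = {"expected_calls": [{"name": "mcp__jobs__getAgentContext",
                                  "input": {"user_id": "bench-user-fixture-001"}}]}
validate_case(buggy_case, tools)
# ["mcp__jobs__getAgentContext: args not in schema: ['user_id']"]
```

Run at case-authoring time and again before every sweep, this check would have rejected all nine broken cases before a single model was billed.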
The retraction cost us a day and a small amount of money. It is not the first time this kind of false confidence cost us something — for a more expensive earlier example, see the $80 prequel about a parallel-spawn loop in an unmigrated code path that taught me to finish the loomcycle migration. The lesson here — that the bench has to be calibrated against the same ground truth the model is seeing, and that independent baselines are required infrastructure not optional polish — was worth more than the day. Most of what made the corrected findings interesting (DeepSeek tying Sonnet, the hosting-layer-matters A/B, the Haiku-is-the-worst-value result) would have stayed hidden under the prior wrong conclusions. We'd be paying 14× for Sonnet on workloads where deepseek-v4-pro is genuinely indistinguishable.
Loomcycle is open source at github.com/denn-gubsky/loomcycle. The bench harness is in bench/; the full sweep data and the corrections trail are in our internal research log (we'll open-source the harness with the v1.0 release). If you've hit similar shapes — or if you're benchmarking models for agentic work and want to compare notes — we'd like to hear about it.
Update — one day later: we shipped a v3 bench with multi-judge consensus and ran one more comprehensive sweep across 25 models from 5 providers. Every model hit CAPABLE; the verdict layer stopped discriminating; cost-per-pass became the routing signal. See the final bench scoreboard for the conclusive verdicts after the multi-judge shift.