Engineering writeups.
Benchmark findings, architecture decisions, and lessons learned from building loomcycle. We post when we have something useful to share — not on a schedule. Subscribe via RSS if you want to know when that happens.
-
The final bench scoreboard — 25 models, $21.92, all CAPABLE
Sweep #6 with v3 cases + multi-judge consensus across three provider families. Every model passed. The real signal moved to cost-per-pass and overall-pass count.
`ollama/deepseek-v4-pro` topped both quality (0.91 semantic) and price ($0.0022/pass) — beating Opus at 1/75 the cost. Anthropic models are now the three most expensive in the 25-model field.
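The two headline numbers reduce to a few lines. A minimal sketch of consensus passing and cost-per-pass, under assumed record shapes — `runs`, `judges`, and `cost_usd` are illustrative names, not the harness's real fields:

```python
# Hypothetical result records: one entry per (model, case), with one verdict
# per judge and the spend for that case. Shapes are assumed for illustration.
runs = {
    "ollama/deepseek-v4-pro": [
        {"judges": [True, True, True], "cost_usd": 0.0023},
        {"judges": [True, True, False], "cost_usd": 0.0021},
        # ... more cases
    ],
    # ... 24 more models
}

def consensus_pass(judges: list[bool]) -> bool:
    # Multi-judge consensus: the case passes if a majority of judges pass it.
    return sum(judges) > len(judges) / 2

def scoreboard(runs: dict) -> list[tuple[str, int, float]]:
    rows = []
    for model, cases in runs.items():
        passes = sum(consensus_pass(c["judges"]) for c in cases)
        total_cost = sum(c["cost_usd"] for c in cases)
        # Cost-per-pass: total sweep spend divided by consensus passes.
        rows.append((model, passes, total_cost / max(passes, 1)))
    return sorted(rows, key=lambda r: r[2])  # cheapest pass first
```

-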
How we selected agent- and tool-capable models with our own benchmark
We ran a benchmark sweep across five providers to find models suitable for agentic tool-calling — and discovered, four sweeps in, that the bench harness itself had a bug invalidating most of our conclusions. Here's what we learned, what the corrected findings actually say, and what's going into v2 of the bench.
-
How I burned $80 on Claude Code in one Sunday afternoon
A parallel-spawn loop. 100 `claude code --print` instances. MacBook Pro M1 fan at maximum. My `ANTHROPIC_API_KEY` inherited via `execve`. Opus 4.7 on a dumb classification task. The bill: $80. Anthropic's robot denied reimbursement. The architectural lesson became loomcycle.
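The failure compounds from two defaults: children inherit the parent's environment, and each instance is a separately billed session. A hypothetical reconstruction of that loop — not the actual script; the item names and prompt are invented, and the invocation follows the teaser's wording:

```python
import subprocess

# Hypothetical reconstruction of the spawn loop described above.
items = [f"item-{i}" for i in range(100)]  # 100 dumb classification inputs

procs = []
for item in items:
    # subprocess spawns via fork + execve, so each child silently inherits
    # the parent environment -- ANTHROPIC_API_KEY included. Every instance
    # is a full, separately billed Opus-class session.
    p = subprocess.Popen(
        ["claude", "code", "--print", f"Classify this: {item}"],
        stdout=subprocess.DEVNULL,
    )
    procs.append(p)

for p in procs:
    p.wait()  # the fan hits maximum long before these return
```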