Engineering writeups.
Benchmark findings, architecture decisions, and lessons learned from building loomcycle. We post when we have something useful to share — not on a schedule. Subscribe via RSS if you want to know when that happens.
-
The final bench scoreboard — 25 models, $21.92, all CAPABLE
Sweep #6 with v3 cases + multi-judge consensus across three provider families. Every model passed. The real signal moved to cost-per-pass and overall-pass count.
`ollama/deepseek-v4-pro` topped both quality (0.91 semantic) and price ($0.0022/pass) — beating Opus at 1/75 the cost. Anthropic models are now the three most expensive in the 25-model field.
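The two headline numbers reduce to a few lines. A minimal sketch of consensus passing and cost-per-pass, under assumed record shapes — `runs`, `judges`, and `cost_usd` are illustrative names, not the harness's real fields:

```python
# Hypothetical result records: one entry per (model, case), with one verdict
# per judge and the spend for that case. Shapes are assumed for illustration.
runs = {
    "ollama/deepseek-v4-pro": [
        {"judges": [True, True, True], "cost_usd": 0.0023},
        {"judges": [True, True, False], "cost_usd": 0.0021},
        # ... more cases
    ],
    # ... 24 more models
}

def consensus_pass(judges: list[bool]) -> bool:
    # Multi-judge consensus: the case passes if a majority of judges pass it.
    return sum(judges) > len(judges) / 2

def scoreboard(runs: dict) -> list[tuple[str, int, float]]:
    rows = []
    for model, cases in runs.items():
        passes = sum(consensus_pass(c["judges"]) for c in cases)
        total_cost = sum(c["cost_usd"] for c in cases)
        # Cost-per-pass: total sweep spend divided by consensus passes.
        rows.append((model, passes, total_cost / max(passes, 1)))
    return sorted(rows, key=lambda r: r[2])  # cheapest pass first
```

-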
How we selected agent- and tool-capable models with our own benchmark
We ran a benchmark sweep across five providers to find models suitable for agentic tool-calling — and discovered, four sweeps in, that the bench harness itself had a bug invalidating most of our conclusions. Here's what we learned, what the corrected findings actually say, and what's going into v2 of the bench.
-
How I burned $80 on Claude Code in one Sunday afternoon
A parallel-spawn loop. 100 `claude code --print` instances. MacBook Pro M1 fan at maximum. My `ANTHROPIC_API_KEY` inherited via `execve`. Opus 4.7 on a dumb classification task. The bill: $80. Anthropic's robot denied reimbursement. The architectural lesson became loomcycle.
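The failure compounds from two defaults: children inherit the parent's environment, and each instance is a separately billed session. A hypothetical reconstruction of that loop — not the actual script; the item names and prompt are invented, and the invocation follows the teaser's wording:

```python
import subprocess

# Hypothetical reconstruction of the spawn loop described above.
items = [f"item-{i}" for i in range(100)]  # 100 dumb classification inputs

procs = []
for item in items:
    # subprocess spawns via fork + execve, so each child silently inherits
    # the parent environment -- ANTHROPIC_API_KEY included. Every instance
    # is a full, separately billed Opus-class session.
    p = subprocess.Popen(
        ["claude", "code", "--print", f"Classify this: {item}"],
        stdout=subprocess.DEVNULL,
    )
    procs.append(p)

for p in procs:
    p.wait()  # the fan hits maximum long before these return
```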