Better Models: Worse Tools
What happened
Armin Ronacher (author of Flask and Werkzeug) published an analysis on July 4, 2026, showing that newer Claude models, specifically Opus 4.8 and Sonnet 5, follow non-standard tool-call schemas less reliably than older Claude models and hallucinate fields that are not present in the schema.
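To make the failure mode concrete, here is a minimal sketch of how a third-party harness might detect hallucinated fields in a tool call. It is not taken from Ronacher's post: the function name `find_schema_violations`, the `EDIT_FILE_SCHEMA` tool definition, and the sample arguments are all invented for illustration, and the check only compares top-level field names against a JSON-Schema-style `properties` and `required` list.

```python
from typing import Any


def find_schema_violations(
    schema: dict[str, Any], arguments: dict[str, Any]
) -> dict[str, list[str]]:
    """Compare a tool call's top-level argument names with the declared schema."""
    declared = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    supplied = set(arguments)
    return {
        # Fields the model sent that the schema never declared.
        "unexpected": sorted(supplied - declared),
        # Required fields the model failed to send.
        "missing": sorted(required - supplied),
    }


# Hypothetical custom tool schema, deliberately using non-standard field names.
EDIT_FILE_SCHEMA = {
    "type": "object",
    "properties": {
        "target": {"type": "string"},
        "patch": {"type": "string"},
    },
    "required": ["target", "patch"],
}

# Hypothetical model output (made up for this sketch): two field names
# appear that the schema above does not define.
call_arguments = {"file_path": "app.py", "patch": "...", "replace_all": False}

report = find_schema_violations(EDIT_FILE_SCHEMA, call_arguments)
print(report)
# {'unexpected': ['file_path', 'replace_all'], 'missing': ['target']}
```

A harness could log such reports per model to compare how often each one drifts from a custom schema, or reject the call and ask the model to retry.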
Context and impact
Ronacher argues that RL post-training focused on Claude Code's native harness creates lock-in to the Anthropic ecosystem: projects that define custom tool schemas see lower reliability on the newer models than on older Claude models. The post gathered 72 points on Hacker News (HN) and was linked by Simon Willison.
Details
- Post published at lucumr.pocoo.org, July 4, 2026
- Conclusion: 'tool schemas are not neutral' on Anthropic models
- Regression documented in Opus 4.8 and Sonnet 5 vs. older Claude versions
- Root cause, per Ronacher: RL training optimized primarily for the Claude Code harness schema format
- Consequence: third-party harnesses become dependent on Anthropic-native schemas
- HN score: 72 points; linked by Simon Willison
Source: lucumr.pocoo.org