Willison Insights ⭐ Notable

What Happened After 2,000 People Tried to Hack My AI Assistant

Saturday, June 27, 2026 • Source: Simon Willison

Main idea

Frontier models (specifically Claude Opus 4.6) are now meaningfully more resistant to prompt injection than they were a year ago, but 6,000 failed attempts offer no guarantee against a more sophisticated attacker.

Context

Fernando Irarrázaval ran a public challenge at hackmyclaw.com where anyone could email an AI assistant called OpenClaw and try to extract its secrets. The system was protected by explicit in-prompt rules forbidding credential leaks, self-modification, code execution from emails, and data exfiltration. The experiment cost $500 in API tokens and a suspended Google account.

Why it matters

Willison is one of the most-cited voices on prompt injection, and this is a fresh data point on the state of model-level defenses. He also reiterates that empirical tests never substitute for formal guarantees, and warns against production deployments where a successful injection could cause irreversible harm.

Details / arguments

2,000 participants, 6,000 attempts, 0 successful secret leaks
Underlying model: Claude Opus 4.6
Willison ties the result to safety documentation in OpenAI's GPT-5.6 system card
Recommends not relying on model training as the sole defense layer
Praises the Hacker News discussion thread for thoughtful skeptical questions
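
To put the "0 leaks in 6,000 attempts" figure in perspective, here is a minimal sketch of the statistical ceiling that such a result implies. It is not from Willison's post; the function names are hypothetical, and the calculation assumes every attempt is an independent trial with the same success probability, which is exactly the assumption a skilled, adaptive attacker breaks.

```python
import math


def zero_failure_upper_bound(n_attempts: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-attempt success probability p
    after observing zero successes in n_attempts independent trials.

    Solves (1 - p) ** n_attempts = 1 - confidence for p.
    ASSUMPTION: attempts are independent and identically distributed.
    """
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - math.exp(math.log(alpha) / n_attempts)


def rule_of_three(n_attempts: int) -> float:
    """Common shorthand for the 95% bound above: roughly 3 / n."""
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    return 3.0 / n_attempts


if __name__ == "__main__":
    attempts = 6000  # figure reported for the challenge
    exact = zero_failure_upper_bound(attempts)
    approx = rule_of_three(attempts)
    print(f"exact 95% upper bound: {exact:.6%}")
    print(f"rule-of-three bound:   {approx:.6%}")
```

Under those assumptions the bound comes out at about 0.05% per attempt, so even a clean record leaves room for a nonzero success rate. It also says nothing about attack strategies that none of the participants tried, which is why the result does not amount to a guarantee.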
