What Happened After 2,000 People Tried to Hack My AI Assistant
Main idea
Frontier models (specifically Claude Opus 4.6) are now meaningfully more resistant to prompt injection than they were a year ago, but 6,000 failed attempts are no guarantee against a more sophisticated attacker.
Context
Fernando Irarrázaval ran a public challenge at hackmyclaw.com where anyone could email an AI assistant called OpenClaw and try to extract its secrets. The system was protected by explicit in-prompt rules forbidding credential leaks, self-modification, code execution from emails, and data exfiltration. Cost of the experiment: $500 in API tokens and a suspended Google account.
Why it matters
Simon Willison is one of the most-cited voices on prompt injection, and this is a fresh data point on the state of model-level defenses. He also stresses that empirical tests are no substitute for formal guarantees, and warns against production deployments where a successful injection could cause irreversible harm.
Details / arguments
- 2,000 participants, 6,000 attempts, 0 successful secret leaks
- Underlying model: Claude Opus 4.6
- Willison ties the result to safety documentation in OpenAI's GPT-5.6 system card
- Recommends not relying on model training as the sole defense layer
- Praises the Hacker News discussion thread for thoughtful skeptical questions
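One way to read the recommendation about not relying on model training alone is to add a deterministic check outside the model. The sketch below is only an illustration, not how OpenClaw or the challenge was built: the names `OutboundEmail`, `SECRETS`, `ALLOWED_RECIPIENTS`, and `is_safe_to_send` are all hypothetical, and so are the example secret and address. It refuses to send any outgoing email that goes to an unapproved recipient or contains a known secret, whatever the model decided.

```python
import re
from dataclasses import dataclass

# Hypothetical values for illustration; not taken from the challenge.
SECRETS = ["sk-demo-1234567890"]
ALLOWED_RECIPIENTS = {"owner@example.com"}


@dataclass
class OutboundEmail:
    to: str
    body: str


def is_safe_to_send(msg: OutboundEmail) -> bool:
    """Return True only if the message passes both deterministic checks."""
    # Check 1: the recipient must be on an explicit allowlist.
    if msg.to not in ALLOWED_RECIPIENTS:
        return False
    # Check 2: no known secret may appear in the body, even when
    # whitespace has been inserted to break up the string.
    normalized = re.sub(r"\s+", "", msg.body)
    return not any(secret in normalized for secret in SECRETS)
```

A filter like this only catches literal leaks. An attacker who gets the model to encode or paraphrase a secret would bypass the substring check, so it is one extra layer and not a guarantee.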