Prompt Injection as Role Confusion
Main idea
Prompt injection isn't an 'attack on a safety filter', it's a structural problem: LLMs decide whether text is a system or user prompt based on surface formatting cues, not semantic role. 'Destyling' (stripping formatting) therefore changes attack success dramatically.
Context
Willison is reacting to a new paper by Ye, Cui and Hadfield-Menell. He has long argued prompt injection has no deterministic defense — this paper, in his view, articulates clearly for the first time why. It reframes the problem from a security filter to a training objective question.
Why it matters
For developers of agentic systems it's a strong argument why perimeter defenses (regex, classifiers) aren't enough — and why role-aware training or execution sandboxing is the right place to invest. For security teams it's a new mental model for threat-modeling LLM apps.
Details / arguments
- Paper: Ye, Cui, Hadfield-Menell (2026)
- 'Destyling' = removing formatting cues from the attack text
- Attack success rate dropped from 61% to 10% after destyling
- Implication: the model 'sees' the role through formatting, not semantics
- Willison: 'our defenses are structurally inadequate until we change training'