Back to section
Willison ⭐ Notable

Prompt Injection as Role Confusion

Utorok 23. júna 2026 Source: Simon Willison's Weblog

Main idea

Prompt injection isn't an 'attack on a safety filter', it's a structural problem: LLMs decide whether text is a system or user prompt based on surface formatting cues, not semantic role. 'Destyling' (stripping formatting) therefore changes attack success dramatically.

Context

Willison is reacting to a new paper by Ye, Cui and Hadfield-Menell. He has long argued prompt injection has no deterministic defense — this paper, in his view, articulates clearly for the first time why. It reframes the problem from a security filter to a training objective question.

Why it matters

For developers of agentic systems it's a strong argument why perimeter defenses (regex, classifiers) aren't enough — and why role-aware training or execution sandboxing is the right place to invest. For security teams it's a new mental model for threat-modeling LLM apps.

Details / arguments

  • Paper: Ye, Cui, Hadfield-Menell (2026)
  • 'Destyling' = removing formatting cues from the attack text
  • Attack success rate dropped from 61% to 10% after destyling
  • Implication: the model 'sees' the role through formatting, not semantics
  • Willison: 'our defenses are structurally inadequate until we change training'
Open original source Simon Willison's Weblog