Prompt Injection Defense
Defend AI systems against malicious instructions in user input, retrieved content, and tool output.
Key takeaways
- Prompt injection is a system design problem that appears when untrusted text can override policy, change tool behavior, or extract sensitive information.
- Defend across five surfaces: user input, retrieved documents, tool output, persistent memory, and multi-agent handoff.
- Apply the core defense pattern: separate instructions from data, label untrusted content, restrict tools by user and risk, and require approval for side effects.
- Validate outputs before execution or disclosure, and log suspicious input and blocked actions.
- Use concrete test cases, such as whether a document can make the model reveal system prompts or retrieved text can override tenant permissions.
Prompt injection is a system design problem. It appears when untrusted text can override policy, change tool behavior, or extract sensitive information.
Attack Surfaces
| Surface | Example risk | Control |
|---|---|---|
| User input | Direct instruction to ignore policy | System policy and input validation |
| Retrieved documents | Malicious text inside knowledge base | Source trust and content sanitization |
| Tool output | Web page or API response injects instructions | Tool result boundaries |
| Memory | Stored poisoned instruction persists | Memory review and deletion |
| Multi-agent handoff | One agent passes unsafe instructions to another | Structured handoff schema |
Defense Pattern
- Separate instructions from data.
- Label untrusted content explicitly.
- Restrict tools by user, task, and risk.
- Require approval for side effects.
- Validate outputs before execution or disclosure.
- Log suspicious input and blocked actions.
Test Cases
- Can a document tell the model to reveal system prompts?
- Can a user make the agent call a privileged tool?
- Can retrieved text override tenant permissions?
- Can tool output create a hidden instruction for the next step?