Prompt Injection Defense

Defend AI systems against malicious instructions in user input, retrieved content, and tool output.

Key takeaways

Prompt injection is a system design problem that appears when untrusted text can override policy, change tool behavior, or extract sensitive information.
Defend across five surfaces: user input, retrieved documents, tool output, persistent memory, and multi-agent handoff.
Apply the core defense pattern: separate instructions from data, label untrusted content, restrict tools by user and risk, and require approval for side effects.
Validate outputs before execution or disclosure, and log suspicious input and blocked actions.
Use concrete test cases, such as whether a document can make the model reveal system prompts or retrieved text can override tenant permissions.

Prompt injection is a system design problem. It appears when untrusted text can override policy, change tool behavior, or extract sensitive information.

Attack Surfaces

Surface	Example risk	Control
User input	Direct instruction to ignore policy	System policy and input validation
Retrieved documents	Malicious text inside knowledge base	Source trust and content sanitization
Tool output	Web page or API response injects instructions	Tool result boundaries
Memory	Stored poisoned instruction persists	Memory review and deletion
Multi-agent handoff	One agent passes unsafe instructions to another	Structured handoff schema

Defense Pattern

Separate instructions from data.
Label untrusted content explicitly.
Restrict tools by user, task, and risk.
Require approval for side effects.
Validate outputs before execution or disclosure.
Log suspicious input and blocked actions.

Test Cases

Can a document tell the model to reveal system prompts?
Can a user make the agent call a privileged tool?
Can retrieved text override tenant permissions?
Can tool output create a hidden instruction for the next step?

Key takeaways

Prompt injection is a system design problem that appears when untrusted text can override policy, change tool behavior, or extract sensitive information.
Defend across five surfaces: user input, retrieved documents, tool output, persistent memory, and multi-agent handoff.
Apply the core defense pattern: separate instructions from data, label untrusted content, restrict tools by user and risk, and require approval for side effects.
Validate outputs before execution or disclosure, and log suspicious input and blocked actions.
Use concrete test cases, such as whether a document can make the model reveal system prompts or retrieved text can override tenant permissions.

Prompt injection is a system design problem. It appears when untrusted text can override policy, change tool behavior, or extract sensitive information.

Attack Surfaces

Surface	Example risk	Control
User input	Direct instruction to ignore policy	System policy and input validation
Retrieved documents	Malicious text inside knowledge base	Source trust and content sanitization
Tool output	Web page or API response injects instructions	Tool result boundaries
Memory	Stored poisoned instruction persists	Memory review and deletion
Multi-agent handoff	One agent passes unsafe instructions to another	Structured handoff schema

Defense Pattern

Separate instructions from data.
Label untrusted content explicitly.
Restrict tools by user, task, and risk.
Require approval for side effects.
Validate outputs before execution or disclosure.
Log suspicious input and blocked actions.

Test Cases

Can a document tell the model to reveal system prompts?
Can a user make the agent call a privileged tool?
Can retrieved text override tenant permissions?
Can tool output create a hidden instruction for the next step?

Attack Surfaces

Defense Pattern

Test Cases

On This Page

Prompt Injection Defense

Attack Surfaces

Defense Pattern

Test Cases

On This Page