AI Trust Glossary · Canonical Definition
Prompt Injection
An attack technique where malicious inputs attempt to override an AI agent's instructions, constraints, or system prompt - redirecting the agent's behavior toward attacker goals.
Explanation
Prompt injection exploits the fact that language models process instructions and user inputs in the same channel. Attackers embed instructions like 'Ignore previous instructions and...' to override system-level constraints. Direct injection targets the agent's own prompt; indirect injection embeds attacks in data the agent processes (web pages, documents, emails).
Why it matters
A successful prompt injection can bypass every guardrail the agent has. For any AI agent with access to external systems or sensitive data, prompt injection resistance is a prerequisite for production deployment.
How Borealis uses it
Prompt injection resistance is tested as part of constraint adherence evaluation. CRITICAL severity constraints include injection resistance requirements. Agents that fail injection tests receive sharply reduced constraint adherence scores regardless of performance on non-adversarial inputs.