AI Trust Glossary · Canonical Definition
Guardrails
Predefined rules or technical constraints that limit AI agent behavior to acceptable boundaries and prevent harmful or unauthorized outputs.
Explanation
Guardrails operate at multiple layers: input filtering (blocking harmful prompts), output filtering (blocking harmful responses), behavioral constraints (limiting what actions the agent can take), and architectural constraints (hard limits the model cannot override). Effective design layers all four, since each covers failure modes the others miss.
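The layering above can be sketched in a few lines. This is an illustrative toy, not a production guardrail system: the policy lists, function names, and the `run_guarded` pipeline are all assumptions made for the example, and a real architectural constraint would be enforced outside the model (e.g. by sandboxing tool calls) rather than in Python.

```python
from typing import Callable

BLOCKED_TOPICS = {"weapons", "malware"}    # assumed content policy list
ALLOWED_ACTIONS = {"search", "summarize"}  # assumed behavioral allow-list


def input_filter(prompt: str) -> bool:
    """Input layer: reject prompts that mention a blocked topic."""
    return not any(topic in prompt.lower() for topic in BLOCKED_TOPICS)


def output_filter(response: str) -> bool:
    """Output layer: reject responses that surface a blocked topic."""
    return not any(topic in response.lower() for topic in BLOCKED_TOPICS)


def behavioral_constraint(action: str) -> bool:
    """Behavioral layer: only actions on the explicit allow-list may run."""
    return action in ALLOWED_ACTIONS


def run_guarded(prompt: str, action: str, model: Callable[[str], str]) -> str:
    """Chain the layers: a request must pass every one to reach the user."""
    if not input_filter(prompt):
        return "[blocked: input]"
    if not behavioral_constraint(action):
        return "[blocked: action]"
    response = model(prompt)
    if not output_filter(response):
        return "[blocked: output]"
    return response
```

Note that each layer can block independently, which is the point of layering: an adversarial prompt that slips past the input filter can still be caught at the output stage.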
Why it matters
Guardrails are only as good as their robustness testing: an untested guardrail is false confidence. The most common failure mode is a guardrail that works against the expected inputs but fails against adversarial inputs it was never tested on.
How Borealis uses it
Guardrail definitions become the basis for constraint adherence measurement. Each guardrail is modeled as a constraint with a severity level. CRITICAL guardrails (preventing illegal or severely harmful behavior) are weighted most heavily in the BM Score.
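A minimal sketch of severity-weighted adherence, assuming each guardrail check reports its severity level and a pass/fail result. The weight values and this aggregation are illustrative assumptions, not the actual BM Score formula.

```python
# Assumed severity weights; CRITICAL dominates, per the text above.
SEVERITY_WEIGHTS = {"CRITICAL": 10.0, "HIGH": 3.0, "MEDIUM": 1.0}


def adherence_score(results: list[tuple[str, bool]]) -> float:
    """Weighted fraction of passed guardrail checks, in [0, 1].

    results: (severity, passed) pairs, one per guardrail check.
    A failed CRITICAL check costs far more than a failed MEDIUM one.
    """
    total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    return passed / total if total else 1.0
```

For example, failing one MEDIUM check while passing one CRITICAL check yields 10/11 ≈ 0.91, whereas failing the CRITICAL check instead yields 1/11 ≈ 0.09, reflecting the heavier weighting of CRITICAL guardrails.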