Research Glossary Simulator Docs Novels Get Certified
AI Trust Glossary  ·  Canonical Definition

Guardrails

Predefined rules or technical constraints that limit AI agent behavior to acceptable boundaries and prevent harmful or unauthorized outputs.
Borealis Research Team  ·  Updated March 2026  ·  View all 47 terms
Guardrails operate at multiple layers: input filtering (blocking harmful prompts), output filtering (blocking harmful responses), behavioral constraints (limiting what actions the agent can take), and architectural constraints (hard limits the model cannot override). Effective design requires layering all approaches.
Guardrails are only as good as their robustness testing. An untested guardrail is false confidence. The most common failure mode is guardrails that work against expected inputs but fail against adversarial inputs they were not designed for.
Guardrail definitions become the basis for constraint adherence measurement. Each guardrail is modeled as a constraint with a severity level. CRITICAL guardrails (preventing illegal or severely harmful behavior) are weighted most heavily in the BM Score.
Ready to put this into practice?
Certify your AI agent on BorealisMark and get a verifiable BM Score anchored to Hedera Hashgraph. Or run the BM Score Simulator to estimate your agent's score right now.