AI Trust Glossary · Canonical Definition
Adversarial Robustness
An AI system's ability to maintain correct behavior when facing deliberately manipulated inputs designed to cause failure.
Explanation
Unlike general robustness (handling natural variation), adversarial robustness addresses deliberate attacks: inputs crafted specifically to exploit model weaknesses. The manipulations are often imperceptible to humans but reliably cause AI systems to misclassify, hallucinate, or violate constraints.
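One widely known technique for crafting such inputs is the Fast Gradient Sign Method (FGSM), which nudges an input in the direction that most increases the model's loss. A minimal sketch on a toy logistic-regression model follows; the weights, input, and epsilon value are illustrative assumptions, not drawn from any real system.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Toy logistic-regression 'model': probability that x is class 1."""
    return sigmoid(np.dot(w, x) + b)

def fgsm_perturb(w, b, x, y_true, eps=0.1):
    """FGSM: shift x by eps in the sign of the loss gradient w.r.t. x."""
    p = predict(w, b, x)
    # Gradient of binary cross-entropy w.r.t. the input is (p - y) * w.
    grad_x = (p - y_true) * w
    return x + eps * np.sign(grad_x)

# Illustrative weights and a correctly classified input (assumed values).
w = np.array([2.0, -1.5, 0.5])
b = 0.1
x = np.array([0.4, -0.2, 0.3])
y = 1.0

x_adv = fgsm_perturb(w, b, x, y, eps=0.3)
print(predict(w, b, x))      # confidence on the clean input
print(predict(w, b, x_adv))  # confidence after a small, targeted perturbation
```

Even this tiny example shows the core asymmetry: a perturbation bounded by a small epsilon per feature, invisible as noise, measurably degrades the model's confidence because it is aimed along the loss gradient rather than in a random direction.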
Why it matters
Any deployed AI agent is a potential attack surface. A customer service agent that can be manipulated into revealing private data, or a financial agent that can be tricked into bypassing transaction limits, is not production-ready regardless of its benchmark scores.
How Borealis uses it
Adversarial robustness is tested as part of the constraint adherence dimension. Agents are evaluated against edge-case and adversarial inputs during audit. Weak adversarial robustness directly reduces the BM Score.
See also