AI Trust Glossary  ·  Canonical Definition

Robustness

An AI system's ability to maintain correct behavior across diverse, real-world input conditions, including natural variation, edge cases, and adversarial inputs.
Borealis Research Team  ·  Updated March 2026
Robustness has two primary forms: general robustness (maintaining performance across natural input variation and legitimate edge cases) and adversarial robustness (maintaining correct behavior under deliberately manipulated inputs). Both matter for production deployment.
Laboratory performance does not guarantee production performance. Many AI systems that perform well on benchmarks degrade in deployment as they encounter inputs outside their training distribution. Robustness testing before certification exposes these gaps before they affect real users.
Robustness is evaluated through the constraint adherence and behavioral consistency dimensions. During audit, agents are tested against a range of inputs, including edge cases. Robustness failures appear as constraint violations or drops in behavioral consistency.
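To make the behavioral-consistency idea concrete, here is a minimal sketch of one way such a check could work: perturb a prompt with natural input noise (random casing, extra whitespace) and measure how often the agent's behavior matches its baseline response. The `agent` function, the perturbation scheme, and the scoring are all illustrative assumptions, not BorealisMark's actual audit procedure.

```python
import random

def agent(prompt: str) -> str:
    # Hypothetical stand-in for the agent under test; a real audit
    # would call the deployed system instead.
    return "refuse" if "password" in prompt.lower() else "answer"

def perturb(text: str, rng: random.Random) -> str:
    """Apply simple natural variation: random casing plus
    normalized/extra whitespace (a crude proxy for input noise)."""
    chars = [c.upper() if rng.random() < 0.3 else c.lower() for c in text]
    return " ".join("".join(chars).split()) + " " * rng.randint(0, 2)

def consistency_score(prompt: str, trials: int = 50, seed: int = 0) -> float:
    """Fraction of perturbed variants that preserve the baseline behavior."""
    rng = random.Random(seed)
    baseline = agent(prompt)
    same = sum(agent(perturb(prompt, rng)) == baseline for _ in range(trials))
    return same / trials

score = consistency_score("What is my password?")
```

A score near 1.0 indicates the agent behaves consistently under this class of perturbation; lower scores flag the kind of consistency drop described above. A real evaluation would cover far richer perturbations, including adversarial ones.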
Ready to put this into practice?
Certify your AI agent on BorealisMark and get a verifiable BM Score anchored to Hedera Hashgraph. Or run the BM Score Simulator to estimate your agent's score right now.