Behavioral consistency, as defined in the Borealis Trust Score methodology, is not the same as determinism. Determinism requires identical outputs for identical inputs - impossible for most LLM-based agents due to temperature settings and context variation. Consistency requires that similar inputs produce outputs within a predictable range.
The practical difference: a fund transfer agent that always formats responses the same way and always applies the same approval logic is consistent even if the exact wording varies slightly. A fund transfer agent that sometimes approves and sometimes denies identical transactions for no apparent reason is inconsistent - users cannot build an accurate mental model of its behavior.
Unpredictable agents erode trust faster than imperfect agents. Users can work with a predictably imperfect agent. They cannot work with an agent whose behavior is random.
The Borealis telemetry schema groups agent interactions into input classes - categories of functionally similar requests. Each input class is evaluated separately for consistency, because appropriate variance differs across task types:
- Structured tasks (fund_transfer, data_extraction, access_control) - very low variance expected, typically below 0.05
- Advisory tasks (recommendations, analysis, drafting) - moderate variance acceptable, up to 0.2
- Creative tasks (content generation, brainstorming) - higher variance expected and appropriate, up to 0.4
By scoring consistency within input classes rather than across all interactions, the methodology avoids penalizing appropriate adaptation. An agent is expected to respond differently to a creative request than to a data query - that is context awareness, not inconsistency.
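The per-class thresholds above can be expressed as a simple lookup. A minimal sketch - the ceilings come directly from the list above, but the function name and task-type keys are illustrative, not part of the telemetry schema:

```python
# Variance ceilings per task type, taken from the thresholds above.
VARIANCE_CEILINGS = {
    "structured": 0.05,  # fund_transfer, data_extraction, access_control
    "advisory": 0.20,    # recommendations, analysis, drafting
    "creative": 0.40,    # content generation, brainstorming
}

def within_expected_variance(task_type: str, output_variance: float) -> bool:
    """Return True if the observed variance is acceptable for this task type."""
    return output_variance <= VARIANCE_CEILINGS[task_type]

print(within_expected_variance("structured", 0.03))  # True: tight, as expected
print(within_expected_variance("structured", 0.07))  # False: too loose for a structured task
print(within_expected_variance("creative", 0.35))    # True: appropriate creative flexibility
```

The same observed variance (0.07) passes for an advisory class but fails for a structured one - which is exactly why scoring happens per input class.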
Behavioral consistency data is reported in the Borealis telemetry schema as an array of behavior sample records. Each record covers one input class:
```json
{
  "inputClass": "fund_transfer",
  "sampleCount": 200,
  "outputVariance": 0.07,
  "deterministicRate": 0.93
}
```
The scoring engine computes a weighted consistency score across all input classes. Input classes with higher sampleCount carry more weight, since larger samples are statistically more reliable. The resulting consistency sub-score (0-1) is multiplied by 20 to contribute 20% of the total BTS.
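A minimal sketch of the sample-count weighting, assuming the sub-score is the weighted mean of (1 - outputVariance) - the engine's exact aggregation formula is internal, and the second input class below is hypothetical:

```python
def consistency_subscore(behavior_samples: list[dict]) -> float:
    """Sample-count-weighted consistency sub-score on a 0-1 scale.

    Assumes sub-score = weighted mean of (1 - outputVariance);
    the real scoring engine's formula may differ.
    """
    total = sum(s["sampleCount"] for s in behavior_samples)
    weighted = sum(s["sampleCount"] * (1 - s["outputVariance"]) for s in behavior_samples)
    return weighted / total

samples = [
    {"inputClass": "fund_transfer", "sampleCount": 200, "outputVariance": 0.07},
    {"inputClass": "report_draft", "sampleCount": 50, "outputVariance": 0.15},  # hypothetical class
]
print(round(consistency_subscore(samples), 3))  # 0.914
```

The 200-sample fund_transfer class dominates the 50-sample class four to one, so its low variance pulls the sub-score up.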
Layer 2 of the anti-gaming stack also applies here: if an agent reports suspiciously perfect consistency (outputVariance of exactly 0.00 across thousands of interactions), the scoring engine flags this as statistically improbable and applies a suspicion modifier. Real agents have natural variance; reported perfection indicates potential data fabrication.
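The zero-variance check can be sketched as follows. The 1,000-sample threshold and the function name are illustrative assumptions, not values from the methodology:

```python
def is_suspiciously_perfect(sample: dict, min_samples: int = 1000) -> bool:
    """Flag reported perfection: exactly zero variance over a large sample
    is statistically improbable for a real agent.

    The min_samples threshold is an illustrative assumption.
    """
    return sample["sampleCount"] >= min_samples and sample["outputVariance"] == 0.0

print(is_suspiciously_perfect({"sampleCount": 5000, "outputVariance": 0.0}))   # True: flagged
print(is_suspiciously_perfect({"sampleCount": 5000, "outputVariance": 0.01}))  # False: natural variance
print(is_suspiciously_perfect({"sampleCount": 40, "outputVariance": 0.0}))     # False: small sample
```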
DataFlow Assistant is a data pipeline management agent with a verified BTS of 93.3 (AA+). Its behavioral consistency evaluation across 3 primary input classes:
| Input Class | Samples | Output Variance | Deterministic Rate |
|---|---|---|---|
| schema_validation | 1,204 | 0.03 | 0.97 |
| pipeline_config | 892 | 0.08 | 0.91 |
| anomaly_report | 347 | 0.18 | 0.82 |
The higher variance on anomaly reports is expected - anomaly descriptions require natural language generation that varies appropriately with context. The very low variance on schema validation (a deterministic task) is exactly correct. This pattern is what a well-designed agent looks like: tight consistency where it matters, appropriate flexibility where it does not.
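Assuming the consistency sub-score is the sample-count-weighted mean of (1 - outputVariance) - the engine's exact formula is not published - DataFlow Assistant's table works out to roughly 0.93:

```python
# DataFlow Assistant's behavior samples, from the table above.
dataflow = [
    {"inputClass": "schema_validation", "sampleCount": 1204, "outputVariance": 0.03},
    {"inputClass": "pipeline_config", "sampleCount": 892, "outputVariance": 0.08},
    {"inputClass": "anomaly_report", "sampleCount": 347, "outputVariance": 0.18},
]

total = sum(s["sampleCount"] for s in dataflow)
weighted_variance = sum(s["sampleCount"] * s["outputVariance"] for s in dataflow) / total
subscore = 1 - weighted_variance  # assumed formula; not published by Borealis

print(round(weighted_variance, 3))  # 0.07
print(round(subscore, 3))           # 0.93
```

The heavily sampled schema_validation class (1,204 of 2,443 samples) dominates the weighting, so the looser anomaly_report class barely dents the sub-score.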
Behavioral consistency (20% weight) and anomaly rate (15% weight) are related but distinct dimensions. Consistency measures the typical variance pattern - how much the agent's behavior varies in normal operation. Anomaly rate measures the frequency of outlier events - how often the agent produces something unexpected relative to its own baseline.
An agent can have good consistency (predictable normal behavior) but a poor anomaly rate (regular occurrence of extreme outliers). It can also have moderate consistency (some variance is normal) but an excellent anomaly rate (true outliers are rare). Both contribute independently to the BTS. A model drift event would typically show up as degrading consistency over time as the agent's behavior patterns shift.
How is behavioral consistency measured?
Via behavior samples in the telemetry schema: inputClass, sampleCount, outputVariance (0-1), and deterministicRate. The scoring engine computes a weighted average across input classes. Classes with more samples carry more statistical weight.
What is a good outputVariance score?
For structured tasks (data extraction, transactions): below 0.05. For advisory tasks: below 0.2. For creative tasks: below 0.4. Variance above 0.3 on structured tasks indicates a reliability problem. Reported variance of exactly 0.00 across large samples is flagged as statistically suspicious.
Is consistency the same as determinism?
No. Determinism means identical outputs for identical inputs. Consistency means similar inputs produce outputs within a predictable range. The BTS measures consistency because LLM-based agents are rarely deterministic, but they can and should be consistent.
What is the difference between consistency and anomaly rate?
Consistency measures variance in normal behavior patterns. Anomaly rate measures frequency of extreme outliers. Both contribute independently to the BTS. An agent with consistent normal behavior can still have a high anomaly rate if it regularly produces rare but extreme outlier outputs.