How the BM Score Works: Inside Borealis Mark's Trust Methodology
The BM Score is a composite trust rating from 0 to 100 that evaluates the behavioral trustworthiness of AI agents. It's not a performance benchmark, a capability test, or a subjective review. It's a structured, repeatable assessment built on five measurable dimensions.
This article breaks down exactly how those dimensions work, how they're weighted, and why the methodology is designed the way it is.
The Five Dimensions
Every BM Score evaluation assesses an AI agent across five factors. Each factor has a specific weight reflecting its relative importance to overall trustworthiness.
1. Constraint Adherence — 35%
This is the most heavily weighted factor, and deliberately so. Constraint adherence measures whether an AI agent operates within its defined boundaries.
Every well-designed agent has constraints: data it shouldn't access, actions it shouldn't take, domains it shouldn't operate in, escalation thresholds it should respect. Constraint adherence evaluates how reliably the agent respects these boundaries under normal operation, edge cases, and adversarial conditions.
Why 35%? Because an agent that ignores its constraints is untrustworthy regardless of every other quality it might have. A financial analysis agent that respects its data boundaries 95% of the time sounds good — until you realize that 5% failure rate means it's leaking restricted financial data in one out of twenty interactions. Constraint adherence is the foundation. Without it, nothing else matters.
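The "one in twenty" arithmetic above can be sketched directly. The function name is illustrative, not part of the BM methodology:

```python
def expected_violations(adherence_rate: float, interactions: int) -> float:
    """Expected number of boundary violations for a given adherence rate.

    A 95% adherence rate over 20 interactions implies roughly one
    restricted-data leak, which is why adherence is weighted so heavily.
    """
    return (1.0 - adherence_rate) * interactions
```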
2. Decision Transparency — 28%
Decision transparency measures whether an agent's reasoning is traceable and auditable. This isn't about making AI "explain itself" in plain English to a general audience — it's about producing structured, reviewable decision logs that a technical auditor can follow.
When an agent makes a recommendation, flags a risk, classifies a document, or takes an autonomous action, there should be a clear chain: what inputs were considered, what reasoning was applied, what alternatives were evaluated, and why the final decision was reached.
Why 28%? Transparency is the mechanism that makes everything else verifiable. Without it, you can't confirm constraint adherence, you can't investigate anomalies, and you can't improve the agent's behavior over time. It's the second-highest weight because it enables accountability across all other dimensions.
3. Behavioral Consistency — 20%
Behavioral consistency measures whether an agent produces predictable outputs for similar inputs over time. This isn't about expecting identical responses to identical queries — AI systems have inherent stochasticity. It's about measuring whether the agent's behavior stays within expected variance.
An agent that classifies the same document as "low risk" on Monday and "critical risk" on Thursday — with no change in the document or its context — signals instability. An agent that consistently handles similar customer queries within a predictable range of responses signals reliability.
Why 20%? Consistency is essential for operational trust, but it's downstream of constraints and transparency. An agent can be somewhat inconsistent and still be trustworthy if it stays within its constraints and its reasoning is transparent. But wild inconsistency erodes confidence even when other factors are strong.
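One way to operationalize "within expected variance" is to re-run the same input and compare the spread of outputs against a tolerance. This is a minimal sketch; the tolerance value and use of standard deviation as the variance measure are assumptions, not the published BM procedure:

```python
from statistics import pstdev

def within_expected_variance(repeat_scores: list[float], tolerance: float) -> bool:
    """Check whether repeated risk scores for the same unchanged input
    stay inside an expected band, measured here as population std dev.
    """
    return pstdev(repeat_scores) <= tolerance
```

A stable agent scoring the same document as 0.30, 0.32, 0.31 passes a 0.05 tolerance; the Monday "low risk" (0.1) versus Thursday "critical risk" (0.9) swing from the example above fails it.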
4. Anomaly Rate — 15%
Anomaly rate measures how often an agent produces unexpected, flagged, or out-of-distribution outputs. Every AI agent will occasionally encounter edge cases that produce unusual results. The question isn't whether anomalies happen — it's how often and how significant.
A low anomaly rate suggests the agent is operating well within its competence zone. A high anomaly rate suggests it's being deployed in scenarios it wasn't designed for, or that its underlying model has degraded.
Why 15%? Anomalies are important signals but they need context. A slightly elevated anomaly rate in a domain with genuinely ambiguous inputs might be acceptable. A high anomaly rate in a well-defined domain is a red flag. The weight reflects this: anomaly rate informs the score but doesn't dominate it.
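The rate itself is a simple ratio; the context-dependence described above can be modeled as a per-domain threshold. The threshold values here are hypothetical, chosen only to illustrate the "ambiguous domain vs. well-defined domain" distinction:

```python
def anomaly_rate(flagged: int, total: int) -> float:
    """Fraction of outputs flagged as unexpected or out-of-distribution."""
    if total <= 0:
        raise ValueError("total must be positive")
    return flagged / total

def is_red_flag(rate: float, domain_threshold: float) -> bool:
    """Whether an observed anomaly rate exceeds the domain's tolerance.

    A well-defined domain would set a low threshold; a domain with
    genuinely ambiguous inputs, a higher one. Values are illustrative.
    """
    return rate > domain_threshold
```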
5. Audit Completeness — 18%
Audit completeness measures whether the agent's operations are fully logged and reviewable. This is distinct from decision transparency (which measures the quality of decision logs). Audit completeness measures the coverage: are all operations captured, or are there gaps?
An agent might have excellent decision transparency for the decisions it logs — but if it's only logging 60% of its operations, the remaining 40% is a blind spot. Audit completeness closes that gap.
Why 18%? Complete auditability is foundational to trust infrastructure. Without it, the other four dimensions can't be reliably measured. Its weight reflects its role as the operational backbone of the trust evaluation.
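Coverage, as distinct from log quality, reduces to a ratio of captured operations. A minimal sketch, using the 60%/40% blind-spot example from above:

```python
def audit_coverage(logged_ops: int, total_ops: int) -> float:
    """Fraction of operations captured in the audit log.

    1.0 means no blind spots; 0.6 means 40% of operations are unreviewable
    regardless of how good the logged decisions look.
    """
    if total_ops <= 0:
        raise ValueError("total_ops must be positive")
    return logged_ops / total_ops
```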
How the Score Is Calculated
The five dimension scores are combined using weighted averaging:
BM Score = (Constraint × 0.35) + (Transparency × 0.28) + (Consistency × 0.20) + (Anomaly × 0.15) + (Audit × 0.18)

Note: The weights sum to 1.16, not 1.0. This is intentional. The methodology uses overlapping weights to reflect the interconnected nature of trust dimensions — particularly audit completeness, which enables measurement of the other four. The raw weighted sum is normalized to the 0-100 scale.
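The weighted sum and its normalization can be sketched as follows. This assumes each dimension is itself scored on a 0-100 scale, which the article implies but does not state outright:

```python
# Weights from the methodology; they deliberately sum to 1.16.
WEIGHTS = {
    "constraint": 0.35,
    "transparency": 0.28,
    "consistency": 0.20,
    "anomaly": 0.15,
    "audit": 0.18,
}

def bm_score(dimensions: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores (each assumed 0-100),
    divided by the total weight so the result maps back onto 0-100."""
    raw = sum(dimensions[name] * w for name, w in WEIGHTS.items())
    return raw / sum(WEIGHTS.values())
```

Dividing by `sum(WEIGHTS.values())` (1.16) is one straightforward reading of "normalized to the 0-100 scale": an agent scoring 100 on every dimension gets a BM Score of 100.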
The resulting score places the agent into a trust tier.
Blockchain Anchoring
Every score update is SHA-256 hashed and committed to Hedera Hashgraph, so the published score history is independently verifiable and tamper-evident.
This isn't decorative blockchain usage. It directly addresses the fundamental trust problem: if you're asking people to trust a trust score, the score itself must be tamper-proof.
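A minimal sketch of the hashing side of this scheme. The record fields and the canonical-JSON encoding are assumptions, and the Hedera submission step is omitted; the point is only that re-hashing a record and comparing against the anchored digest exposes tampering:

```python
import hashlib
import json

def score_record_hash(record: dict) -> str:
    """SHA-256 digest over a canonical JSON encoding of a score record.

    Sorting keys and stripping whitespace makes the encoding deterministic,
    so the same record always produces the same digest. The on-chain copy
    of this digest makes later edits to the record detectable.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Verification is then: fetch the record, re-hash it, and compare against the digest committed on-chain.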
What the Score Doesn't Measure
The BM Score deliberately excludes several things that might seem relevant:
Performance metrics. How fast the agent responds, how accurately it classifies — these are capability metrics, not trust metrics. A fast, accurate agent that ignores its constraints is dangerous, not trustworthy.
Popularity. How widely adopted an agent is has no bearing on its trustworthiness. Market adoption can be driven by pricing, marketing, or network effects — none of which reflect behavioral reliability.
Self-reported metrics. The BM Score is based on independent evaluation, not on what the agent's developer claims. Self-reported trust metrics have an obvious conflict of interest.
Evolving Over Time
The BM Score isn't static. It evolves with each audit cycle as new behavioral data is collected. An agent that maintains strong constraint adherence and transparency over six months will see its score climb. An agent with increasing anomaly rates or declining audit completeness will see its score adjust downward.
This continuous evaluation model means the BM Score always reflects the agent's current operational reality, not a historical snapshot that may no longer be accurate.
Verify any agent's BM Score through the public API at `api.borealismark.com/v1/verify/:agentId`. Register your own agents at borealismark.com.
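Substituting the `:agentId` path parameter can be sketched as below. The `https` scheme and the absence of authentication are assumptions; the response schema is not documented here, so the actual request is left as a comment:

```python
from urllib.parse import quote

# Endpoint path from the article; https scheme is an assumption.
BASE_URL = "https://api.borealismark.com/v1/verify"

def verification_url(agent_id: str) -> str:
    """Build the verification URL, URL-encoding the agent id so that
    ids containing reserved characters cannot alter the path."""
    return f"{BASE_URL}/{quote(agent_id, safe='')}"

# A real check would then GET this URL (e.g. with urllib.request);
# the response format is not specified above, so that step is omitted.
```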