AI Trust Glossary · Canonical Definition
Interpretability
The degree to which a human can understand the internal mechanisms of an AI model - how features, weights, and architecture combine to produce specific outputs.
Explanation
Interpretability asks: can a human inspect the model's workings and understand why it produces the outputs it does? A linear regression is highly interpretable - you can inspect every coefficient and state what it contributes. A large language model is not - its behavior emerges from billions of parameters that resist simple inspection.
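The linear-regression contrast above can be made concrete. In the sketch below (the study-hours data is invented for illustration), the model's entire "internal mechanism" after fitting is two numbers a human can read directly:

```python
# Minimal sketch of an interpretable model: ordinary least squares for
# y = slope * x + intercept. The data (hours studied -> exam score) is
# hypothetical and chosen only to illustrate coefficient inspection.

def fit_simple_ols(xs, ys):
    """Closed-form simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
slope, intercept = fit_simple_ols(hours, scores)

# Every learned parameter is directly inspectable and has a plain meaning:
print(f"each extra hour adds ~{slope:.1f} points (baseline {intercept:.1f})")
```

A billion-parameter model offers no analogous readout: no single weight corresponds to a human-legible claim like "each extra hour adds 4.1 points."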
Why it matters
Interpretability enables diagnosis. When a model behaves unexpectedly, interpretable models allow engineers to find the root cause. In safety-critical domains, interpretability is sometimes a regulatory prerequisite.
How Borealis uses it
Interpretability informs how decision transparency is measured. Agents with lower interpretability face a higher burden on decision transparency - they must compensate through robust reasoning chains and confidence scoring since their internals cannot be inspected.
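One way to picture the compensation described above is a decision record that carries an explicit reasoning chain and a confidence score. This is a hypothetical sketch, not Borealis's actual schema; the field names and thresholds are assumptions:

```python
# Hypothetical sketch (NOT Borealis's real data model): a decision record
# for a low-interpretability agent, compensating with an explicit
# reasoning chain and a confidence score since internals can't be inspected.
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    action: str
    reasoning_chain: list[str] = field(default_factory=list)
    confidence: float = 0.0  # agent-reported, 0.0 .. 1.0

    def is_transparent_enough(self, min_steps: int = 2,
                              min_confidence: float = 0.7) -> bool:
        """Illustrative check; both thresholds are invented for this sketch."""
        return (len(self.reasoning_chain) >= min_steps
                and self.confidence >= min_confidence)

record = DecisionRecord(
    action="approve_refund",
    reasoning_chain=[
        "order falls within refund policy window",
        "payment method verified",
    ],
    confidence=0.86,
)
print(record.is_transparent_enough())
```

The design intuition: since the model's weights cannot be audited, the burden shifts to auditable outputs - each decision must justify itself on the record.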
See also