Explanation

Explanation is the translation layer of augmented intelligence. It transforms complex representations — whether from a neural network, a dataset, or your own reasoning — into communicable form. Without it, AI output is opaque, and opaque output cannot be trusted.

Advanced · 9 min read

The one idea to keep: A genuine explanation is hard to vary and falsifiable — change one part and it breaks, and it commits to claims that could be proven wrong. AI "explanations" are usually plausible post-hoc text, easy to vary and rarely falsifiable. Do not trust an explanation because it reads well; demand the parts that make it checkable, then check them.

The Explanation Problem

Ask a large language model to explain why it gave a particular answer. It will produce a fluent, confident paragraph. It will sound like an explanation. It will have the grammatical structure of an explanation. But it is not an explanation — it is more generated text.

This is the core problem. AI systems can produce outputs that resemble reasoning, but the process that generated those outputs is fundamentally different from human reasoning. A language model does not "decide" to give an answer and then explain its decision. It generates the answer and the explanation through the same statistical process — predicting the next token in a sequence.
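To make "the same statistical process" concrete, here is a minimal sketch using a toy bigram table. The table, the `<answer>` and `<explain>` markers, and the `generate` function are all invented for illustration; a real language model is vastly larger, but the shape of the loop is the point.

```python
# Toy bigram "model": each token maps to candidate next tokens with probabilities.
# Every entry here is invented for illustration only.
BIGRAMS = {
    "<answer>": {"flag": 1.0},
    "flag": {"<end>": 1.0},
    "<explain>": {"because": 1.0},
    "because": {"unusual": 0.6, "risky": 0.4},
    "unusual": {"patterns": 1.0},
    "risky": {"signals": 1.0},
    "patterns": {"<end>": 1.0},
    "signals": {"<end>": 1.0},
}


def generate(start: str, max_tokens: int = 10) -> list:
    """Greedy decoding: repeatedly pick the most probable next token."""
    tokens = []
    current = start
    for _ in range(max_tokens):
        options = BIGRAMS[current]
        current = max(options, key=options.get)
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


# The answer and the "explanation" come from the same function.
# Nothing in the second call inspects how the first result was produced.
answer = generate("<answer>")
explanation = generate("<explain>")
```

In this sketch the explanation is just another call to `generate`. It has no access to a record of the earlier decision, because no such record exists.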

When someone asks you why you chose a particular restaurant, you can trace your actual reasoning: you wanted Thai food, this place had good reviews, it was within walking distance. Each element of the explanation corresponds to an actual factor in your decision. AI explanations do not have this correspondence. The "explanation" is a post-hoc construction that may or may not reflect the computational process that produced the output.

This is not a minor technical issue. It is a fundamental challenge that shapes how you should interact with every AI system you use.

See the difference: two "explanations" for the same flagged transaction

You ask an AI assistant why it flagged a customer's payment as fraud. Both replies below sound confident. Only one is an explanation you can act on.

Plausible post-hoc text

"I flagged this transaction because it showed unusual patterns and a higher-than-normal risk profile based on a range of behavioural signals."

Notice that almost every word can be swapped without changing the feel. "Unusual patterns" could become "atypical activity"; "higher-than-normal risk" could become "elevated exposure". Nothing here makes a claim that could turn out to be false. It explains the flag no matter what the real reason was — which means it explains nothing.

Hard-to-vary explanation

"I flagged it because the amount is 14x this account's median transaction and the merchant country differs from every prior purchase. Remove either condition and the score drops below the threshold."

You cannot vary this without breaking it: each condition does essential work, and it makes a falsifiable prediction — if the account's history actually shows similar amounts or that country before, the explanation is wrong and you will see it immediately.

The first reply was offered. The second only exists because the human demanded the specific conditions, the threshold, and what would have changed the outcome — and then checked the account history to confirm the two claims were actually true.
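The check described above can be sketched in a few lines. This is a minimal illustration, assuming the account history is available as a list of `(amount, country)` pairs and that "14x the median" is read as "at least 14 times the median". The function name, the data, and the country codes are hypothetical.

```python
from statistics import median


def check_explanation(history, flagged, claimed_multiple=14.0):
    """Check the two claims in the explanation against the account history.

    history: list of (amount, country) tuples for prior purchases.
    flagged: (amount, country) for the flagged transaction.
    Returns a dict mapping each claim to True if it holds.
    """
    amounts = [amount for amount, _ in history]
    countries = {country for _, country in history}
    flagged_amount, flagged_country = flagged
    return {
        # Assumption: "14x the median" means at least 14 times the median.
        "amount_is_claimed_multiple_of_median":
            flagged_amount >= claimed_multiple * median(amounts),
        "country_never_seen_before": flagged_country not in countries,
    }


# Invented history where both claims hold.
history = [(20.0, "GB"), (25.0, "GB"), (30.0, "GB")]
result = check_explanation(history, (350.0, "BR"))

# Invented history that falsifies both claims.
other_history = history + [(400.0, "BR")]
refuted = check_explanation(other_history, (350.0, "BR"))
```

Because the explanation commits to specific conditions, each one becomes a boolean you can compute. The vague reply offers nothing that could be written as a check like this.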

When a language model "explains" its answer, what is it actually producing?

More generated text — predicted token by token through the same statistical process that produced the answer, not a trace of any decision.

Why does an AI's explanation lack the correspondence that a human's reasoning has?

It is a post-hoc construction that may not reflect the computational process that produced the output, so its parts need not map to any actual factor in the decision.

Genuine Explanation vs Generated Text

David Deutsch's concept of hard-to-vary explanations gives us a precise tool for distinguishing genuine explanation from noise. A genuine explanation has specific properties:

Every component does essential work. You cannot swap parts of the explanation without breaking it. If you can replace "the model hallucinated because the training data was sparse in this domain" with "the model hallucinated because Mercury was in retrograde" and the explanation feels equally plausible, then neither version is actually explaining anything.
It is traceable. You can follow the chain of reasoning back to evidence. A genuine explanation connects its conclusion to observable facts through a chain of logic that you can inspect and test.
It is falsifiable. A real explanation makes predictions. If the explanation is correct, certain things should follow. If those things do not follow, the explanation is wrong. Generated text rarely makes falsifiable claims — it hedges, qualifies, and keeps its options open.

The test: When AI gives you an explanation, ask — can I remove or change any part of this and still reach the same conclusion? If yes, the explanation is easy to vary, and you should not trust it. If removing any element breaks the reasoning chain, you may have a genuine explanation worth building on.
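When the decision can be re-run, the removal test can be mechanised. The sketch below assumes you have some callable that reproduces the decision; `is_flagged`, its weights, and the signal names are hypothetical stand-ins, not a real scoring system.

```python
def is_flagged(signals):
    """Hypothetical stand-in for the decision being explained."""
    score = (
        0.5 * signals["large_amount"]
        + 0.5 * signals["new_country"]
        + 0.0 * signals["weekend"]  # cited in the explanation, but does no work
    )
    return score >= 1.0


def essential_parts(decide, signals):
    """Return the signals whose removal changes the conclusion."""
    baseline = decide(signals)
    essential = []
    for name in signals:
        ablated = {**signals, name: False}  # remove one part at a time
        if decide(ablated) != baseline:
            essential.append(name)
    return essential


signals = {"large_amount": True, "new_country": True, "weekend": True}
essential = essential_parts(is_flagged, signals)
```

Any part of the explanation that does not appear in `essential` can be dropped without changing the outcome, which is the signature of an easy-to-vary component.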

In Deutsch's terms, what makes an explanation genuine rather than easy to vary?

Every component does essential work — you cannot swap any part without breaking it, so the explanation is hard to vary.

What does a genuine explanation make that generated text usually avoids?

Falsifiable predictions: it commits to things that should follow, so it can be shown wrong. Generated text hedges and keeps its options open.

Interpretability and the Black Box

Modern AI systems are often described as black boxes: you put data in, you get results out, but you cannot see what happens in between. The description is close to literal. A large neural network contains billions of parameters whose interactions are largely opaque, even to the researchers who built it.

Interpretability research attempts to open the black box. Techniques like attention visualisation, feature attribution, and mechanistic interpretability try to answer: what patterns is the model using? Which parts of the input influenced the output? What internal representations did the model form?

This work is valuable but limited. Attribution-style methods can tell you which input tokens most influenced the output, but that alone does not tell you why those tokens mattered in a way that constitutes a genuine explanation. The gap between "what the model did" and "why the model did it" remains large.
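One simple form of feature attribution is occlusion: remove each input token in turn and measure how much the output changes. The sketch below applies it to a toy scoring function. The `WEIGHTS` table and the `score` function are invented stand-ins for a model, so this shows the shape of the technique rather than a real interpretability tool.

```python
WEIGHTS = {"refund": 2.0, "urgent": 1.5, "please": 0.1}  # invented toy weights


def score(tokens):
    """Toy 'model': sums a weight per token; unknown tokens count as 0."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def occlusion_attribution(tokens):
    """Attribute to each token the score drop caused by removing it."""
    full = score(tokens)
    return [
        (t, full - score(tokens[:i] + tokens[i + 1:]))
        for i, t in enumerate(tokens)
    ]


attributions = occlusion_attribution(["please", "send", "urgent", "refund"])
top_token = max(attributions, key=lambda pair: pair[1])[0]
```

The result ranks tokens by influence, which answers "what did the model use". It says nothing about why "refund" carries that weight, which is the part an explanation would need.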

For practitioners, this means you cannot rely on the AI to explain itself. You must build explanation from the outside — by testing the output against known facts, by comparing results across different prompts, by verifying claims independently. The explanation comes from your process, not from the model.

What gap do current interpretability methods still leave open?

The gap between "what the model did" (which tokens it weighted) and "why it did it" — they cannot supply a genuine explanation.

If the model cannot explain itself, where must a practitioner build explanation from?

From the outside, through your own process — testing output against known facts, comparing prompts, and verifying claims independently.

Knowledge Distillation

Knowledge distillation is the process of compressing complex knowledge into simpler, more communicable forms. In machine learning, it literally means training a smaller model to replicate the behaviour of a larger one. But the concept applies far more broadly.
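For the machine-learning sense of the term, a common formulation trains the student to match the teacher's temperature-softened output distribution. The sketch below computes that matching loss in plain Python for a single example. The logits and the temperature are illustrative assumptions, and practical setups typically add a scaling factor and a term for the true labels, both omitted here.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer spread."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [4.0, 1.0, 0.5]          # illustrative logits from the larger model
good_student = [4.0, 1.0, 0.5]     # matches the teacher exactly
poor_student = [0.5, 1.0, 4.0]     # ranks the classes in the opposite order

loss_good = distillation_loss(teacher, good_student)
loss_poor = distillation_loss(teacher, poor_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what "replicating the behaviour" means in this setting.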

Every time you take a 50-page research paper and extract three key findings, you are performing knowledge distillation. Every time you create a diagram that captures the essential logic of a complex system, you are distilling knowledge. The value is not just compression — it is the creation of understanding.

AI can assist with distillation, but it cannot do it for you. A language model can summarise a paper, but it cannot determine which findings are most important to your specific context. It can generate a diagram description, but it cannot judge whether the diagram captures the essential relationships or misses the crucial one. The distillation requires your understanding of what matters — the AI provides speed, you provide judgment.

Effective distillation produces what this framework calls cognitive artifacts — encapsulated structures of meaning that can be shared, reused, and built upon. The quality of your distillation determines the quality of these artifacts, which in turn determines how effectively you and others can reason about the underlying knowledge.

What is knowledge distillation?

Compressing complex knowledge into simpler, more communicable forms — and in doing so creating understanding, not just shorter text.

Why can't a language model do distillation for you?

It cannot judge which findings matter for your specific context — that requires your understanding. The AI supplies speed, you supply judgment.

Making AI Output Verifiable

If you cannot explain how an AI reached its output, the next best thing is to make the output verifiable — to structure your workflow so that AI claims can be checked against reality before you act on them.

Practical strategies for verifiable AI output:

Demand sources. When AI makes factual claims, ask for specific sources. Then check those sources. Language models are known to fabricate citations, so the fact that a reference is provided does not mean the reference exists.
Decompose claims. Break complex AI outputs into individual claims and verify each one separately. A paragraph that is 90% correct is still dangerous if the 10% that is wrong is the part you act on.
Use structured output. Ask AI to provide output in structured formats — tables, schemas, step-by-step reasoning — that make individual claims visible and checkable. Freeform prose hides assumptions; structure exposes them.
Cross-reference. Run the same query through different models or different prompting strategies. Where the outputs agree, you have somewhat higher confidence, though agreement is not proof, since models can share the same errors. Where they disagree, you have identified an area that requires human investigation.
Test predictions. If the AI's output implies certain things should be true, test those implications. A recommendation that "users will prefer option A" can be tested with actual users. An analysis that "this code will fail under load" can be tested with a load test.
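Two of the strategies above, decomposing claims and cross-referencing, lend themselves to a small data structure. This is a minimal sketch under stated assumptions: the `Claim` class, the helper names, and the example claims are hypothetical, and comparing claims by exact string match is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    source: Optional[str] = None  # citation the AI offered, if any
    verified: bool = False        # set True only after an independent check


def needs_checking(claims):
    """Return the claims nobody has verified yet."""
    return [c for c in claims if not c.verified]


def cross_reference(run_a, run_b):
    """Compare claim texts from two runs (exact string match, a simplification)."""
    a, b = set(run_a), set(run_b)
    return sorted(a & b), sorted(a ^ b)


# Invented example: one AI answer broken into individual claims.
claims = [
    Claim("The endpoint returns JSON", source="API docs", verified=True),
    Claim("The endpoint is rate limited to 100 requests per minute"),
]
pending = needs_checking(claims)

# Invented example: the same question asked of two different models.
agreed, disputed = cross_reference(
    ["The endpoint returns JSON", "Retries are automatic"],
    ["The endpoint returns JSON", "Retries must be coded by hand"],
)
```

Everything in `pending` and `disputed` is a queue for human investigation. Items in `agreed` still need verification if you intend to act on them.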

The principle: You do not need to understand how the AI works internally. You need to build a process around the AI that catches errors before they reach production. Explanation is not just about understanding the model — it is about making the entire workflow transparent and correctable.

If you cannot explain how an AI reached its output, what is the next best thing?

Make the output verifiable — structure the workflow so AI claims can be checked against reality before you act on them.

Explanation as a Practice

In the augmented intelligence framework, explanation is not a feature you wait for AI companies to build. It is a practice you develop. Every time you take an AI output and translate it into a form that someone else (or your future self) can understand and verify, you are practising explanation.

This connects directly to meta-cognition — the feedback loop that governs your AI interactions. Meta-cognition asks "am I thinking about this correctly?" Explanation asks "can I communicate what I have found in a way that is traceable and verifiable?" They are complementary disciplines: one governs your internal process, the other governs the output.

Explainer agents — AI systems designed specifically to make other AI systems' outputs understandable — are an emerging capability. But they face the same fundamental challenge: their explanations are also generated text. The human in the loop remains essential, not as a bottleneck but as the only agent in the system capable of genuine understanding.

The trap: "If the AI gives reasons, it explained itself."

Reasons that read fluently are the cheapest thing a language model produces: they come from the same next-token process as the answer, so the model can supply confident-sounding justification for any output, including a wrong one. Fluent reasons are not evidence of a real reasoning chain. The test is not whether reasons were given but whether they are hard to vary and falsifiable: does removing a part change the conclusion, and could the claim turn out to be false? If the answer to either question is no, you have generated text wearing the costume of an explanation.

Why isn't "the AI gave reasons" enough to count as a genuine explanation?

The reasons come from the same next-token process as the answer, so the model can justify any output — including a wrong one. What matters is whether the reasons are hard to vary and falsifiable, not whether they were given.

Try it on your own work

Take a recent AI output you relied on — a recommendation, a diagnosis of a bug, a summary's "key reason" — and pressure-test the explanation behind it.

Try to vary it. Rewrite the explanation swapping its specifics for different ones (different cause, different signal, different number). If the new version sounds just as plausible, the original was easy to vary — it was not explaining your case in particular.
Find the falsifiable claim. Ask what would have to be true in the world if the explanation were correct, and what observation would prove it wrong. If you cannot name one, demand specifics — exact conditions, thresholds, sources — until you can.
Check it against reality. Verify those specifics independently: open the cited source, look at the actual data, run the case the explanation predicts. Keep only the parts that survive.

The explanation you can trust is the one left standing after you have tried to break it and failed.

Continue Learning

Explanation is the translation layer. Next, explore how knowledge persists and grows through memory systems.
