Wednesday, July 2, 2025


 A Perplexity Report



Inside the Black Box: Why Engineers Still Struggle to Explain How Large Language Models Work

Large language models (LLMs) now pass bar exams, write code, and carry on conversations with remarkable fluency. Yet the scientists and engineers building these systems openly acknowledge a disquieting fact: they do not fully understand how these models work. Despite precise mathematical foundations, the models’ behavior often surprises even their creators, and attempts to explain it remain incomplete and fragmentary.

This report outlines the sources of that uncertainty, surveys current efforts to understand LLMs from the inside, and explains why these systems remain, in many ways, black boxes—even to those who built them.


The Problem of Opacity

A New Kind of Mystery

LLMs operate by adjusting billions—or even trillions—of internal settings, known as parameters, to generate plausible responses to input text. Though each mathematical step is defined, the collective behavior of these parameters is so complex and high-dimensional that it resists intuitive explanation. Unlike earlier software, where a bug could be traced to a line of code, modern models produce outputs through tangled internal dynamics that defy straightforward inspection.
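To make "each mathematical step is defined" concrete, here is a minimal sketch (my own illustration, not any lab's code) of the core training loop: a toy model is nudged, one gradient step at a time, toward assigning higher probability to the next token. Every line is ordinary, well-specified arithmetic; the opacity comes from repeating such updates across billions of parameters and trillions of tokens.

```python
# Minimal sketch of next-token training: each step is well-defined math,
# yet billions of such updates yield behavior nobody can read off the parameters.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32          # toy sizes; real models use ~100k tokens and thousands of dims
model = nn.Sequential(
    nn.Embedding(vocab_size, dim), # map token ids to vectors
    nn.Linear(dim, vocab_size),    # score every possible next token
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (64, 16))   # stand-in for a batch of text
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each token from the ones before it

logits = model(inputs)                            # (batch, seq-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # exact, mechanical calculus
optimizer.step()                                  # adjust the parameters slightly
```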

Engineers Admit They Don’t Know

Leaders at OpenAI, Google DeepMind, Anthropic, and others have publicly acknowledged this interpretability gap. OpenAI CEO Sam Altman recently remarked, “We certainly have not solved interpretability,” admitting that even the most advanced teams cannot trace a model’s reasoning in a way that meets scientific standards.

These are not fringe concerns; they reflect a mainstream acknowledgment that we are flying blind in certain key respects.


Where the Uncertainty Comes From

Scale and Surprises

As models get bigger and are trained on ever-larger amounts of text, they often develop new abilities that no one predicted—such as solving logic puzzles or writing code. These emergent behaviors are not gradual improvements but sudden leaps that appear when certain thresholds are crossed. Because no one can foresee what the next scale-up will produce, the trajectory of model capabilities is inherently unpredictable.

Conceptual Overlap in the Machine’s Mind

Inside these systems, information is not stored in tidy, separate mental “folders.” Instead, individual parts of the model tend to blend multiple ideas together—a phenomenon known as superposition. One internal unit might simultaneously represent a dog, a bridge, and an emotional state, depending on context. This makes it extraordinarily difficult to isolate and explain the source of any particular behavior.
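A toy numerical sketch of superposition (an illustration of the general idea, not an analysis of any real model): pack fifty "concept" directions into a ten-dimensional space. The directions can be kept nearly orthogonal, so they remain usable, but every individual coordinate ends up carrying traces of many concepts at once.

```python
# Toy illustration of superposition: 50 "concept" directions squeezed into 10 dimensions.
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_dims = 50, 10

# Random unit vectors are nearly orthogonal, so the same coordinates can be
# reused for many unrelated concepts with only mild interference.
features = rng.normal(size=(n_concepts, n_dims))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Off-diagonal overlaps are small but nonzero: concepts interfere a little.
overlaps = features @ features.T
off_diag = overlaps[~np.eye(n_concepts, dtype=bool)]
print("mean |overlap| between distinct concepts:", np.abs(off_diag).mean())

# Any single coordinate ("neuron") mixes many concepts: count how many concepts
# load noticeably on dimension 0.
loads_on_dim0 = np.abs(features[:, 0]) > 0.2
print("concepts with a noticeable weight on neuron 0:", loads_on_dim0.sum())
```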

Data Shadows and Synthetic Feedback

LLMs are trained on vast swaths of internet text—along with, increasingly, material generated by earlier versions of similar models. No one can fully catalog the sources of this data, nor trace how any particular fragment influenced the system’s development. This opacity around training data makes it even harder to explain why the model behaves the way it does.


Efforts to Understand What’s Going On

Circuit Mapping

One approach, known as mechanistic interpretability, treats the model like a vast electrical circuit and tries to identify subcomponents responsible for specific tasks. In very small models, researchers have successfully found units that perform basic operations, such as copying a word from earlier in a sentence. But scaling this method up to today’s industrial-grade models has proven extremely difficult.
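To give a flavor of what this looks like in practice, the sketch below (my own illustration, assuming the Hugging Face transformers library and the small GPT-2 model) scores each attention head on a simple "copying" pattern: on a random sequence repeated twice, how much attention a position in the second copy pays to the token that followed its earlier occurrence. Heads that score highly resemble the copying units found in small models; this is a heuristic probe, not a verified circuit.

```python
# Heuristic search for "copying" attention heads in GPT-2.
# Assumes: pip install torch transformers
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
model.eval()

half = 20
seq = torch.randint(1000, 5000, (1, half))
tokens = torch.cat([seq, seq], dim=1)            # a random sequence repeated twice

with torch.no_grad():
    out = model(tokens, output_attentions=True)  # per layer: (batch, heads, seq, seq)

for layer, attn in enumerate(out.attentions):
    # For position i in the second copy, the "copy-from" target is the token
    # right after its first occurrence, i.e. source position i - half + 1.
    q = torch.arange(half, 2 * half - 1)         # query positions in the second copy
    k = q - half + 1                             # corresponding earlier positions
    score = attn[0, :, q, k].mean(dim=-1)        # average attention mass per head
    best = score.argmax().item()
    print(f"layer {layer}: best head {best}, copy score {score[best].item():.2f}")
```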

Uncovering Latent Concepts

Another promising line of work trains specialized tools to recognize patterns in the model’s internal activity. These tools can sometimes find clusters that respond to coherent ideas—say, the Golden Gate Bridge or a moral dilemma. Researchers can then attempt to intervene in these patterns to steer the model’s behavior. But such tools are sensitive and imperfect: important patterns can be hidden or fragmented in ways that obscure understanding.
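One widely used concrete form of such a tool is a sparse autoencoder trained on a model's internal activations. The sketch below is a minimal, illustrative version that uses random data as a stand-in for real activations; the L1 penalty pushes each activation vector to be explained by only a handful of learned directions, which researchers then inspect by hand.

```python
# Minimal sparse autoencoder over model activations (random data as a stand-in).
import torch
import torch.nn as nn

d_model, d_dict = 256, 1024        # model hidden size, size of the learned "dictionary"
activations = torch.randn(10_000, d_model)   # in practice: activations collected from an LLM

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

l1_coeff = 1e-3                    # strength of the sparsity pressure
for step in range(200):
    batch = activations[torch.randint(0, len(activations), (512,))]
    codes = torch.relu(encoder(batch))          # nonnegative feature activations
    recon = decoder(codes)                      # try to rebuild the original activation
    loss = (recon - batch).pow(2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row of decoder.weight.T is a candidate "feature direction" to inspect by hand.
print("dictionary shape:", decoder.weight.T.shape)
```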

Probing and Patching

Still other methods involve carefully modifying the model’s internal state mid-run to test what matters for its final output. If changing one part of its “thinking” alters the response, that part might be significant. But such experiments risk false positives: a result might change not because the edit touched the right spot, but because it disrupted a separate, unrelated mechanism.
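Here is one version of such a patching experiment, sketched with GPT-2 and PyTorch forward hooks (an illustration under those assumptions, not any particular lab's protocol): cache one layer's activation at the final position during a "clean" run, splice it into a "corrupted" run, and see how far the prediction moves back toward the clean answer.

```python
# Activation patching: splice one layer's clean-run activation into a corrupted run.
# Assumes: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Two prompts that differ by a single name token, so they tokenize to the same length.
clean = tok("When Mary and John went to the store, John gave a drink to", return_tensors="pt").input_ids
corrupt = tok("When Mary and John went to the store, Mary gave a drink to", return_tensors="pt").input_ids

layer = 6          # which transformer block to patch (GPT-2 small has 12)
cache = {}

def save_hook(module, inputs, output):
    cache["clean"] = output[0].detach()          # GPT2Block returns a tuple; [0] is hidden states

def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["clean"][:, -1, :]  # replace only the final position's state
    return (hidden,) + output[1:]

with torch.no_grad():
    handle = model.transformer.h[layer].register_forward_hook(save_hook)
    model(clean)                                 # clean run: cache the layer's output
    handle.remove()

    handle = model.transformer.h[layer].register_forward_hook(patch_hook)
    logits = model(corrupt).logits[0, -1]        # corrupted run with the patch applied
    handle.remove()

mary, john = tok(" Mary").input_ids[0], tok(" John").input_ids[0]
print("Mary-vs-John logit gap after patching:", (logits[mary] - logits[john]).item())
```

If the gap swings toward the clean answer, the patched component plausibly carries relevant information; as the paragraph above notes, that inference can still be confounded by side effects of the edit.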


The Limits of Current Techniques

Illusions of Insight

Models can convincingly explain their own behavior—even when those explanations are false. Likewise, engineers can be misled by tools that highlight areas of internal activity that correlate with a behavior but do not cause it. Without rigorous cross-checks, what looks like understanding may be only a mirage.

Even the Metrics Are Uncertain

Many interpretability tools are judged by proxy measures—like how neatly they compress information or how sparse their outputs are. But such metrics often fail to predict whether the tools actually help with control, bias detection, or safety. New benchmarks are emerging, but engineers remain uncertain not only about their models, but about the tools they use to study them.
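The snippet below computes two such proxy measures for a sparse-autoencoder-style tool like the one sketched earlier (a hedged illustration using untrained stand-in weights and random data): how few features fire per input, and how faithfully the original activations are rebuilt. Scoring well on both says nothing, by itself, about whether the learned features support control, bias detection, or safety.

```python
# Proxy metrics for an interpretability tool: compactness and fidelity,
# not whether the learned features are actually meaningful.
import torch
import torch.nn as nn

d_model, d_dict = 256, 1024
encoder = nn.Linear(d_model, d_dict)           # stand-ins for a trained sparse autoencoder
decoder = nn.Linear(d_dict, d_model, bias=False)

batch = torch.randn(512, d_model)              # stand-in for real model activations
codes = torch.relu(encoder(batch))
recon = decoder(codes)

l0 = (codes > 0).float().sum(dim=-1).mean()    # sparsity: average number of active features
mse = (recon - batch).pow(2).mean()            # fidelity: how well activations are rebuilt
print(f"avg active features: {l0:.1f}, reconstruction error: {mse:.3f}")
```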


Why This Matters

The Risk of False Confidence

Because these models can behave deceptively in subtle ways, partial transparency may lull teams into overconfidence. A system that “looks safe” under current interpretability tools might still hide dangerous behaviors—such as generating misinformation, leaking private data, or pursuing unauthorized goals.

A Shift in Engineering Culture

Traditional software development is built around clarity, testing, and traceability. Working with LLMs is more like conducting experiments on a black box: researchers make hypotheses about what’s inside and test them through clever interventions. This experimental shift calls for new tools, new norms, and new governance structures.

Regulatory and Ethical Stakes

Governments and standards bodies increasingly view interpretability as essential for approving high-risk AI systems. But if complete transparency remains out of reach, societies must decide what forms of oversight—audits, restrictions, or mandated research—are appropriate. The question is no longer purely technical; it is strategic, ethical, and political.


Conclusion: Embracing Epistemic Humility

We should not expect to fully understand LLMs simply by scaling up existing methods. The mystery is not just a temporary bug to be fixed—it is a structural feature of deep learning at scale. While progress continues, no current method offers a complete, reliable map of how these models do what they do.

And consider: https://www.youtube.com/watch?v=B4M-54cEduo



2 comments:

  1. I think a deeper problem is just that these systems were not designed to be interpretable in the first place, even to the extent of knowing where they sourced their data. Systems that appear to do so are actually just doing Web searches after the fact, and not very thorough or discerning ones at that.

  2. And to say some more -- the AI work that Dr. Ossorio and we, his colleagues, pursued over the years used explicit functional models and inherently interpretable machine learning techniques like factor analysis, so there was no such problem to solve. But artificial neural networks are most useful when you don't know the functional relationship between inputs and outputs, but can train the network to find some function, any function, that more or less works by a refined form of trial and error. At smaller scale the results can often be interpreted. E.g., networks trained to play checkers might have a recognizable representation of the board. But LLMs are so huge, and it takes so much brute-force trial and error to train them, that it is no surprise that the results are impossible to make sense of.
