Literature Review: Mechanistic Interpretability: Understanding Neural Network Internals

Mechanistic interpretability, which systematically reverse-engineers neural network computations, is a critical response to demands for transparent AI. This review synthesizes over 50 studies, identifying key thematic clusters including sparse autoencoders, circuit-level analysis, and the study of training dynamics. Findings reveal that language models encode structured, often linear representations within identifiable components, linking specific capabilities and failures to particular circuits. Methodological advances like sparse autoencoders and causal mediation have uncovered modular pipelines for tasks like arithmetic and factual recall. Research into training dynamics shows learning is non-linear, marked by discrete organizational events such as grokking and feature sharpening, where generalization corresponds to the consolidation of interpretable subnetworks. Furthermore, mechanistic analysis has begun to ground behavioral failures like hallucination in internal structures, informing AI safety through techniques like activation steering. However, significant barriers remain, including combinatorial complexity, representation superposition, and a lack of automated discovery tools, which impede scaling to frontier models. The field also requires formal causal frameworks to validate that discovered circuits genuinely implement algorithms. The review concludes that the field's agenda remains tractable, but that closing the gap between isolated circuit analysis and comprehensive model auditing is an urgent imperative for safety evaluation.

1. Introduction

The opacity of modern neural networks represents one of the most consequential epistemological challenges in contemporary science and technology. As artificial intelligence systems assume increasingly central roles in consequential decisions — from clinical diagnosis to legal reasoning, from financial modeling to the generation of public information — the inability to reliably explain what these systems compute, how they arrive at outputs, and why they fail in the ways they do has emerged as both a scientific problem and a societal concern [1, 2]. Mechanistic interpretability, the subfield dedicated to reverse-engineering the internal computations of neural networks, has advanced rapidly in recent years, moving from early attribution methods toward increasingly precise accounts of the circuits, representations, and algorithms that underlie model behavior [3, 4]. This systematic review synthesizes studies spanning the period from 1994 to 2025, tracing the arc of the field from its foundational techniques to its current methodological frontier, and identifying where critical questions remain unresolved.

The study of neural network internals did not begin with contemporary large language models. For decades, researchers in computational neuroscience and machine learning have investigated whether artificial networks learn representations analogous to those observed in biological neural systems, and whether the internal structure of trained networks can be made legible through careful analysis [5, 6]. Early gradient-based and attribution methods offered partial windows into model behavior [7, 8], establishing that internal activations were not arbitrary but organized in ways that tracked meaningful structure in the training data. These classical contributions laid the empirical and conceptual groundwork upon which more recent mechanistic approaches now build. What distinguishes the current moment, however, is a combination of scale, ambition, and methodological innovation that marks a genuine inflection point in the field.

The emergence of transformer-based large language models capable of sophisticated linguistic and reasoning behavior has made the interpretability problem both more urgent and, paradoxically, more tractable in certain respects [9, 10]. Larger models, trained on richer data, appear to develop more structured internal representations [11, 12], enabling researchers to identify interpretable features, circuits, and information-routing mechanisms [13, 14] that were more difficult to isolate in smaller, earlier architectures. At the same time, scale introduces new challenges: the sheer dimensionality of these systems strains the tools developed for smaller networks, and the question of whether mechanistic findings from well-studied cases generalize across architectures and scales remains substantially open. The field is therefore at a moment of productive tension — accumulating genuine mechanistic insight while confronting the limits of its current methods.

A further dimension of urgency has emerged from the AI safety and alignment community, which has increasingly recognized that behavioral evaluation alone is insufficient to detect critical failure modes such as deceptive alignment and goal misgeneralization [10]. If a model can internally represent objectives that diverge from its training signal while producing compliant outputs under evaluation conditions, then no amount of behavioral red-teaming can certify its alignment [15]. This recognition has reframed mechanistic interpretability not merely as a scientific endeavor but as safety infrastructure — a necessary complement to behavioral testing in any credible alignment evaluation pipeline. The development of representation engineering and activation steering techniques, which allow researchers not only to observe but to intervene upon internal model states, has made this safety framing increasingly concrete. Complementing this static analysis, a rapidly maturing body of work on training dynamics and developmental interpretability has begun to address a related question: not only what circuits exist in a trained model, but when and how those circuits form during training — and whether the process of circuit formation can itself be monitored as a safety-relevant diagnostic [16, 17].

This review is organized around seven thematic clusters that together map the intellectual landscape of neural network interpretability. The first concerns the methodological frontier represented by sparse autoencoders, circuit-level mechanistic interpretability, and direct circuit discovery [18, 19, 20], examining how these tools decompose network activations into interpretable features and trace the computational pathways responsible for specific model behaviors. The second cluster addresses training dynamics, developmental interpretability, and learning phase transitions [16], surveying how circuits and representations crystallize during training through discrete organizational events rather than smooth optimization. The third cluster addresses probing and empirical analysis of language model internals [11, 21], surveying the experimental traditions that have sought to characterize what syntactic, semantic, and factual knowledge is encoded in model representations, and how that knowledge is organized across layers and attention heads. The fourth theme situates artificial neural networks within computational neuroscience and the study of biological cognition, asking whether network internals reflect principles of brain computation and what this correspondence — or its absence — implies for both fields. The fifth cluster examines the classical and gradient-based explainability methods that predated the mechanistic turn [6, 7, 1], assessing their contributions, limitations, and ongoing relevance. The sixth theme encompasses surveys, taxonomies, and synthetic frameworks for explainable artificial intelligence [22, 23], which provide the meta-level scaffolding within which individual empirical contributions can be situated and evaluated. 
The seventh and final cluster addresses the application of mechanistic interpretability to AI safety and alignment [10, 15], examining how interpretability findings connect to alignment threat models, what formal limits constrain verification, and how representation engineering operationalizes interpretability for behavioral control.

The review addresses five central research questions: what methods have proven most effective for identifying meaningful circuits and features in neural networks [3, 24]; what interpretability studies have revealed about how language models represent and process information [12, 25]; how findings about model internals connect to observed model capabilities and failures; how the temporal dynamics of training — including phase transitions, grokking, and circuit formation [16] — relate to the interpretability and safety properties of trained models; and what open problems currently confront efforts to scale interpretability methods to the largest and most capable models in existence [26, 20]. By synthesizing evidence across these dimensions, this review aims to offer a coherent account of where the field stands, what it has established, and where the most significant intellectual work remains to be done.

2. Methodology

The literature search underpinning this review was conducted exclusively through OpenAlex, using a series of targeted keyword queries designed to capture the principal dimensions of mechanistic interpretability research. The five queries spanned complementary aspects of the field: foundational circuit and feature analysis methods, internal representations in language models, the relationship between interpretability and model capabilities or failures, scaling challenges in large language models, and survey-level treatments of deep learning interpretability. By distributing the search across these conceptually distinct but overlapping framings, the strategy was intended to surface both methodological and applied contributions without over-indexing on any single strand of the literature.

Search and Candidate Filtering

Keyword search across the five queries returned an initial pool of 250 candidate papers. These candidates were filtered using a relevance scoring threshold of 0.6, yielding 18 papers that met the minimum standard for topical alignment with mechanistic interpretability — a field whose scope spans circuit analysis, feature decomposition, and causal intervention methods [10].

A citation network expansion stage was then run. Although it retrieved no papers through conventional forward or backward citation traversal, the expansion stage identified 5 further relevant papers while rejecting 11, producing a coverage delta of 0.31 before the process was terminated upon reaching the collection target set for this review. That target — a candidate pool sufficient to support a final corpus of 30 papers — was operationalised as 90 candidates, a three-to-one working ratio intended to preserve meaningful selection pressure during downstream filtering.

Quality filtering was applied consistently across the candidate set. Papers not published within the most recent two years were required to have accumulated at least 5 citations, functioning as a minimum proxy for community engagement. This criterion was calibrated against the relatively short citation half-life typical of fast-moving deep learning subfields [2, 27]. To keep the corpus current, a recency quota stipulated that at least 35% of included papers date from within the two-year window — a constraint that shaped final selection without mechanically excluding older foundational work such as seminal contributions on circuits [3] and superposition [24].
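For concreteness, the inclusion rule just described can be expressed in a few lines of code. This is an illustrative reconstruction of the stated criteria, not the actual selection tooling; the field names and the `CURRENT_YEAR` constant are assumptions made for the sketch.

```python
from dataclasses import dataclass

CURRENT_YEAR = 2025           # assumed reference year for the recency window
RELEVANCE_THRESHOLD = 0.6     # minimum relevance score (Section 2)
MIN_CITATIONS_OLDER = 5       # citation floor for papers outside the window
RECENCY_QUOTA = 0.35          # minimum fraction of recent papers in the corpus

@dataclass
class Paper:
    title: str
    year: int
    citations: int
    relevance: float

def passes_quality_filter(p: Paper) -> bool:
    """Per-paper rule: relevance gate, plus a citation floor applied only
    to papers published more than two years before the reference year."""
    if p.relevance < RELEVANCE_THRESHOLD:
        return False
    if CURRENT_YEAR - p.year > 2 and p.citations < MIN_CITATIONS_OLDER:
        return False
    return True

def meets_recency_quota(corpus) -> bool:
    """Corpus-level rule: at least 35% of papers fall inside the window."""
    recent = sum(1 for p in corpus if CURRENT_YEAR - p.year <= 2)
    return recent / len(corpus) >= RECENCY_QUOTA
```

Under this sketch, an older paper is excluded only by the combination of age and low citation count, so foundational but well-cited work survives the filter, matching the behaviour described above.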

Retrieval and Processing

All 30 papers in the final corpus underwent full-text analysis; no papers were assessed solely from abstracts or metadata. Three papers encountered acquisition failures during retrieval and were replaced with alternative papers drawn from the existing candidate pool, so retrieval failures did not affect the final set. This approach preserved both the corpus size and the integrity of the selection process.

The resulting corpus spans publications from 1994 to 2025, reflecting the dual necessity of grounding the review in early theoretical contributions to neural network analysis [28] while capturing the current frontier of mechanistic interpretability work [3, 10, 4]. Much of this recent work has emerged alongside the rapid scaling of language models [9, 17]. This temporal breadth also accommodates the broader interpretability literature [2], within which mechanistic approaches have increasingly distinguished themselves as a distinct and rigorous subfield [16].

Supplementary Literature Integration

Following completion of the primary review, a supplementary literature search was conducted to specifically examine the intersection of mechanistic interpretability with AI safety, alignment, and representation engineering [10]. This targeted expansion identified eleven additional papers published between 2023 and 2025 that address several key areas: how interpretability methods connect to alignment threat models [29], formal limits on verification [30], behavioral risk phenomena including hallucination [31, 32] and manipulation, and the operationalization of interpretability findings through activation steering [33, 34] and representation engineering. These supplementary papers were integrated into the existing thematic structure where they complemented or extended the primary corpus, and they motivated the addition of a thematic cluster specifically addressing safety and alignment applications.

A second supplementary search was subsequently conducted, this time targeting the intersection of mechanistic interpretability with training dynamics, developmental interpretability, and learning phase transitions. This expansion identified thirteen additional papers published between 2001 and 2025 that address several important topics: how circuits and representations form during training, the phenomenon of grokking and its mechanistic decomposition [16] — wherein models abruptly transition from memorization to generalization — direct circuit discovery through causal mediation and intervention [35, 14, 36], mathematical foundations of transformer dynamics [37], biological analogues to computational learning principles [38], and methodological challenges in probing and feature visualization [39]. These additional papers were integrated across existing thematic clusters and motivated the addition of a dedicated section on training dynamics and developmental interpretability. The expanded corpus thus comprises over 50 papers, preserving the original review’s analytical framework while extending its coverage to encompass both the static analysis of trained models and the temporal processes through which their internal structure emerges.

Thematic Organisation

The papers were organised into seven thematic clusters, a structure that maps naturally onto the conceptual architecture of the field [10] — encompassing circuit-level analysis [3, 36], feature visualisation [24, 18], training dynamics and developmental interpretability [37], representational geometry [40], interpretability methods at scale [26], the relationship between internal structure and observable model behaviour [19], and the application of mechanistic interpretability to AI safety and alignment evaluation [10, 15]. This clustering was applied after full-text analysis and shaped the synthetic framing adopted in the review sections that follow, allowing findings to be presented in terms of substantive research programmes rather than as an undifferentiated catalogue of individual studies.

3. Sparse Autoencoders, Circuit Discovery, and Mechanistic Interpretability

Mechanistic interpretability research has converged on a central technical challenge: neural networks do not organize information in ways that align naturally with human-interpretable concepts. The activations of individual neurons are typically polysemantic, responding to multiple unrelated inputs, which means that understanding a network’s computations requires decomposing its representations into a more interpretable basis. Sparse autoencoders (SAEs) and related tools have emerged as the primary methodological response to this challenge, forming what might be described as the technical engine of the mechanistic interpretability program. Complementing these decomposition methods, a parallel strand of work has pursued direct circuit discovery through causal intervention, tracing the specific algorithms and computational pipelines that models implement for tasks ranging from arithmetic to factual recall. The trajectory of this field moves from theoretical characterization of why interpretability is hard, through the development and scaling of dictionary learning methods, to their deployment across diverse architectures and even biological domains, to the identification of specific algorithms in model weights, and increasingly toward the operationalization of discovered features for safety-relevant behavioral intervention.

Superposition and Polysemanticity: The Theoretical Baseline

Early theoretical work on toy models of superposition provided a precise framework for understanding the foundational obstacle of polysemanticity [24]. Using small ReLU networks trained on synthetic sparse inputs, this research demonstrated that superposition is not merely an artifact of insufficient model capacity, but rather a genuine and rational strategy that networks adopt when the number of features in the environment exceeds the number of available neurons. The key insight is geometric: features can be packed into activation space by representing them as nearly-orthogonal directions. This approach exploits the mathematical fact that high-dimensional spaces admit exponentially many near-orthogonal vectors [3], allowing models to tolerate a small amount of interference in exchange for dramatically expanded representational capacity.
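The geometric fact underpinning this account can be checked directly with a few lines of numerical linear algebra. The sketch below (dimensions chosen arbitrarily for illustration) samples four times as many random unit directions as the ambient space has dimensions and measures their pairwise interference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2048   # ambient dimension vs. number of packed feature directions

# Sample n random unit vectors in d dimensions; n > d already exceeds
# what any orthogonal basis could accommodate.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Interference between distinct directions: off-diagonal |cosine similarity|.
G = np.abs(V @ V.T)
np.fill_diagonal(G, 0.0)
max_interference = G.max()
mean_interference = G.mean()
```

With 2048 directions packed into 512 dimensions, the typical interference between any two directions is only a few percent; this small, tolerable cross-talk is exactly the slack that superposed features exploit.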

Sparsity serves as the critical driver of this phenomenon. As the frequency with which features co-occur decreases, models undergo sharp phase transitions from monosemantic neurons (where one neuron represents one feature) to polysemantic coding schemes where each neuron participates in representing many features simultaneously [24]. This polysemanticity had been identified as a concrete obstacle to mechanistic analysis even before this theoretical framing, with researchers observing neurons that respond to multiple unrelated inputs—a phenomenon that compounds the difficulty of tracing circuits through a network [3].

The theoretical grounding established by this work carries profound implications for interpretability. It demonstrates that the neuron basis is, in general, the wrong basis for understanding network computation [10]. Consequently, recovering the true feature basis requires methods capable of looking beyond individual neurons entirely [18].

Sparse Autoencoders as Dictionary Learning Solutions

Sparse autoencoders (SAEs) provide a practical solution to the superposition problem by learning an overcomplete dictionary of sparse features that better aligns with the underlying representational geometry [24]. Early empirical work demonstrated that SAEs trained on the residual stream activations of language models like Pythia recover features that are substantially more interpretable than neurons, random directions, PCA components, or ICA components, as measured by automated “autointerpretability” scoring [18]. This scoring approach—using a language model to generate descriptions of features and then evaluating how well those descriptions predict feature activations on held-out inputs—established an important methodological precedent for scalable evaluation [18, 41]. Beyond interpretability, features recovered by SAEs proved more causally useful: activation patching experiments on the Indirect Object Identification task [14] required fewer patches and smaller-magnitude edits when using SAE features compared to PCA components [18], providing early evidence that the features were tracking something computationally meaningful rather than merely correlating with outputs. Subsequent principled evaluations have further scrutinized these causal claims, proposing stricter benchmarks for what it means for a decomposition to be both interpretable and causally sufficient [41].
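The dictionary-learning setup described above can be summarized in a short sketch. This is a minimal illustration of the standard ReLU-plus-L1 objective, not the implementation used in the cited work; the tied initialization, pre-encoder bias subtraction, and coefficient value are illustrative assumptions.

```python
import numpy as np

class SparseAutoencoder:
    """Minimal ReLU sparse autoencoder: learns an overcomplete dictionary
    (d_dict > d_model) of sparse features over model activations."""

    def __init__(self, d_model, d_dict, l1_coeff=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.standard_normal((d_model, d_dict)) / np.sqrt(d_model)
        self.b_enc = np.zeros(d_dict)
        self.W_dec = self.W_enc.T.copy()   # tied init; rows are feature directions
        self.b_dec = np.zeros(d_model)
        self.l1_coeff = l1_coeff

    def encode(self, x):
        # Sparse, non-negative feature activations.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruction as a sparse combination of dictionary directions.
        return f @ self.W_dec + self.b_dec

    def loss(self, x):
        """Reconstruction error plus an L1 sparsity penalty on the features."""
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = ((x - x_hat) ** 2).sum(axis=-1).mean()
        sparsity = self.l1_coeff * np.abs(f).sum(axis=-1).mean()
        return recon + sparsity, f
```

In practice the weights would be trained by gradient descent on this loss over cached model activations; the L1 coefficient is the hyperparameter whose tuning burden motivates the TopK alternative discussed next.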

Following these initial successes, subsequent work tackled the scaling and training challenges that would limit the applicability of early SAE approaches. A significant contribution was the introduction of the TopK activation function as a replacement for the standard ReLU with L1 penalty [42]. The L1 penalty approach requires careful tuning of a regularization coefficient to balance sparsity against reconstruction fidelity, and different coefficient values produce incommensurable models. TopK eliminates this hyperparameter sensitivity by directly constraining the number of active features per input, yielding better performance on the sparsity-reconstruction Pareto frontier while simplifying the training procedure [42]. However, this approach introduces its own trade-off: fixing the number of active features is itself a hyperparameter choice that encodes assumptions about the uniform sparsity of all inputs, and the debate over whether one has genuinely escaped hyperparameter sensitivity or merely displaced it remains active [41]. The scaling results in this work were substantial—a 16-million-latent SAE trained on GPT-4 residual stream activations achieved only 7% dead latents [42], demonstrating that dictionary learning could be pushed to frontier-model scale without the catastrophic feature collapse that had plagued smaller experiments. Complementary large-scale efforts, including Gemma Scope [20] and LLaMA Scope [26], have since extended this approach across diverse model families, further validating the scalability of sparse dictionary methods.
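The TopK substitution is mechanically simple: instead of penalizing feature magnitudes and tuning an L1 coefficient, keep exactly k features per input and zero the rest. A minimal sketch of the activation function alone (real TopK SAEs embed this inside a full encoder/decoder with additional training details):

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row, zero out the rest.
    Sparsity is enforced exactly rather than encouraged by a penalty."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, idx,
                      np.take_along_axis(pre_acts, idx, axis=-1), axis=-1)
    return out
```

Applied to the row [3, 1, 4, 1, 5] with k = 2, only the entries 4 and 5 survive. The sparsity level is now a direct architectural constant rather than an emergent property of a tuned penalty, which is precisely the trade-off discussed above: the hyperparameter has arguably been displaced rather than eliminated.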

Architectural Alternatives: Transcoders

The transcoder framework offers a conceptually distinct alternative to SAEs [19]. While SAEs decompose a representation—specifically, the activation at a given layer—into a sparse combination of learned features [18], transcoders instead approximate the computation performed by an MLP block. This approach learns a mapping from input to output that can be expressed as a sparse combination of interpretable operations. This distinction is significant because it fundamentally changes what the discovered features represent: SAE features are constituents of a representation, whereas transcoder features are constituents of a computational process.
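The contrast between the two objectives is easiest to see side by side. In the hedged sketch below, both use the same sparse encoder, but the SAE reconstructs the activation itself while the transcoder regresses the MLP's output; the random weights and plain ReLU encoder are illustrative stand-ins, not the cited architectures.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sae_loss(x, W_e, W_d, l1):
    """SAE objective: reconstruct the activation x itself
    from a sparse feature vector (decomposing a representation)."""
    f = relu(x @ W_e)
    return ((x - f @ W_d) ** 2).mean() + l1 * np.abs(f).mean()

def transcoder_loss(x, mlp_out, W_e, W_d, l1):
    """Transcoder objective: approximate the MLP's input-to-output map
    with the same sparse features (decomposing a computation)."""
    f = relu(x @ W_e)
    return ((mlp_out - f @ W_d) ** 2).mean() + l1 * np.abs(f).mean()
```

The only difference is the regression target, yet that single change is what shifts the learned features from constituents of a representation to constituents of a computational process.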

A crucial advantage of transcoders is that their attributions factorize into two components: an input-invariant term derived purely from model weights that describes general feature relationships, and an input-dependent term reflecting a feature’s activation on a specific input [19]. This factorization has important implications for circuit-finding. Transcoder-based feature circuits directly characterize how information is transformed rather than merely how it is stored, and they sidestep the input-invariance limitation inherent to gradient-based attribution over SAE activations [19].

Empirical comparisons between the two approaches provide valuable insights. Across GPT2-small, Pythia-410M, and Pythia-1.4B—evaluated on a common dataset of 3.2 million tokens—transcoders matched or outperformed SAEs on the sparsity-fidelity Pareto frontier, with the performance gap appearing to widen on larger models [19]. More revealing was a blind evaluation in which human raters judged 41 of 50 transcoder features interpretable versus 38 of 50 SAE features—a non-significant difference suggesting that the two approaches offer comparable interpretability despite their different architectural assumptions [19].

Despite these findings, the fundamental question remains unresolved: should we decompose activations or approximate computation? This choice has significant implications for how discovered circuits should be interpreted causally, and the debate continues.

Extending SAEs to Attention Layers

Early SAE research focused predominantly on residual stream or MLP activations, leaving the attention mechanism—which routes information between token positions—relatively unaddressed [18]. Subsequent work addressed this gap by training SAEs on concatenated pre-linear attention outputs, enabling decomposition of attention layer outputs into sparse, interpretable features [43]. The resulting analysis yielded striking findings: weight-based head attribution revealed that approximately 90% of attention heads in GPT-2 Small are polysemantic, performing multiple distinct tasks depending on context [43]. This finding contrasts with earlier analyses that suggested only a small subset of heads perform specialized, identifiable functions such as positional or syntactic attention [13].

These results extend the superposition thesis [24]—which posits that networks represent more features than they have dimensions by exploiting sparse co-activation—from the representational to the computational level. This extension suggests that attention heads are not modular in the way that intuition about “attention patterns” might imply. Consequently, SAE decomposition of attention outputs offers a route to more precise attribution of which computations are performed where, complementing residual stream decomposition.

Community Infrastructure: Gemma Scope

A persistent limitation of early SAE research was the difficulty of replicating or extending results without access to the substantial computational resources required to train high-quality dictionaries [18]. The release of Gemma Scope addressed this bottleneck by providing over 400 SAEs covering every layer and sublayer of the Gemma 2 2B and 9B models, trained using JumpReLU activation functions on TPUv3 and TPUv5p hardware, with more than 30 million learned features in total [20]. By making these resources publicly available, the project substantially lowered the barrier for community researchers to engage in mechanistic interpretability work without re-solving the training infrastructure problem. Subsequent efforts such as Llama Scope — which trained 256 SAEs across every sublayer of Llama-3.1-8B at multiple feature widths — similarly aimed to establish shared, open-source resources that serve as a common language for the research community [26]. This kind of open tooling serves an important function in a field where the evaluation of SAE quality depends heavily on the scale and comprehensiveness of the feature vocabulary [42].

Cross-Domain Transfer: Protein Language Models

The most recent development in the SAE cluster concerns the transfer of SAE methodology to biological sequence models—a domain where ground truth interpretability criteria are, in principle, more accessible than in natural language. Work applying SAEs and transcoders to ESM2 protein language model representations demonstrated that SAE features achieve median Pearson correlation scores of approximately 0.70 for protein-level representations, substantially outperforming raw ESM neurons (~0.50) [44]. Consistent with the results from language model comparisons, transcoders showed comparable performance to SAEs in this domain as well, with median scores ranging from 0.66 to 0.73 across layers [44]. This cross-domain application is significant for several reasons. First, it suggests that the superposition hypothesis—originally formulated in the context of toy models showing that neural networks store more features than they have dimensions when those features are sparse [24]—and the SAE solution are not idiosyncratic to language models but may reflect something general about how neural networks trained on sparse, high-dimensional data organize their representations. Second, the existence of biological ground truth—protein function annotations, structural features, evolutionary conservation—provides an independent validation criterion that is largely absent in natural language tasks, opening new avenues for evaluating whether SAE features are causally sufficient rather than merely correlational [41]. The protein domain therefore offers a rare opportunity to benchmark interpretability methods against objective biological labels, potentially resolving longstanding debates about the faithfulness of learned sparse decompositions [18].

Direct Circuit Discovery: Algorithms, Factual Recall, and World Models

While SAEs and transcoders address the decomposition problem — recovering interpretable features from superposed representations — a complementary strand of mechanistic interpretability pursues the tracing problem: identifying the specific computational pipelines through which models perform identifiable tasks. This line of work applies causal intervention methods directly to model components, asking which attention heads, MLP layers, and residual stream positions are necessary and sufficient for particular computations.

Stolfo, Belinkov, and Sachan applied causal mediation analysis to arithmetic reasoning in pretrained transformers, treating individual model components as mediators in a causal graph and measuring their contribution to correct outputs [35]. Their analysis revealed a clean division of computational labour: early MLP modules at operand token positions consistently encode the operands themselves, while mid-to-late MLP modules at the final token position encode the computed result rather than the inputs [35]. This spatial and functional specialization — operand representation upstream, result computation downstream — demonstrates that arithmetic is not performed by diffuse, whole-model computation but by identifiable subnetworks with distinct roles. Causal mediation analysis proved particularly well-suited to this investigation because it enables researchers to distinguish components that merely correlate with correct outputs from those that are causally necessary for them, a distinction that earlier attribution-based methods could not cleanly draw.
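The accounting that these intervention methods share can be illustrated with a toy stand-in for a transformer. The sketch below is not the cited analysis pipeline; it merely shows the shared logic of activation patching and causal mediation: cache an activation from a clean run, splice it into a corrupted run, and measure what fraction of the behavioral gap it restores.

```python
import numpy as np

class ToyModel:
    """Two-stage stand-in for a network: stage 0 produces an intermediate
    activation, stage 1 maps it to logits. Stage 0 can be patched."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.standard_normal((4, 8))
        self.W1 = rng.standard_normal((8, 3))

    def run(self, x, patch=None):
        h = x @ self.W0
        if patch is not None and patch[0] == 0:
            h = patch[1]              # splice in a cached activation
        cache = {0: h}
        logits = h @ self.W1
        return logits, cache

def patching_effect(model, clean, corrupted, layer, target):
    """Fraction of the clean-vs-corrupted gap in the target logit that is
    restored by patching the layer's clean activation into the corrupted run."""
    clean_logits, clean_cache = model.run(clean)
    corr_logits, _ = model.run(corrupted)
    patched_logits, _ = model.run(corrupted,
                                  patch=(layer, clean_cache[layer]))
    gap = clean_logits[target] - corr_logits[target]
    return (patched_logits[target] - corr_logits[target]) / gap
```

Patching the only intermediate activation of this toy restores clean behaviour completely (effect 1.0). In a real transformer, patching an individual head or MLP typically restores only part of the gap, and that partial restoration is precisely the mediation signal used to assign causal responsibility to components.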

Parallel work on factual association retrieval arrived at structurally similar conclusions through independent methods. Geva and colleagues applied attention knockout and MLP patching to trace how language models recall factual associations of the form “The Eiffel Tower is located in [Paris],” revealing a three-step pipeline: subject enrichment occurs at early MLP sublayers, where the representation of the subject entity is elaborated with attribute-relevant information; relation propagation then carries this enriched representation to the final sequence position; and attribute extraction is performed by upper multihead self-attention layers, which promote the target attribute from approximately rank 999 to the top-1 position in the output distribution [25]. The role assigned to attention in this final extraction step is noteworthy, as it complicates earlier accounts that emphasized MLP layers as the primary locus of factual knowledge storage — notably the influential proposal by Dai and colleagues that feed-forward network neurons function as key-value memories storing relational facts, and that suppressing identified “knowledge neurons” produces an average 29% decrease in correct output probability [45]. Geva et al.’s findings suggest instead a division of labour in which MLPs encode and enrich while attention heads select and retrieve [25]. The three-step pipeline structure itself — enrichment, propagation, extraction — echoes the multi-stage organization observed in arithmetic circuits [35], reinforcing the hypothesis that modular, sequential computation may be a general organizational motif in transformer internals rather than a domain-specific curiosity. 
This motif has been corroborated by circuit-level analyses of syntactic tasks: Wang and colleagues identified a sparse circuit of attention heads in GPT-2 Small responsible for indirect object identification, demonstrating that individual task-relevant heads can be isolated and causally validated across structurally varied prompts [14].

A distinct but complementary line of evidence concerns the geometry of the representations these circuits operate upon. Nanda, Lee, and Wattenberg investigated OthelloGPT, an autoregressive transformer trained exclusively to predict legal Othello moves with no explicit supervision on board state [46]. Despite this purely behavioral training objective, the model spontaneously develops linearly decodable representations of board state by layer four, and crucially, it encodes the board in player-relative categories (MINE, YOURS, EMPTY) rather than absolute colours (BLACK, WHITE, EMPTY) — a representational choice that generalizes more naturally across game positions [46]. This finding establishes that linearly structured world models can emerge from sequence prediction alone, without any supervision that explicitly rewards geometric regularity. The convergence between modular computation (as revealed by causal intervention) and linear representational geometry (as revealed by probing) suggests that the tools of linear algebra and causal intervention together constitute a surprisingly powerful interpretability toolkit — a point that has been articulated philosophically as well as empirically. Kästner and Crook argue that mechanistic interpretability’s power derives precisely from its capacity to combine multiple coordinated discovery strategies — intervention, probing, circuit tracing — into a unified investigative programme that transcends the limitations of any individual method applied in isolation [4].

From Feature Discovery to Safety-Relevant Intervention: Representation Engineering

The progression from feature discovery to behavioral intervention represents the most direct connection between sparse autoencoder research and AI safety. Where the preceding subsections characterize SAEs as tools for understanding model internals, a growing body of work frames mechanistic interpretability explicitly as infrastructure for ensuring that model behavior aligns with intended objectives [10]. Bereska and Gavves articulate this framing precisely: understanding what computations a model actually performs is treated not merely as a scientific goal but as a prerequisite for any credible claim about whether the model’s internal objectives align with designer intent [10]. The superposition and polysemanticity phenomena reviewed above are, on this account, not just obstacles to scientific understanding but direct threats to safety assurance — if internal representations are entangled and opaque, then hidden misalignment cannot be ruled out through behavioral testing alone [29].

The practical expression of this safety-motivated framing is representation engineering — the use of interpretability-derived knowledge about internal representations to construct targeted interventions on model behavior. The most developed technique in this space is activation steering, in which steering vectors derived from contrastive activation pairs — computed by comparing model activations on paired inputs that differ along a semantically meaningful dimension (e.g., honest versus deceptive responses) — are added to forward-pass activations at inference time to shift model behavior along the target dimension without retraining [10, 33]. This approach operationalizes the causal intervention paradigm that SAE research has developed theoretically: rather than merely observing that certain features correlate with behaviors, activation steering directly manipulates those features and measures the downstream behavioral consequences. Empirically, steering vectors have been shown to reinforce instruction adherence in models such as Phi-3, Gemma 2, and Mistral 7B across format, length, and content constraints — with vectors computed on instruction-tuned models demonstrating effective transfer to base model variants, suggesting that instruction representations generalize across model families [33]. The technique has demonstrated reliable modulation of model behavior along dimensions including honesty, refusal, and sentiment, and category-specific safety steering methods have further extended this capability by computing harm-category-distinct vectors from harmful versus harmless activation pairs, with activation pruning (retaining differences above the median L2 norm) shown to improve the safety–utility trade-off relative to naive vector application [34]. Collectively, these results position activation steering as a viable safety intervention that is faster and more transparent than fine-tuning-based approaches.
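The contrastive construction of steering vectors, and the median-norm pruning heuristic, can be sketched concretely. Everything below is synthetic: the activations are random draws standing in for paired harmful/harmless (or honest/deceptive) forward passes, and the per-pair pruning rule is one plausible reading of the heuristic reported in [34], not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d = 100, 64   # contrastive prompt pairs, hidden width (illustrative)

# Hypothetical layer-L activations for paired prompts that differ only along
# the target dimension; the positive set is assumed to carry a mean shift.
acts_pos = rng.normal(loc=0.3, size=(n_pairs, d))
acts_neg = rng.normal(loc=0.0, size=(n_pairs, d))

diffs = acts_pos - acts_neg
# Activation pruning: retain only pair differences with above-median L2 norm
# before averaging into a single steering vector.
norms = np.linalg.norm(diffs, axis=1)
steer = diffs[norms > np.median(norms)].mean(axis=0)

def apply_steering(h, alpha=1.0):
    """Inference-time intervention: add the steering vector to a forward-pass
    activation at the chosen layer; no retraining is involved."""
    return h + alpha * steer
```

The scalar `alpha` controls intervention strength, which in practice trades behavioral shift against degradation of unrelated capabilities.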

The safety implications of this capability are substantial but double-edged. On one hand, activation steering offers a mechanism for enforcing behavioral constraints — such as refusal of harmful requests or expression of calibrated uncertainty — that does not require the computational expense of fine-tuning and can be applied or removed at deployment time [34]. On the other hand, a steering intervention that suppresses a target behavior at the output level may leave the underlying computational substrate that generated it intact. This distinction is consequential for the problem of deceptive alignment, in which a model internally represents a goal that diverges from its training objective but has learned to suppress its behavioral expression under conditions that resemble evaluation [10, 29]. A model whose deceptive circuits have been masked by activation steering, rather than identified and removed, has not been verified as aligned — it has merely been constrained. Mechanistic interpretability and behavioral intervention are therefore not sequential stages of a safety pipeline but must be pursued as a coupled system, with feature discovery informing intervention design and intervention results feeding back into mechanistic understanding.

Causal Frameworks for Evaluating Mechanistic Claims

The recurring question across all SAE and circuit-discovery work — whether identified features and circuits are causally sufficient explanations of model behavior or merely interpretable correlates — cannot be resolved without a formal account of what causal sufficiency means in the context of neural network computation. The tools reviewed thus far — activation patching, feature ablation, transcoder circuit extraction, activation steering, causal mediation analysis — perform interventions on model internals, but the theoretical criteria against which these interventions should be evaluated have remained largely implicit. Recent work on causal representation learning provides the formal grounding that the mechanistic interpretability program requires.

Schölkopf et al. articulated a framework for causal representation learning grounded in structural causal models (SCMs), arguing that the Independent Causal Mechanism (ICM) principle — which holds that autonomous causal mechanisms neither inform nor influence one another — provides a principled basis for decomposing complex systems into modular, independently manipulable components [47]. The ICM principle carries a direct implication for circuit discovery: a feature or circuit qualifies as a genuine causal component of a model’s computation to the extent that it can be intervened upon independently of other components, with predictable and localized downstream effects. This criterion goes beyond mere correlation or even predictive accuracy; it demands that the hypothesized mechanism be separately manipulable in a way that produces the expected change without disrupting unrelated computations. When activation patching on SAE features produces targeted, predictable changes in model output — as demonstrated for indirect object identification [18] — or when causal mediation analysis isolates the specific MLP layers responsible for operand encoding versus result computation [35] — the results are consistent with this interventionist criterion. But consistency is not proof: the intervention may operate through correlated pathways rather than the hypothesized mechanism, and as Schölkopf et al. emphasize, distribution shifts caused by interventions can expose failures of purely associative models that perform well under observational data [47]. Analogously, an SAE feature that appears interpretable under normal forward passes may fail to behave as a genuine causal unit when subjected to targeted ablation or amplification in novel or out-of-distribution contexts.
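The interventionist criterion can be made concrete with a minimal activation-patching sketch on a toy network (random weights, hypothetical "clean" and "corrupted" inputs): copy one hidden activation from the clean run into the corrupted run's forward pass, and measure the localized downstream effect.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def forward(x, patch=None):
    """Toy 2-layer net; `patch` = (unit_index, value) overrides one hidden
    activation mid-forward-pass, as in activation patching."""
    h = np.maximum(x @ W1, 0)
    if patch is not None:
        i, v = patch
        h = h.copy()
        h[i] = v
    return h @ W2

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)
h_clean = np.maximum(x_clean @ W1, 0)

baseline = forward(x_corrupt)
patched = forward(x_corrupt, patch=(3, h_clean[3]))
# the output shift attributable to unit 3 carrying clean-run information:
delta = np.linalg.norm(patched - baseline)
```

In this linear readout the effect decomposes exactly: the shift equals the activation difference on the patched unit times the norm of that unit's output weights, which is the kind of predictable, localized intervention effect the ICM criterion demands.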

This epistemological concern has been given systematic philosophical treatment. Kästner and Crook argue that earlier explainability approaches adopted a “divide-and-conquer” strategy — decomposing model behavior into locally interpretable fragments — that is structurally incapable of delivering the generalizable understanding required for safety and trust [4]. Mechanistic interpretability’s alternative — what they term a “coordinated discovery” strategy oriented toward emergent functional structure — represents a qualitatively different epistemological programme, one that seeks to uncover how circuits interact to produce behavior rather than merely cataloguing individual components [4]. This philosophical framing clarifies why the ICM principle is the appropriate evaluative standard: it demands precisely the kind of modular, independently manipulable decomposition that coordinated discovery aims to produce.

The urgency of this formal grounding is underscored by results from computability theory. De Melo and colleagues have formally demonstrated that the inner alignment problem — the question of whether a trained model has genuinely adopted the intended objective rather than a proxy — is undecidable for arbitrary AI models, established via reduction to Rice’s Theorem and the Halting Problem [30]. This result places a hard ceiling on what post-hoc behavioral evaluation can achieve: no general algorithm can determine whether an arbitrary model is aligned. The constructive implication, however, is that alignment can be guaranteed for a countable set of architecturally constrained systems built compositionally from provably aligned base operations [30]. Mechanistic interpretability participates in this formal agenda by seeking to make internal computations sufficiently transparent that alignment properties become inspectable, even when they cannot be formally verified for arbitrary systems. The causal frameworks reviewed here provide the criteria against which such inspections should be evaluated.

A closely related methodological concern is highlighted by the broader literature on explanation methods. Samek et al. demonstrated that different attribution techniques — including occlusion, integrated gradients [7], and layer-wise relevance propagation [6] — can yield meaningfully different explanations for the same model and input, and that these explanations vary in faithfulness to the model’s actual decision process [1]. Their finding that explanation methods can reveal “Clever Hans” effects — where models exploit spurious correlations rather than meaningful features — underscores the danger of treating any decomposition method’s outputs as automatically reflecting genuine computational structure [1]. The same caution applies to SAE features: without interventionist validation, there is no guarantee that a feature identified as “interpretable” by automated scoring corresponds to a mechanism the model actually uses, rather than a spurious regularity that happens to be legible. Harradon et al. similarly highlight this risk, showing that causal explanation of deep networks via autoencoded activations requires explicit interventionist grounding to move beyond associative descriptions [48].
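Among the attribution methods Samek et al. compare, integrated gradients [7] has the useful property of being checkable against a formal axiom: attributions must sum to the difference in model output between the input and the baseline (completeness). The sketch below uses a hypothetical toy function rather than a real network, approximating the path integral with a midpoint Riemann sum.

```python
import numpy as np

def f(x):
    # hypothetical toy model standing in for a trained network
    return np.tanh(x[0] * x[1]) + x[2] ** 2

def grad_f(x):
    s = 1 - np.tanh(x[0] * x[1]) ** 2
    return np.array([s * x[1], s * x[0], 2 * x[2]])

def integrated_gradients(x, baseline, steps=1000):
    """Attribution_i = (x_i - b_i) * integral of df/dx_i along the straight
    path from baseline to x (midpoint Riemann approximation)."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

x, b = np.array([1.0, 2.0, 3.0]), np.zeros(3)
attr = integrated_gradients(x, b)
# completeness axiom: attr.sum() should approximate f(x) - f(b)
```

Completeness is a necessary but not sufficient faithfulness check, which is precisely why different axiomatically valid methods can still disagree on the same input.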

The need for such formal grounding is reinforced by the foundational distinction between explanatory and predictive modeling articulated by Shmueli [49], who argues that inferring explanatory power from predictive accuracy is a methodological error documented across multiple fields. In the interpretability context, this caution applies directly: high reconstruction fidelity of an SAE, or high classification accuracy of a probing classifier, does not by itself establish that the discovered features causally explain model computation. Explanatory adequacy requires interventionist evidence — showing that manipulating the putative cause produces the predicted effect — rather than merely demonstrating that features correlate with or predict behavior. The integration of causal reasoning with deep learning — what has been termed “deep causal learning” [50] — remains an active research frontier, but its relevance to mechanistic interpretability is direct: the field needs formal methods for distinguishing features that participate in a model’s causal graph from features that are epiphenomenal artifacts of the decomposition method.

These frameworks collectively supply the evaluative criteria that the SAE and circuit-discovery literature has thus far lacked. A fully rigorous mechanistic interpretability claim would require demonstrating not only that a feature is interpretable and that intervening on it changes model output, but that the intervention satisfies the structural requirements of an SCM-grounded causal account: independence of mechanisms, predictability of intervention effects across contexts, and robustness to distribution shift. No current study in the SAE literature meets this standard in full, though the activation patching paradigm [18], transcoder circuit extraction [19], causal mediation analysis [35], and activation steering [10] represent meaningful steps in that direction. Formalizing these criteria and developing scalable methods for testing them is among the most consequential theoretical tasks facing the field.

Limitations and Open Questions

Despite rapid progress, several structural limitations remain unresolved. The causal sufficiency of SAE-discovered features — whether they explain model behavior or merely correlate with it — is under-explored across all applications [44, 19], and as the preceding discussion of causal frameworks makes clear, the field currently lacks standardized interventionist benchmarks against which to evaluate such claims. The case studies that have yielded the cleanest circuit discoveries — modular arithmetic [35], factual associations [25], board games [46] — are deliberately constrained tasks where ground truth is available and models are small enough to exhaustively analyze; whether the clean modular circuits and linear representations identified in these settings persist, or fragment into something less interpretable, at the scale of frontier language models remains an open empirical question. Systematic evaluation of SAE methods on vision, multimodal, and reinforcement learning models is largely absent, leaving open the question of how broadly the superposition framework applies [42]. As SAEs scale to millions of features, the scalability of human-in-the-loop evaluation becomes acute: automated scoring methods [18] help but introduce their own validity concerns. The translation from feature discovery to reliable safety intervention through representation engineering remains incompletely characterized — activation steering can modulate behavior, but whether such interventions are robust to adversarial pressure and distributional shift is largely untested [10]. Finally, a theoretical account of when and why SAE features correspond to “true” computational features rather than training artifacts remains underdeveloped [24, 42] — a gap that the structural causal model framework [47] and the formal undecidability results [30] help to articulate precisely, even if they do not yet resolve.
Bridging the distance between the interventionist criteria that causal theory demands and the empirical methods currently available is among the most pressing open problems as the field scales toward frontier models.

4. Training Dynamics, Developmental Interpretability, and Learning Phase Transitions

The preceding section examined what circuits and features exist in trained neural networks and how they can be identified through decomposition and causal intervention. A complementary and increasingly consequential question asks how and when those internal structures form during the training process itself. This question defines the emerging research program of developmental interpretability, which treats the training trajectory — not just the final trained model — as the primary object of scientific inquiry. The findings reviewed in this section converge on a conclusion with significant implications for both scientific understanding and safety evaluation: deep network learning is fundamentally non-linear, characterized by discrete organizational events rather than uniformly smooth optimization, and the timing and character of these events are interpretable, measurable, and potentially controllable.

Grokking and the Mechanistic Anatomy of Delayed Generalization

The phenomenon of grokking — in which a model abruptly generalizes long after achieving near-perfect training accuracy — initially appeared to exemplify the opacity of deep learning. Early treatments characterized it as an essentially binary event: a network was either memorizing or generalizing, with the transition between them poorly understood. Nanda, Chan, Lieberum, and colleagues fundamentally revised this picture through detailed mechanistic scrutiny of small transformers trained on modular addition [16]. Their analysis revealed that the apparent discontinuity is an artifact of surface measurement. Beneath the loss curves, training proceeds through three continuous and causally distinct phases: memorization, circuit formation, and cleanup. More strikingly, the generalizing circuit is not an abstract computational pattern but a specific mathematical algorithm: the model implements Fourier multiplication using trigonometric identities, concentrating its computation at a small set of key frequencies [16]. This mechanistic account is further contextualized by the broader study of how training trajectories evolve across model scales, where similar phase-like transitions in internal representations have been documented [17].
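The Fourier multiplication algorithm can be reproduced directly. The sketch below computes logits for modular addition the way the grokked network does: represent each operand by its cosine and sine at a handful of key frequencies, combine them via the trigonometric addition identities, and read off a logit for each candidate answer c that peaks exactly at (a + b) mod p. The specific frequencies here are illustrative placeholders; in [16] they are learned by the model.

```python
import numpy as np

p = 113                       # the modulus used in the grokking setting of [16]
key_freqs = [3, 17, 42]       # illustrative key frequencies (the real ones are learned)

def fourier_logits(a, b):
    """Modular-addition logits via Fourier multiplication: trig identities
    produce cos/sin of w*(a+b), and the readout cos(w*(a+b-c)) is maximised
    at the correct answer c = (a + b) mod p."""
    c = np.arange(p)
    logits = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        # cosine/sine addition identities implement the "multiplication" step
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits
```

Because p is prime and the key frequencies are coprime to it, every frequency's contribution peaks simultaneously only at the correct residue, so even three frequencies suffice for an unambiguous argmax in this sketch.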

This mechanistic decomposition transforms an opaque emergent event into a tractable causal sequence. The critical methodological insight is that internal progress measures — tracking Fourier coefficient growth, circuit exclusivity, and weight norms — reveal the formation of the generalizing circuit well before it becomes behaviorally dominant. A model can simultaneously maintain a generalizing algorithm in latent form while still exhibiting memorization behavior externally, meaning that behavioral metrics systematically lag behind structural ones [16]. This finding carries direct implications for safety evaluation: if capabilities can be latent in a model’s circuit structure before they manifest in observable behavior, then monitoring training dynamics at the circuit level may detect emerging capabilities — including potentially dangerous ones — earlier than behavioral benchmarks alone [10]. The challenge of reliable capability monitoring has been highlighted as a core priority in AI safety research, precisely because behavioral evaluations may fail to surface latent competencies in time for meaningful intervention [15].

The three-phase structure identified in grokking also resonates with the circuit discoveries reviewed in Section 3. The modular, algorithm-implementing circuits that Nanda et al. uncovered in the grokking setting — Fourier multiplication at specific key frequencies — share the organizational logic of the arithmetic circuits identified by causal mediation analysis [35] and the factual recall pipelines traced through intervention [25]. Brain-inspired modular training approaches have further corroborated that networks tend to consolidate computation into functionally specialized subnetworks under the right training conditions [38]. In all cases, the model consolidates computation into identifiable, functionally specialized subnetworks. What the grokking work adds is the temporal dimension: these circuits do not appear fully formed but crystallize progressively through ordered developmental stages, with memorization structures being gradually pruned as generalizing circuits strengthen.

Perplexity as a Universal Developmental Clock

If grokking establishes that training has internal phase structure at the level of individual circuits, the complementary question is whether such structure persists at the scale of large language models trained on naturalistic data. Xia, Artetxe, Zhou, and colleagues addressed this through systematic empirical analysis of intermediate checkpoints spanning OPT models from 125M to 175B parameters, categorizing tokens according to whether their perplexity follows stagnating, upward-trending, or downward-trending trajectories across training [17]. The central finding is both striking and theoretically clarifying: perplexity, rather than model size or raw compute, is the dominant predictor of the model’s current developmental stage and behavioral profile. Models of vastly different scales, when matched by perplexity, exhibit comparable token-level learning patterns, suggesting they traverse a shared developmental curriculum at equivalent points along a common perplexity axis [17]. This result is consistent with broader work on learning curves in deep learning, which similarly documents that training progression can be characterized by structured, stage-like dynamics that persist across architectural and dataset variation [37].
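The token-trajectory categorization can be sketched as a simple slope test on log-perplexity across checkpoints. The classifier and its tolerance threshold below are illustrative stand-ins, not the analysis pipeline of [17].

```python
import numpy as np

def classify_token_trajectory(ppl, flat_tol=0.05):
    """Classify a token's perplexity trajectory across training checkpoints.

    ppl: perplexities at successive checkpoints. A least-squares slope is fit
    to log-perplexity; slopes within +/- flat_tol (roughly 5% change per
    checkpoint, an assumed threshold) count as stagnating.
    """
    x = np.arange(len(ppl))
    slope = np.polyfit(x, np.log(ppl), 1)[0]
    if slope < -flat_tol:
        return "downward"
    if slope > flat_tol:
        return "upward"
    return "stagnating"
```

Matching models of different sizes by perplexity, and then comparing the distribution of tokens across these three categories, is the operation that reveals the shared developmental curriculum.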

This scale-invariance in learning order directly extends the phase-structured view established by the grokking work to naturalistic training at scale [16]. Rather than each model scale following idiosyncratic developmental trajectories, the evidence points toward a universal ordering of learned phenomena — a staged curriculum in which certain linguistic regularities, token types, and distributional properties are reliably acquired before others, regardless of model capacity. Complementary work on how BERT-style models rediscover classical NLP pipeline stages during training — moving from surface morphological features toward syntax and then semantics — further substantiates the idea that neural language models follow a principled, hierarchical acquisition order [12]. Xia et al. further documented a qualitatively important divergence between large and small models in how they handle unnatural or noisy text: larger models navigate these distributions more gracefully, while smaller models become trapped in suboptimal attractors that favor grammatical but hallucinated outputs [17]. This finding has direct relevance to understanding hallucination as a developmental phenomenon rather than merely a behavioral one: grammatical fluency and factual reliability can decouple across scales and training stages [31], implying that a model’s propensity to hallucinate may be diagnosable from its position on the developmental trajectory rather than solely from its outputs.

The proposal that perplexity functions as a universal developmental clock has conceptual resonance with the three-phase structure identified in grokking research [16]. In both cases, the insight is that training is not a monotonically smooth optimization but a structured progression with identifiable waypoints, and that the right measurement frame — internal circuit progress measures in one case, perplexity-matched behavioral profiles in the other — reveals regularities that raw performance metrics obscure [16, 17].

Mathematical Foundations: Token Clustering and Consensus Dynamics

The empirical and mechanistic observations of staged learning gain formal grounding through the mathematical analysis of Geshkovski, Letrouit, Polyanskiy, and colleagues, who model transformer self-attention as a mean-field interacting particle system in which tokens evolve on the unit sphere under attention-driven dynamics [51]. In this formulation, each token in a sequence is a particle whose trajectory through the network is governed by self-attention, with the softmax operation serving as an interaction kernel that pulls particles toward one another according to their mutual similarity. The mathematical scaffolding — drawn from continuity equations and gradient flows over probability measure spaces — yields a precise, two-timescale picture of token dynamics. At early layers, tokens undergo rapid coalescence into discrete clusters: semantically similar tokens are drawn together quickly, collapsing local variability while preserving inter-cluster structure. At later layers, these clusters interact through a much slower pairwise merging process that ultimately drives all tokens toward a single consensus point [51].
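A crude numerical discretization makes the particle picture tangible. The sketch below is not the continuous gradient flow analyzed in [51] but a toy iteration under assumed parameters: tokens start as random points on the unit sphere, each step pulls every token toward its softmax-weighted mean, and re-projection keeps them on the sphere. The representation matrix collapses toward rank one as the tokens coalesce.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, dt = 32, 3, 1.0, 0.1   # tokens, dimension, inverse temperature, step size

# tokens as interacting particles on the unit sphere
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

def step(x):
    # softmax interaction kernel: attention pulls mutually similar tokens together
    A = np.exp(beta * (x @ x.T))
    A /= A.sum(axis=1, keepdims=True)
    v = A @ x                                   # attention-weighted mean (drift)
    x = x + dt * (v - x)                        # move toward the weighted mean...
    return x / np.linalg.norm(x, axis=1, keepdims=True)   # ...and re-project

for _ in range(2000):
    x = step(x)

# rank collapse: the token matrix converges toward (numerical) rank one
s = np.linalg.svd(x, compute_uv=False)
```

With sharper attention (larger `beta`) the same iteration first forms tight local clusters that then merge only slowly, a discrete echo of the fast-coalescence/slow-merging two-timescale structure derived formally.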

This theoretical prediction provides the first mathematically rigorous account of two phenomena that had long troubled practitioners: rank collapse, in which the effective dimensionality of the token representation matrix decreases monotonically over depth as the attention map converges toward a low-rank fixed point, and oversmoothing, in which deep transformers produce nearly identical representations for all input tokens regardless of their original content [51]. Crucially, Geshkovski et al. show that these pathologies are not architectural accidents but are in a precise sense inevitable consequences of the gradient flow structure of self-attention, at least in the absence of counteracting mechanisms such as skip connections or layer normalization [51]. The mechanistic interpretability work on specialized attention heads further corroborates this picture: empirical pruning studies have demonstrated that a small fraction of heads carry the overwhelming burden of structured computation, consistent with the theoretical prediction that attention dynamics progressively concentrate representational mass [13].

The two-timescale structure derived formally exhibits a suggestive parallel to the phase decompositions identified in grokking research and the scale-invariant training trajectories documented empirically. Nanda et al.’s mechanistic analysis of grokking, for instance, identifies a rapid early phase of memorization followed by a qualitatively distinct, much slower phase of circuit formation and generalization — a temporal profile that mirrors the fast-coalescence/slow-merging duality in the particle-system account [16]. In all three cases — formal dynamics, grokking, and empirical staged learning — the developmental arc is characterized by an initial rapid structural consolidation followed by slower refinement or convergence processes [16, 17, 51]. Whether this parallel reflects deep shared principles or is partly coincidental remains an open question, but the recurrence of this motif across mechanistic, empirical, and mathematical registers is notable. The formal framework also generates a productive tension with the circuit discovery findings reviewed in Section 3: the particle-system analysis predicts that under pure self-attention dynamics, token representations must eventually collapse to consensus — destroying precisely the linear, semantically structured geometry that emerges spontaneously in models like OthelloGPT [46]. This juxtaposition raises the question of what architectural features — skip connections, layer normalization, feed-forward sublayers — interrupt or slow the gradient flow toward consensus, preserving structured representations long enough for them to be computationally useful.

Implications for Safety and Evaluation

The developmental interpretability framework carries direct implications for the safety concerns discussed later in this review. If generalization corresponds to identifiable structural changes — the consolidation of sparse, interpretable circuits and the pruning of memorization structures [16] — then monitoring circuit formation during training becomes a viable strategy for detecting whether a model has learned the right solution for the right reasons, rather than merely achieving the right outputs through potentially fragile heuristics [10]. The finding that behavioral metrics lag behind structural ones [16] is particularly consequential for alignment evaluation: a model that appears behaviorally aligned at a given training checkpoint may already be developing internal circuits that will eventually support misaligned behavior, or conversely, may contain latent generalizing circuits whose beneficial properties are not yet externally visible [29]. This structural-behavioral dissociation also undermines the assumption that passing behavioral benchmarks constitutes sufficient evidence of robust, well-founded generalization [10]. The scale-invariant developmental curriculum documented by Xia et al. [17] further suggests that capability emergence at scale is not unpredictable but follows structured trajectories that, with appropriate internal monitoring [52], could in principle be anticipated and evaluated before capabilities become externally manifest.

Limitations and Open Questions

Several significant limitations constrain the current state of developmental interpretability research. First, the mechanistic decomposition of grokking has been demonstrated primarily on modular arithmetic tasks in small transformers [16]; whether the three-phase structure generalizes to richer capabilities — such as multi-step reasoning, factual recall, or social reasoning — in larger models remains an active frontier [53, 10]. Second, the formal mathematical results of Geshkovski et al. [51] are established for idealized settings, and extending them to stochastic gradient training on naturalistic data distributions remains non-trivial. Third, the scale-invariant curriculum documented by Xia et al. [17] has not yet been connected to specific circuit-level changes; the relationship between perplexity-matched developmental stages and the formation or dissolution of identifiable circuits constitutes an important open question. Finally, the field currently lacks scalable, semi-automated pipelines capable of tracking circuit formation longitudinally across training at the scale of frontier models [36, 10]. Existing automated circuit-discovery methods remain largely confined to small models and narrow tasks, making developmental interpretability a labor-intensive research activity rather than an operational safety tool.

Taken together, the evidence reviewed in this section establishes a coherent picture: deep network learning proceeds not through smooth, uniform optimization but through discrete, structured organizational events — phase transitions in which specific circuits crystallize, memorization structures are pruned, and representational geometry consolidates [37]. These structural changes are detectable through internal metrics that reliably precede any corresponding shift in behavioral output [16]. This temporal priority of structural change over behavioral expression gives developmental interpretability its distinctive character and practical significance. While the circuit-analysis methods reviewed in Section 3 offer a static anatomical view of what a trained model contains, developmental interpretability provides a dynamic, longitudinal complement — tracking not only which circuits exist but when and how they form, and whether the developmental pathway itself bears on the reliability and safety of the resulting model. These themes — the relationship between internal structure, training trajectory, and alignment — are taken up more directly in the Discussion, where the convergent evidence from circuit analysis, developmental dynamics, and probing studies is assessed for what it collectively implies about the prospects and limits of mechanistic approaches to AI safety.

5. Probing and Empirical Analysis of Language Model Internals

Empirical investigation into what language models learn and how they organise information internally has matured considerably since the mid-2010s, moving from qualitative observations of neural network behaviour toward rigorous, comparative, and increasingly mechanistic accounts of representational structure. Early work on convolutional networks laid conceptual groundwork, probing studies of encoder-based transformers then mapped linguistic organisation with precision, and more recent dynamical methods have opened new avenues for comparing computation across architectures. Taken together, these lines of inquiry address a deceptively simple question: what, concretely, do neural networks know, and where do they store it?

Emergent Concept Detectors: Early Evidence from Convolutional Networks

Before transformer-based language models dominated empirical probing research, foundational work on convolutional neural networks (CNNs) established that intermediate layers could develop interpretable, structured representations without explicit supervision. [5] introduced two complementary visualisation tools — one generating preferred stimuli for individual neurons via regularised gradient ascent, the other visualising live activations in response to natural images — and demonstrated that intermediate convolutional layers develop specialised detectors for concepts including faces, text, and flowers, even when those concepts were never presented as training labels. This finding was subsequently deepened by [3], who showed through seven complementary validation methods that early layers of InceptionV1 contain curve detectors responding to oriented lines, mid-level layers contain high-low frequency detectors sensitive to texture boundaries, and later layers contain pose-invariant object-part detectors — all emerging from gradient descent without explicit supervision. Crucially, [5] also observed that higher fully connected layers exhibit unexpected sensitivity to small or trivially modified inputs, while lower convolutional layers remain comparatively robust — a finding that foreshadowed later debates about where and how robustly meaning is encoded in deep networks. This work established the now-standard intuition that neural representations are neither uniform across depth nor merely a product of explicit training signal, a principle that would guide subsequent probing methodology in NLP.
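For readers unfamiliar with the mechanics, the regularised gradient-ascent technique for synthesising preferred stimuli can be sketched in miniature. The example below is a toy: it assumes a single linear neuron so that the optimum has a closed form, whereas real feature visualisation optimises an image through a full CNN with image-space regularisers.

```python
import numpy as np

# Toy sketch of activation maximisation, assuming a single linear
# "neuron" a(x) = w.x and an L2 regulariser; real feature visualisation
# runs gradient ascent through a full CNN with image-space regularisers.

rng = np.random.default_rng(0)
w = rng.normal(size=8)         # the neuron's weight vector
lam = 0.5                      # regularisation strength

x = np.zeros(8)                # start from a blank "image"
lr = 0.1
for _ in range(500):
    grad = w - 2 * lam * x     # gradient of w @ x - lam * ||x||^2
    x = x + lr * grad          # regularised gradient ascent step

# The preferred stimulus converges to w / (2 * lam): a scaled copy of
# the neuron's own weights, i.e. the input it responds to most strongly.
print(np.allclose(x, w / (2 * lam), atol=1e-6))  # → True
```

The regulariser is what keeps the optimisation from diverging toward arbitrarily large inputs; in image models the same role is played by penalties and transformations that bias the result toward natural-looking stimuli.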

Bidirectional Language Models and Hierarchical Linguistic Organisation

The conceptual move from visualising convolutional feature detectors to probing linguistic representations in contextual word embeddings was formalised around 2018, when systematic comparisons of different bidirectional language model (BiLM) architectures became feasible. [21] conducted a controlled comparison of three BiLM architectures — LSTM, Transformer, and Gated CNN — trained on the same data, evaluating their contextual representations against GloVe baselines across four downstream NLP benchmarks. All three architectures substantially outperformed static GloVe vectors, with relative improvements of 13–25%, but the more consequential finding was architectural: regardless of the underlying model family, BiLMs encode a consistent hierarchical linguistic structure in which the word embedding layer captures morphological information, lower contextual layers encode local syntactic patterns, and upper contextual layers represent more abstract semantic properties such as coreference [21]. The convergence of this hierarchy across LSTM, Transformer, and CNN architectures suggests the organisation reflects something about the inductive structure of language itself, not merely the architectural biases of any single model family. This cross-architectural consistency would prove an important baseline against which BERT-specific findings were subsequently evaluated. Subsequent probing studies confirmed and extended this picture: [11] showed that BERT’s twelve Transformer layers reproduce the same surface-to-semantic ordering — surface features in lower layers, syntactic information in middle layers, and semantic properties in upper layers — and further demonstrated that BERT’s representations are best approximated by tree-structured rather than linear compositional schemes, suggesting that attention-based models implicitly recover classical syntactic hierarchies. 
[12] corroborated this using edge probing on eight pipeline tasks, finding a consistent ordering from part-of-speech tagging resolved earliest through to coreference at the highest layers, though with the nuance that individual examples can resolve linguistic phenomena out of sequence when high-level semantic context disambiguates lower-level syntactic decisions.

BERT and the Classical NLP Pipeline

The publication of BERT in 2018 prompted a wave of probing studies seeking to characterise what a large pretrained transformer encoder learns. This line of inquiry had important predecessors: Peters et al. [21] demonstrated using ELMo-style bidirectional language models that contextualised representations encode a graded linguistic hierarchy — morphological information in the embedding layer, local syntactic patterns in lower contextual layers, and longer-range semantic content such as coreference near the top — establishing the methodological template of layer-wise probing that BERT studies would inherit. Two influential studies of BERT itself, published concurrently in 2019, reached broadly compatible but meaningfully different conclusions about the degree to which linguistic information is localised versus distributed across BERT’s layers.

[12] employed edge probing tasks — minimal classifiers trained on contextualised span representations — aligned to the traditional NLP pipeline components: part-of-speech tagging, constituent parsing, dependency parsing, semantic role labelling, and coreference resolution. Their central claim is that BERT “rediscovers” this classical pipeline: part-of-speech information is encoded earliest, followed by constituent structure, dependencies, semantic roles, and finally coreference at the highest layers. Syntactic information, in this account, concentrates in relatively specific layers, while semantic information distributes more broadly across the network [12]. The analogy to the NLP pipeline is deliberate and productive: it frames BERT’s representational hierarchy as recapitulating decades of task-ordered linguistic reasoning.

[11] reached a broadly similar conclusion but characterised the gradient differently. Using complementary experimental approaches — analysis of phrasal syntax through representational similarity analysis, probing classifiers, and visualisation of sentence embeddings — they found that lower layers capture phrase-level surface information effectively, middle layers encode syntactic structure most strongly, and upper layers are most sensitive to semantic content [11]. Consistent with Peters et al.’s earlier findings on biLMs [21], this pattern holds across architectural choices, suggesting it reflects a general property of deep language models trained on next-token prediction objectives. Where [12] emphasise the concentration of syntactic information in specific layers, [11] describe the transition as more gradual, with information about each level diluting rather than disappearing as depth increases. This tension — specialisation versus gradual distribution — remains unresolved and reflects in part the difficulty of attributing representational responsibility to individual layers when information is distributed and layers interact through attention. Complementary work on attention heads has further complicated the picture, showing that individual heads within a single layer can specialise for distinct syntactic functions [13], adding a within-layer dimension of functional organisation that layer-level probing alone cannot capture.

Methodological Debates: What Probing Probes

The proliferation of probing studies has also sharpened methodological scepticism. A central concern is that high classification accuracy from a probing classifier may reflect the expressiveness of the probe itself rather than genuine encoding of a linguistic property in the model’s representations. If a sufficiently powerful probe can extract syntactic structure from almost any distributed representation, then probe accuracy is a weak signal of targeted encoding [21]. This concern applies to [12] and [11] alike: both use linear or near-linear classifiers, yet even such simple probes leave open the fundamental question of what successful probing implies. Neither study provides a definitive resolution; rather, they acknowledge the limitation and appeal to control conditions and cross-layer comparison as partial mitigations.
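The probe-expressiveness worry, and the control conditions used to mitigate it, can be made concrete with a toy experiment: train the same linear probe on real and on shuffled labels, and treat the gap between the two accuracies as the evidence of interest. The sketch below uses synthetic "activations" in which a binary property is linearly encoded; the data and setup are illustrative, not drawn from any cited study.

```python
import numpy as np

# Minimal probing sketch with a shuffled-label control, assuming
# synthetic "activations" in which a binary property is linearly
# encoded along one direction; illustrative only, not drawn from
# any cited study.

rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)            # binary linguistic property
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * (2 * y - 1)              # encode the property along dim 0

def train_probe(X, y, steps=300, lr=0.1):
    # plain logistic regression fitted by gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

acc = np.mean((X @ train_probe(X, y) > 0) == y)

y_shuf = rng.permutation(y)               # control: destroy the signal
acc_control = np.mean((X @ train_probe(X, y_shuf) > 0) == y_shuf)

# A large gap between the two accuracies is the evidence of interest;
# a high control accuracy would indicate an over-expressive probe.
print(f"probe: {acc:.2f}  shuffled-label control: {acc_control:.2f}")
```

With a nonlinear probe of sufficient capacity, the control accuracy would climb toward the real-label accuracy, which is precisely why probe simplicity and control tasks matter.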

Recent work has subjected this limitation to more rigorous scrutiny. Kunz demonstrated that probing classifiers may achieve high accuracy not because they extract explicit linguistic structure from neural representations, but because they exploit linear contextual features incidentally present in the training data [39]. If the probe’s success is attributable to surface statistical regularities rather than genuine encoding, then the interpretive conclusions drawn from probing experiments are substantially undermined. A related line of work has proposed information-theoretic and minimum-description-length frameworks as more principled alternatives to accuracy-based probing evaluation, seeking to distinguish probes that compress representations efficiently from those that merely overfit to incidental regularities [39]. Kunz further examined self-rationalization — in which models generate natural language explanations of their own outputs — and found that human-rated explanation quality did not reliably predict downstream model performance, with generated explanations frequently containing incomplete or formally inadequate reasoning chains [39]. Together, these findings suggest that the two most accessible windows into LLM internals — probing and self-explanation — are both methodologically fragile in ways that the field has not sufficiently acknowledged.

The debate over probing methodology remains active and has implications for interpreting all findings reviewed here: reported layer-wise hierarchies should be understood as findings about extractability rather than unambiguous claims about representational structure per se [2]. This concern is directly relevant to the safety applications of probing discussed later in this review: if probing classifiers are used to detect deceptive alignment or misaligned internal representations, the distinction between what is extractable and what is causally operative becomes a matter of practical consequence, not merely methodological refinement [10, 39]. Causal representation learning frameworks, which demand that extracted features be shown to intervene on model behaviour rather than merely correlate with it, offer one principled path forward [47, 48].

Dynamical Similarity Analysis: Beyond Geometric Comparison

A different empirical tradition addresses not what representations encode but how computations unfold over time — a question particularly salient for recurrent models and, increasingly, for transformers understood as dynamical systems [51]. [54] introduced Dynamical Similarity Analysis (DSA), which combines Koopman operator theory with Dynamic Mode Decomposition to compare the temporal structure of neural computations rather than the geometry of activation spaces at a single time step. Koopman operator theory provides a principled framework for lifting nonlinear dynamics into a linear space where they become analytically tractable [55], making it a natural foundation for cross-network comparisons of temporal computation. The key demonstration is that DSA successfully distinguishes computationally equivalent networks where geometric methods fail: networks with different architectures or activation functions but equivalent attractor topology are correctly identified as similar by DSA, while networks with geometrically indistinguishable activation trajectories but distinct underlying dynamics — for example, networks implementing different decision-making strategies — are correctly distinguished [54]. This represents a conceptual advance over representational similarity analysis [56] and related geometric methods [57], which compare pairwise distances between activation patterns across stimuli or inputs and are therefore insensitive to temporal dynamics by construction. However, DSA has not yet been applied at scale to large language models; the demonstrations in [54] target smaller, more controlled neural circuit models, and scaling the Koopman-based framework to billion-parameter transformers remains an open engineering and conceptual challenge.
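The Dynamic Mode Decomposition step at the core of this approach can be illustrated on toy linear systems: fit a linear operator to successive states and compare its spectrum. This is a deliberately simplified sketch; the full DSA pipeline first lifts nonlinear trajectories into a delay-embedded, Koopman-style space and compares the fitted operators with a Procrustes-style metric.

```python
import numpy as np

# Toy sketch of the Dynamic Mode Decomposition step underlying DSA,
# assuming fully observed linear dynamics; the full method adds a
# Koopman-style delay embedding and a Procrustes-style comparison
# of the fitted operators.

def dmd_operator(traj):
    # least-squares fit of A such that x_{t+1} ≈ A x_t
    X, Y = traj[:-1].T, traj[1:].T        # columns are time steps
    return Y @ np.linalg.pinv(X)

def simulate(A, x0, steps=200):
    xs = [x0]
    for _ in range(steps):
        xs.append(A @ xs[-1])
    return np.array(xs)

theta = 0.3
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])  # oscillatory dynamics
decay = 0.9 * np.eye(2)                               # contracting dynamics

A_rot = dmd_operator(simulate(rotate, np.array([1.0, 0.0])))
A_dec = dmd_operator(simulate(decay, np.array([1.0, 0.0])))

# Eigenvalue moduli summarise the recovered dynamics: ≈ 1 for the
# norm-preserving oscillator, ≈ 0.9 for the contracting system.
print(np.abs(np.linalg.eigvals(A_rot)).max(),
      np.abs(np.linalg.eigvals(A_dec)).max())
```

The point of the toy is that the operator's spectrum captures the character of the dynamics (oscillation versus contraction) even when individual activation snapshots from the two systems would look geometrically unremarkable.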

Gaps and Outstanding Questions

Several important limitations circumscribe the current state of this literature. Most probing studies — including both BERT analyses reviewed here — target encoder-only architectures that are no longer the dominant paradigm; systematic probing of decoder-only models such as GPT-4 or Llama families at comparable depth and breadth is sparse [39], leaving open whether the NLP pipeline hierarchy generalises to autoregressive generation. The relationship between probing-discovered representations and features identified through sparse autoencoder (SAE) decomposition of model activations [18, 42] has not been systematically mapped, representing a significant gap between two complementary empirical traditions — one that is sharpening as SAE methods are extended to attention layer outputs [43] and to explicitly circuit-structured analyses [19]. The vulnerability of probing to surface statistical confounds [39] further complicates the interpretation of earlier results and motivates the development of more rigorous control conditions. DSA and related dynamical methods, though promising, remain largely untested at the scale of contemporary large language models [54]. Finally, and perhaps most consequentially for applied use, there is limited work connecting layer-wise linguistic findings to downstream task performance or failure modes: knowing that coreference information concentrates in BERT’s higher layers [12] does not yet straightforwardly predict when or why coreference resolution fails in practice.

Despite these limitations, the body of work reviewed in this section establishes a robust empirical picture that the methodological caveats qualify but do not dissolve. Across architectures — CNNs, LSTMs, Transformers, and Gated CNNs alike — neural networks trained on language or vision tasks develop hierarchical internal organisations in which progressively more abstract properties are encoded at greater depth, with morphological and surface features appearing earliest, syntactic structure in intermediate layers, and semantic content near the top [21, 12, 11]. The cross-architectural consistency of this organisation suggests it reflects genuine structure in the problem domain rather than an artefact of any particular model family. These findings should be read as claims about what information is extractable from representations at each layer, with causal and mechanistic force remaining to be established — yet even as extractability findings, they carry substantial interpretive weight. Crucially, the layer-wise linguistic organisation identified through probing complements the circuit-level decompositions discussed in Section 3 [14, 10]: where probing identifies the where of linguistic knowledge, circuit analysis begins to address the how, tracing the computational mechanisms by which that knowledge is applied. Connecting these two empirical traditions — mapping the geometry of what is encoded onto the functional circuits that use it — is among the most important open problems in mechanistic interpretability [10].

6. Neural Networks as Models of Biological Cognition and Brain Computation

The question of whether artificial neural networks genuinely model biological cognition, or merely approximate its surface statistics, has become one of the most productive and contested problems in contemporary computational neuroscience. What began as a set of exploratory analogies between connectionist architectures and neural circuits has matured into a rigorous empirical research programme, complete with standardized methodologies, cross-modal comparisons, and increasingly sharp philosophical scrutiny. The arc of this literature moves from early enthusiasm about representational correspondence, through methodological consolidation, toward a present moment defined less by accumulating confirmations than by deepening debates about what such correspondence actually means.

Deep CNNs and the Primate Ventral Visual Stream

Early systematic work establishing the brain-DNN correspondence in vision was pivotal in legitimizing the entire comparative enterprise. [56] provided an influential synthesis demonstrating that deep convolutional networks trained on object recognition tasks develop internal representational spaces strikingly similar to those of primate inferior temporal (IT) cortex. Crucially, this similarity was not uniform: it increased monotonically across the network hierarchy, with lower layers corresponding to earlier visual areas (such as V1 and V2) and deeper layers converging on IT cortex organization — a correspondence assessed using representational similarity analysis (RSA) to compare the geometry of neural population responses across both biological and artificial systems [57]. A further significant finding was that computational performance on recognition benchmarks and biological fidelity were positively correlated, suggesting that the pressures of solving visual recognition — not any deliberate biological constraint — drove convergence toward brain-like solutions [56]. This framing was consequential: it implied that biological plausibility was an emergent property of task optimization rather than a design choice.

[58] subsequently synthesized and extended this picture within a broader primer for neuroscientists, articulating three distinct roles for ANNs in neuroscience: as candidate models of complex brain functions, as representational tools for analyzing neural population activity, and as windows onto evolutionary and developmental pressures that shape neural circuits. Their treatment of deep convolutional networks reinforced the hierarchical correspondence thesis while situating it within a normative framework — the idea that networks trained under ecologically valid task demands recapitulate biological solutions because those solutions are, in some principled sense, optimal for the task [59, 60]. This representational toolkit framing has become standard in the field, though it carries assumptions that subsequent work has complicated.

Representational Similarity Analysis and Methodological Infrastructure

The methodological backbone enabling these comparisons deserves explicit attention. Representational Similarity Analysis (RSA) and neural encoding models have become the lingua franca of brain-model alignment research [56, 58]. RSA operates by comparing the geometry of representational spaces — asking whether stimuli that are represented similarly by a network are also represented similarly by a brain region — without requiring one-to-one correspondence between individual units. Encoding models regress network activations onto neural responses to assess predictive power [61, 62]. These approaches are powerful precisely because they are agnostic about implementation details, but this agnosticism is also their central limitation: representational similarity, as [60] note, does not imply mechanistic equivalence. Two systems can produce geometrically similar representations through entirely different computational processes, and the RSA framework is blind to this distinction. This point is sharpened by Ostrow et al. [54], who demonstrate that geometric methods fail to capture fundamental neural computations arising from dynamical properties such as fixed points and limit cycles — motivating their Dynamical Similarity Analysis (DSA), which compares systems at the level of dynamics rather than spatial geometry and can correctly distinguish networks with similar geometries but distinct underlying computations. This observation anchors what [60] frame as the central epistemological challenge: deep learning models may be the empirical answer to many neuroscientific pattern-matching questions while leaving the underlying mechanistic question untouched.
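Both RSA's strength and its central limitation can be demonstrated in a few lines: because RSA compares only dissimilarity structure, an orthogonally rotated copy of a response matrix scores as a perfect match, regardless of how differently the two systems might compute. The sketch below uses synthetic responses, not data from any cited study.

```python
import numpy as np

# Minimal RSA sketch on synthetic data (illustrative, not from any
# cited dataset). RSA compares dissimilarity structure only, so it
# scores an orthogonally rotated copy of a response matrix as identical.

rng = np.random.default_rng(0)
n_stim, d = 20, 10
brain = rng.normal(size=(n_stim, d))          # "neural" responses

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal map
model_same = brain @ Q                        # geometry preserved exactly
model_rand = rng.normal(size=(n_stim, d))     # unrelated system

def rdm(resp):
    # representational dissimilarity matrix: pairwise Euclidean distances
    diff = resp[:, None, :] - resp[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def rsa_score(a, b):
    # correlation of the upper-triangular RDM entries
    iu = np.triu_indices(len(a), k=1)
    return np.corrcoef(rdm(a)[iu], rdm(b)[iu])[0, 1]

print(f"rotated copy: {rsa_score(brain, model_same):.3f}")  # → 1.000
print(f"unrelated:    {rsa_score(brain, model_rand):.3f}")  # near zero
```

The invariance to rotation is exactly what makes RSA usable across systems with no unit-to-unit correspondence, and exactly why a perfect RSA score cannot, on its own, establish mechanistic equivalence.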

Transformer Language Models and Neural Language Processing

The correspondence story gained its most quantitatively striking chapter when researchers turned from vision to language. [63] reported that transformer-based models, particularly GPT-2, substantially outperformed simpler architectures across a systematic comparison of 43 ANN models tested against three large-scale neuroimaging datasets spanning fMRI and ECoG modalities. When brain-model similarity was normalized against the noise ceiling — the maximum possible predictable variance given measurement noise — GPT-2 explained nearly 100% of explainable neural variance in language-selective cortex. The robustness of this finding across imaging modalities, participant samples, and experimental paradigms made it difficult to dismiss as an artifact of any single methodological choice. Predictive processing emerged as the key organizing principle: models that were better at predicting upcoming words were better at predicting brain responses, linking computational performance to neural correspondence in a theoretically motivated way [63].
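The noise-ceiling logic can be made concrete with a simulated experiment: estimate split-half reliability from two measurement repeats, fit an encoding model, and report the model's correlation as a fraction of the best achievable score. The sketch below uses ridge regression on simulated data as a generic stand-in; it is not a reconstruction of the pipeline in [63], and real analyses estimate ceilings per recording site with more careful cross-validation.

```python
import numpy as np

# Toy sketch of noise-ceiling normalization for an encoding model,
# assuming two measurement repeats of the same response; all data is
# simulated, and this is a generic stand-in rather than any cited
# study's actual pipeline.

rng = np.random.default_rng(0)
n, d = 300, 20
feats = rng.normal(size=(n, d))                # model-derived features
signal = feats @ rng.normal(size=d)            # stimulus-driven response
rep1 = signal + rng.normal(scale=2.0, size=n)  # two noisy repeats of
rep2 = signal + rng.normal(scale=2.0, size=n)  # the same measurement

# Ridge-regression encoding model fitted on the first repeat
lam = 1.0
w = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ rep1)
pred = feats @ w

r_model = np.corrcoef(pred, rep2)[0, 1]        # score on held-out repeat
r_split = np.corrcoef(rep1, rep2)[0, 1]        # split-half reliability
r_ceiling = np.sqrt(max(r_split, 0.0))         # best achievable correlation

print(f"raw r = {r_model:.2f}, ceiling-normalized = {r_model / r_ceiling:.2f}")
```

Dividing by the ceiling is what licenses statements like "nearly 100% of explainable variance": the raw correlation is bounded by measurement noise that no model, however good, could predict.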

[64] pursued the predictive processing hypothesis through converging behavioral and neural evidence. In a paradigm involving natural speech processing, the word predictions generated by GPT-2 correlated with human behavioral predictions at r = 0.79, a remarkably high correspondence [64]. More theoretically significant was the finding that neural signals up to 800 milliseconds before word onset already encoded information about forthcoming words, demonstrating that biological language comprehension involves continuous anticipatory processing rather than purely reactive decoding [64]. This temporal finding is important: it suggests that the predictive processing principle is not merely a statistical regularity shared by humans and GPT-2 but is actively implemented in the temporal dynamics of neural computation.

Complementary work has moved beyond asking whether transformer representations predict brain activity to asking which internal computations are responsible. [62] demonstrated that the attention head transformations within BERT — the actual computations mediating information flow between words — predict language-network fMRI responses with greater layer-specificity than embeddings alone, and that the structural properties of individual attention heads (look-back distance, layer depth, syntactic specialization) map onto a coherent cortical gradient: posterior temporal cortex favors early-layer, short-range heads consistent with lower-level syntactic processing, while prefrontal and anterior temporal regions preferentially recruit heads encoding longer-range contextual dependencies [62]. Further supporting the generality of this representational hierarchy, [65] showed that a low-dimensional embedding of 100 diverse language representations — spanning word embeddings, GPT-2, and machine translation systems — reliably predicts which representations best encode specific brain regions, with angular gyrus and precuneus exhibiting the highest alignment with the deepest, most abstract model layers, recovering known hierarchical language processing anatomy with approximately 90% cross-validated accuracy [65].

Recurrent Networks, Predictive Processing, and the Limits of Correspondence

While the vision and language domains have dominated this literature, [60] broaden the scope by reviewing evidence that recurrent neural networks trained on cognitive tasks reproduce population-level dynamics in prefrontal and parietal cortex. This extension is significant because it moves brain-model comparison beyond perceptual domains into territory more closely associated with flexible, goal-directed cognition. The use of task-trained RNNs as models of prefrontal function has been independently motivated by work demonstrating that population-level geometries and temporal dynamics in such networks closely mirror those observed in primate cortex during working memory and decision-making tasks [58, 66]. However, [60] are careful not to overstate the implications: they argue that deep learning models provide genuinely explanatory theories for neuroscience, not merely predictive instruments, while simultaneously insisting on distinguishing explanation from mechanism. The value of these models, in their framing, lies in generating precise, falsifiable hypotheses about neural computation — not in claiming identity between artificial and biological processes.

The degree to which GPT-2’s predictive behavior reflects a shared algorithm with human language processing, versus a superficial behavioral similarity grounded in shared training statistics, remains genuinely unresolved [64, 63]. Critics of the high-variance-explained finding note that both GPT-2 and the human brain have been exposed to similar distributional properties of natural language, and that high correlation may therefore reflect common input statistics rather than common computational mechanisms [67, 65]. This is not a trivial objection: it implies that the impressive neural alignment scores may be inflated by shared ecology rather than shared cognition. Disentangling these two sources of alignment — statistical overlap in training data versus genuinely convergent computational strategies — remains one of the central open methodological challenges in this literature [62].

Biological Design Principles and Their Computational Parallels

Beyond representational correspondence, a growing body of work identifies specific biological design principles — operating at levels from individual synapses to large-scale brain networks — that parallel computational phenomena central to the mechanistic interpretability program. These findings extend the brain-DNN comparison from a question about representational similarity to a deeper inquiry into whether biological and artificial systems are subject to shared organizational constraints.

At the level of neural coding, the excitatory-inhibitory (E/I) balance literature provides a biological analogue to the information compression that regularization and sparsity constraints achieve in artificial networks. Zhou and Yu formalized the observation that excitatory and inhibitory synaptic inputs cancel on slow timescales to produce the irregular, Poisson-like firing observed in cortical neurons, demonstrating that coding efficiency peaks at a “sweet spot” coinciding with the transition between synchronous and asynchronous network states — a regime where information transmission is maximized and metabolic cost is minimized [68]. The analogy to the information bottleneck principle in machine learning is striking: just as sparse regularization in SAEs seeks representations that are maximally informative while discarding irrelevant variance [24], E/I balance achieves efficient coding by calibrating competing synaptic drives. This suggests that the sparsity properties exploited by the SAE methodology reviewed in Section 3 may reflect not merely a useful engineering heuristic but a substrate-independent organizational principle.

At the developmental level, Fair and colleagues established through longitudinal graph-theoretic analysis of resting-state fMRI across 210 participants aged 7 to 31 that functional brain networks undergo systematic reorganization from anatomical proximity to functional identity across development [69]. In children, regions cluster primarily by spatial adjacency; in adults, they cohere according to shared functional network membership, with short-range connections weakening and long-range functional ties strengthening [69]. This local-to-distributed developmental trajectory bears a compelling resemblance to the dynamics observed in deep learning systems, where early layers initially capture local, low-level features before higher layers abstract increasingly distributed and task-relevant representations [56] — and to the phase-structured training dynamics reviewed in Section 4, where early training produces diffuse representations that progressively specialize [16, 17]. If the transition from local to distributed organization is a signature of functional maturation in biological systems, analogous metrics might serve as diagnostics of representational development in artificial networks.

The question of how globally coherent representations can emerge from purely local learning rules is addressed by George and colleagues, whose biologically plausible Helmholtz machine model of the hippocampal-entorhinal system demonstrates that path integration — a computationally demanding spatial navigation capability — emerges spontaneously from local prediction-error minimization, with medial entorhinal cortex neurons self-organizing into ring attractors [70]. This result demonstrates that global structure can be a thermodynamic endpoint of local dynamics, without requiring the explicit global objectives that backpropagation provides. The model’s wake-sleep training regime, gated by theta-band oscillations, positions biological learning as inherently generative — a perspective that connects to the finding that autoregressive transformers trained solely on next-token prediction spontaneously develop linearly decodable world models [46]. Both results suggest that predictive learning under appropriate architectural constraints can yield structured internal representations without explicit representational supervision [60].

Finally, the cognitive psychology literature on working memory capacity identifies a hard representational bottleneck with direct implications for understanding compression in both biological and artificial systems. Cowan’s synthesis established that the true capacity limit of working memory is approximately four chunks — substantially lower than Miller’s classical estimate of seven plus or minus two — with chunking strategies serving as the primary mechanism for circumventing this constraint [71]. The concept of chunking, whereby familiar patterns are recoded into single higher-order units, is directly analogous to the compression operations central to learned embeddings and dictionary learning: both reduce the effective dimensionality of information presented to downstream processing stages. Cowan’s approximately four-chunk limit thus functions as an empirical anchor for theoretical arguments about the necessity of hierarchical compression in any system that must integrate information under capacity constraints, whether biological or computational [71].

Philosophical Dimensions: Representation and Its Discontents

These empirical debates have an underappreciated philosophical dimension. [72] offers a systematic account of representation in cognitive science, distinguishing between different senses in which a physical state can be said to carry content. This framework is relevant to the brain-model comparison literature in a way that has not been sufficiently recognized: when researchers claim that DNNs and brains share “representations,” they are typically relying on an operationalized, similarity-based notion of representation that [72]’s framework would treat as only one component of a richer concept. The question of whether representational similarity in the RSA sense implies that networks and brains share content — in the philosophically loaded sense — remains largely unaddressed in the empirical literature. This concern is compounded by work showing that geometric similarity metrics capture only a cross-sectional snapshot of neural computation; [54] argue that the temporal structure of computation in neural circuits — how representational geometry evolves dynamically — is systematically missed by standard RSA-style comparisons, pointing to an additional dimension along which the “shared representation” claim may be underspecified.

Several important gaps circumscribe current understanding. The vast majority of brain-model comparisons remain concentrated in vision and language, leaving auditory, motor, and multimodal systems comparatively unexplored [58, 60]. The causal role of prediction in both biological and artificial systems has been established correlationally but not yet through interventional paradigms that could distinguish prediction as a computational principle from prediction as an emergent statistical regularity [64]. Perhaps most notably, the mechanistic interpretability tools being developed for ANN analysis — including circuit discovery [14, 10] and sparse autoencoder decomposition — have not yet been systematically applied to neuroscience-facing questions about brain-model alignment, a gap that represents one of the most promising frontiers for future work. The biological design principles reviewed here — E/I balance [68], developmental reorganization [69], local-to-global structure emergence [70], and capacity-driven compression [71] — identify specific mechanisms whose computational analogues in artificial systems could be tested using the circuit discovery and SAE methods reviewed in Section 3. And [72]’s philosophical framework, which offers conceptual resources for precisely the representational questions at stake, has yet to be systematically connected to empirical findings in this domain. Bridging that conceptual gap may be as important as accumulating further correlational evidence.

Taken together, the findings reviewed in this section establish a picture that is simultaneously encouraging and incomplete. The brain-DNN correspondence is empirically robust in vision and language — monotonic hierarchical alignment in the ventral stream [56, 67], near-ceiling variance explanation in language-selective cortex [63, 64] — and mechanistically suggestive through the biological design principles examined above: E/I balance as a substrate-independent sparsity principle, developmental reorganization as a local-to-distributed trajectory with clear artificial analogues, and local predictive learning as a sufficient condition for globally structured representations. Yet the epistemological limitation identified by [60] and sharpened by [72]’s representational framework remains: geometric similarity between activation spaces, however striking, does not establish that biological and artificial systems arrive at their representations through equivalent computational processes [54]. Closing this gap — moving from correspondence to mechanism — will require precisely the circuit-level and SAE-based methods that the mechanistic interpretability program has begun to develop [10, 4]. The convergence of neuroscientific and mechanistic interpretability approaches therefore represents not merely a promising research frontier but arguably a necessary one, if the brain-DNN correspondence is to mature from an empirical regularity into a genuinely explanatory theory of biological and artificial cognition alike.

7. Classical and Gradient-Based Explainability Methods

Explainability research in neural networks predates the contemporary mechanistic interpretability movement by several decades, yet the foundational methods developed during this earlier period continue to shape how practitioners and researchers approach the problem of understanding model behavior. The classical tradition encompasses at least three distinguishable intellectual lineages: axiomatic attribution methods that formalize what it means to “credit” an input feature for a prediction, visualization and propagation techniques that trace information flow through network layers, and rule-extraction and causal approaches that seek to render network behavior in symbolic or manipulable form. These traditions share a common concern with faithfulness—whether an explanation accurately reflects what the model is actually doing—but differ substantially in their formal commitments, their assumptions about network architecture, and their practical utility for end users.

Rule Extraction and Early Formalism

The earliest formal work in this lineage predates the deep learning era. The SUBSET method introduced by [73] in 1994 established a template for extracting interpretable propositional rules from trained networks through structured sampling and querying. A closely related contemporaneous effort, NeuroRule [28], pursued the same goal through a three-phase pipeline of training, pruning, and rule extraction: aggressive pruning could reduce hundreds of connections to a handful while preserving classification accuracy, after which discretized hidden-unit activations were enumerated to yield concise, human-readable decision rules. By treating the neural network as a black-box oracle to be interrogated rather than a transparent system to be read, these approaches foregrounded a tension that persists throughout the field: explanations are often constructed about a model rather than from it [23, 74]. The rules extracted through such sampling and pruning procedures can be verified, inspected, and communicated to non-expert stakeholders, offering a degree of legibility that raw gradient information cannot [1]. However, the scalability of these methods to modern deep architectures—with millions to billions of parameters and continuous, high-dimensional representations—was never demonstrated, and this gap remains an outstanding limitation of the rule-extraction tradition more broadly [2, 27].
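
The discretize-and-enumerate step at the heart of this pipeline can be sketched in a few lines. The toy network, weights, and threshold below are illustrative inventions, not values from the SUBSET or NeuroRule papers; the sketch enumerates all binary inputs, discretizes the hidden activations, and emits one propositional rule per discrete hidden pattern:

```python
import numpy as np
from itertools import product

# Toy fixed-weight network: 3 binary inputs -> 2 hidden units -> 1 output.
# All weights are illustrative, not taken from SUBSET or NeuroRule.
W1 = np.array([[ 2.0, -1.0],
               [ 2.0, -1.0],
               [-3.0,  2.0]])
b1 = np.array([-1.0, -0.5])
W2 = np.array([1.5, -1.5])

def forward(x):
    h = np.tanh(x @ W1 + b1)        # continuous hidden activations
    return h, float(h @ W2)         # hidden state and raw output score

def extract_rules(threshold=0.0):
    """Enumerate all binary inputs, discretize hidden activations at
    `threshold`, and map each discrete hidden pattern to the majority
    output class, yielding one propositional rule per pattern."""
    buckets = {}
    for bits in product([0, 1], repeat=3):
        h, score = forward(np.array(bits, dtype=float))
        pattern = tuple((h > threshold).astype(int))
        buckets.setdefault(pattern, []).append(int(score > 0))
    return {p: int(np.mean(cls) >= 0.5) for p, cls in buckets.items()}

rules = extract_rules()
```

Each entry of `rules` reads as a symbolic statement of the form "if the discretized hidden state is (1, 0), predict class 1", which is exactly the kind of legible artifact the rule-extraction tradition aimed to produce.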

Axiomatic Attribution: Integrated Gradients

The theoretical ambitions of the field advanced significantly with the introduction of Integrated Gradients (IG) by [7]. Rather than proposing yet another heuristic attribution score, Sundararajan and colleagues grounded their method in a set of formal axioms: Sensitivity (if a feature differs between input and baseline and changes the output, it receives nonzero attribution), Implementation Invariance (two functionally equivalent networks must produce identical attributions), Linearity (attributions compose across linearly combined models), and Completeness (attributions sum to the difference in output between input and baseline). Notably, Implementation Invariance is violated by gradient methods that depend on discrete approximations or layer-specific propagation rules — a failure mode documented in earlier saliency approaches [6]. The authors demonstrate that IG is the unique single-path method that also satisfies Symmetry, assigning identical attributions to variables that are interchangeable in the function and take identical values [7]. This axiomatic framing was genuinely new: it moved the conversation from “which heuristic is most visually plausible?” to “which attribution rule can be formally justified?” — a shift subsequently recognized as a landmark contribution to interpretability methodology [1].

The practical mechanism of IG — integrating gradients along a straight-line path from a baseline input to the actual input, typically approximated using 20–300 interpolation steps — is elegant but introduces a significant free parameter: the choice of baseline. A black image for vision tasks or an all-zeros embedding for text are common defaults, but the attribution profile can shift substantially depending on this choice, and no consensus has emerged about what constitutes an “optimal” baseline [7, 1]. This indeterminacy is not merely a technical inconvenience; it raises deeper questions about whether the method is truly measuring something about the model or something about the relationship between the model and an arbitrary reference point [2]. The debate over baseline selection remains unresolved and represents one of the most persistent open questions in the axiomatic attribution literature.
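
A minimal numerical sketch of this mechanism, with an invented two-variable model and an all-zeros baseline, approximates the path integral by a midpoint Riemann sum and verifies the Completeness axiom directly:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Midpoint-Riemann approximation of IG along the straight-line
    path from `baseline` to `x` (Sundararajan et al., 2017)."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * avg_grad / steps

# Smooth toy model with a known analytic gradient: f(x) = x0^2 + 3*x1.
f = lambda x: x[0] ** 2 + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0], 3.0])

x = np.array([1.0, 2.0])
baseline = np.zeros(2)            # the free parameter discussed above
attr = integrated_gradients(grad_f, x, baseline)

# Completeness check: attributions sum to f(x) - f(baseline).
gap = abs(attr.sum() - (f(x) - f(baseline)))
```

Varying `baseline` in this sketch shifts the individual attributions while the Completeness identity continues to hold, which is precisely the indeterminacy at issue in the baseline-selection debate.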

Layer-Wise Relevance Propagation for Text

Contemporaneously, [75] extended Layer-wise Relevance Propagation (LRP) to natural language processing, applying it to convolutional neural networks trained on the 20 Newsgroups benchmark. LRP operates through a backward pass that redistributes relevance from output neurons to input features according to local conservation rules, and was originally developed for image classification by Bach et al. [76], who validated it across datasets including PASCAL VOC, MNIST, and ImageNet by generating pixel-wise heatmaps that highlight evidence for or against a predicted class without requiring model retraining. The adaptation to word-embedding-based CNNs required reconsidering how relevance should flow through embedding layers—a non-trivial architectural concern. Empirically, the resulting heatmaps were found to be sparser and more semantically coherent than explanations derived from SVM-based baselines using TF-IDF features, which frequently assigned high weight to function words or proper names as artifacts of term-frequency statistics [75]. Despite achieving comparable classification accuracy (approximately 80% on 20 Newsgroups), the CNN-LRP framework produced qualitatively richer explanations. Montavon et al. further systematized the theoretical foundations of LRP alongside related propagation-based methods, clarifying the conditions under which conservation rules yield stable and interpretable attributions [6].
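
For a single linear layer, the local conservation property can be checked directly. The sketch below implements the epsilon-stabilized variant of the LRP redistribution rule with illustrative activations, weights, and relevances (a simplification of the full multi-layer procedure):

```python
import numpy as np

def lrp_epsilon(a, W, R_out, eps=1e-6):
    """Epsilon-stabilized LRP rule for one linear layer: redistribute
    output relevance R_out onto the inputs `a` in proportion to each
    input's contribution a_j * w_jk to the pre-activation z_k."""
    z = a @ W                          # pre-activations, shape (n_out,)
    denom = z + eps * np.sign(z)       # stabilizer avoids division by ~0
    return a * (W @ (R_out / denom))   # R_j = sum_k a_j w_jk R_k / z_k

# Illustrative activations, weights, and incoming relevance.
a = np.array([1.0, 2.0, 0.5])
W = np.array([[ 0.5, -1.0],
              [ 1.0,  0.5],
              [-0.5,  1.0]])
R_out = np.array([1.0, 0.5])
R_in = lrp_epsilon(a, W, R_out)

# Conservation: total relevance is (approximately) preserved by the pass.
conservation_gap = abs(R_in.sum() - R_out.sum())
```

The near-zero `conservation_gap` is the layer-local version of the conservation rules that Montavon et al. analyze; the stabilizer `eps` trades a small conservation leak for numerical robustness.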

This comparison between LRP and gradient-based methods, however, also surfaced a fundamental disagreement that the field has not resolved: the two approaches can produce divergent relevance maps for identical predictions on identical inputs. Since there is no ground-truth explanation against which to evaluate either, practitioners face a genuine epistemic impasse. Selecting between LRP and IG-style attributions often reduces to pragmatic considerations—computational cost, architectural compatibility, or domain convention—rather than principled grounds. This divergence problem compounds the more general concern about whether any post-hoc attribution method offers a faithful account of model reasoning or merely a plausible rationalization that satisfies human intuitions about salience [75, 7]. Systematic comparative evaluation of explanation methods has begun to address this problem: Samek et al. demonstrated that different techniques vary meaningfully in faithfulness, interpretability, and computational efficiency, and that LRP in particular offers substantial speed advantages over occlusion and integrated gradients — running roughly an order of magnitude faster — while maintaining competitive explanation quality [1]. Their finding that explanation methods can reveal “Clever Hans” effects, where models exploit spurious correlations rather than meaningful features, further underscores the practical stakes of choosing among these approaches: an unfaithful explanation method may not only mislead practitioners but may actively conceal the model’s reliance on artifacts [1].

Causal and Concept-Based Explanations

A distinct approach to explanation treats the network’s internal representations, rather than its inputs, as the object of analysis. [48] proposed inserting autoencoders at intermediate network depths to extract low-dimensional, interpretable concept representations from activations, then using these representations to support causal queries about network behavior. Crucially, this framework enables counterfactual reasoning: one can ask not only “why did the model predict X?” but “what would have to change for it to predict Y instead?” Such counterfactual queries are particularly valuable for debugging, bias detection, and satisfying emerging regulatory requirements for algorithmic accountability [48, 74, 22]. The autoencoded activation approach anticipates, in important ways, the Sparse Autoencoder (SAE) methodology that would later become central to mechanistic interpretability [18, 42]. SAEs address the problem of polysemanticity—where individual neurons activate for multiple unrelated features—by learning sparse dictionaries of monosemantic feature directions from activations, enabling both interpretability scoring and precise causal interventions such as activation patching [18]. An outstanding question is whether the causal structure extracted by autoencoded concept representations can be integrated with or compared against the feature decompositions provided by SAEs [41].
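
The core of the SAE recipe can be sketched compactly: a ReLU encoder into an overcomplete dictionary, a linear decoder, and a reconstruction objective with an L1 sparsity penalty. All dimensions, coefficients, and parameter values below are illustrative placeholders rather than settings from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 8, 32    # activation width, overcomplete dictionary size

# Randomly initialized, illustrative SAE parameters.
W_enc = rng.normal(0.0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """ReLU encoder into an overcomplete feature dictionary, then a
    linear decoder back to activation space."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Training objective: reconstruction error plus an L1 penalty that
    pushes most feature activations to exactly zero."""
    f, x_hat = sae_forward(x)
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()

x = rng.normal(size=d_model)   # stand-in for a model's internal activation
loss = sae_loss(x)
```

After training, rows of `W_dec` serve as candidate monosemantic feature directions, and clamping or ablating entries of `f` supports the causal interventions described above.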

Adversarial Examples as Contrastive Explanation Tools

[77] extended the conceptual repertoire of explainability by arguing that adversarial examples and spoofing techniques—typically studied as security vulnerabilities—can be repurposed as explanatory instruments. By examining where a model’s classification boundary lies and what minimal perturbations suffice to cross it, one can generate contrastive explanations that answer the question “why this prediction rather than that one?”—arguably the form of explanation most aligned with how humans naturally seek to understand decisions [78, 23]. Boundary analysis of this kind probes behavior in regions of input space that standard attribution methods ignore, offering complementary diagnostic information about model fragility and decision structure [77]. This reframing of adversarial phenomena as explanatory resources rather than mere failure modes was prescient: contrastive explanation has since become a recognized desideratum in the XAI literature [22, 79], with surveys explicitly cataloguing it as a distinct and practically valuable explanation type [23, 80].
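
For a linear model this kind of contrastive query has a closed-form answer: the minimal perturbation that flips the prediction is a projection onto the decision hyperplane plus an infinitesimal step across it. A toy sketch with invented weights:

```python
import numpy as np

# Toy linear classifier: class 1 iff w @ x + b > 0 (weights illustrative).
w = np.array([2.0, -1.0])
b = -0.5

def predict(x):
    return int(w @ x + b > 0)

def minimal_flip(x, margin=1e-6):
    """Smallest L2 perturbation that crosses the decision boundary:
    project x onto the hyperplane w @ x + b = 0 and step just past it,
    answering 'what would have to change for the other prediction?'."""
    score = w @ x + b
    delta = -(score + np.sign(score) * margin) * w / (w @ w)
    return x + delta

x = np.array([1.0, 0.0])       # predicted class 1
x_cf = minimal_flip(x)         # contrastive counterpart, class 0
```

For deep networks no such closed form exists, and the adversarial-perturbation machinery discussed above serves as the search procedure for these boundary-crossing counterfactuals.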

However, recent work has revealed that the relationship between adversarial methods and interpretability cuts in both directions. Nanfack and colleagues demonstrated that activation maximization—a widely used feature visualization technique, established in foundational work by Yosinski et al. [5], that is designed to reveal the input patterns that most strongly activate specific neurons—can be systematically deceived through adversarial fine-tuning, with minimal degradation to classification accuracy [81]. Three distinct attack strategies were operationalized: a push-down attack that erases the existing interpretation of a neuron, a push-up attack that substitutes a targeted but false interpretation, and a fairwashing attack that conceals model biases by replacing discriminatory activations with benign-appearing ones [81]. The efficiency of these attacks means that feature visualizations cannot be treated as ground truth even in high-stakes contexts [82]—a finding with troubling implications for interpretability pipelines that rely on visualization as an auditing tool. Taken alongside the probing methodology concerns reviewed in Section 5 [39], this result reveals that two of the field’s most widely used post-hoc interpretability methods are vulnerable from opposite directions: one to inadvertent methodological confounds, the other to intentional deception.
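
The technique under attack is itself straightforward to state: gradient ascent on the input with respect to a unit's activation. The sketch below applies it to an invented one-neuron model; real feature-visualization pipelines add regularizers (jitter, frequency penalties) that are omitted here:

```python
import numpy as np

# Toy 'neuron': activation = tanh(v @ x) for a fixed direction v.
v = np.array([1.0, -2.0, 0.5])

def activation(x):
    return np.tanh(v @ x)

def activation_maximization(steps=200, lr=0.1, seed=0):
    """Gradient ascent on the input to maximize the neuron's activation,
    the basic feature-visualization recipe."""
    x = 0.01 * np.random.default_rng(seed).normal(size=v.size)
    for _ in range(steps):
        grad = (1.0 - np.tanh(v @ x) ** 2) * v   # d/dx tanh(v @ x)
        x += lr * grad
    return x

x_opt = activation_maximization()   # input that drives the neuron high
```

The deception results cited above show that fine-tuning can alter what such an optimized input looks like without materially changing the model's classifications, which is why the output of this procedure cannot be treated as ground truth.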

Synthesis and Outstanding Gaps

Together, these classical methods established the foundational vocabulary of the field—attribution, relevance, causality, contrastiveness—and formalized the normative standards against which newer approaches are evaluated [1, 6]. Yet several limitations are structural rather than merely technical. First, systematic comparisons between classical attribution methods and more recent mechanistic approaches, such as SAE-based feature decomposition [18, 42], on identical models and tasks remain largely absent from the literature [9], making it difficult to assess relative faithfulness. Second, the adaptation of gradient-based methods to modern autoregressive language models at scale introduces architectural complexities—attention mechanisms, residual streams, tokenization—that the original frameworks were not designed to handle [10, 2]. Third, the adversarial vulnerability of feature visualization [81] raises urgent questions about the evidential status of interpretability conclusions throughout the preceding decade that relied on such methods. Fourth, and perhaps most fundamentally, user studies evaluating whether classical attributions actually improve human decision-making with AI systems are sparse, leaving the practical value of these methods underexplored [77, 7, 75]. The classical tradition built the theoretical foundations; the field still awaits definitive evidence that those foundations translate into reliable epistemic tools for practitioners.

The concepts and normative standards forged across these classical lineages — what counts as a faithful attribution, what desiderata a sound explanation must satisfy, how causality and contrastiveness complement input-level salience — did not remain confined to primary research. They became the shared vocabulary that subsequent survey and taxonomy literature would seek to codify, systematize, and extend to the broader landscape of explainable AI [22, 23]. It is precisely this synthesizing ambition, and the epistemological commitments it reflects, that motivates the meta-level frameworks examined in the next section.

8. Surveys, Taxonomies, and Frameworks for Explainable AI

The field of explainable artificial intelligence (XAI) has generated a substantial meta-literature of surveys, taxonomies, and programmatic frameworks that collectively define its vocabulary, map its design space, and identify its persistent open problems. These synthesizing works do not merely catalog techniques — they reflect evolving epistemological commitments about what explanation means, who it serves, and whether the challenges posed by modern machine learning are genuinely novel or are recapitulations of older scientific problems. Tracing this meta-literature reveals a field that has matured considerably since its early institutional consolidation, while simultaneously confronting new challenges introduced by large language models and generative AI that strain the adequacy of existing conceptual frameworks.

Institutional Foundations and Early Program Logic

The modern XAI research agenda was substantially shaped by the DARPA Explainable Artificial Intelligence program, whose rationale and architecture were articulated by [83]. Gunning and Aha identified a fundamental tension at the heart of contemporary machine learning: the systems achieving the highest predictive performance — deep neural networks in particular — are precisely those whose internal workings are least interpretable to human users [1, 2]. The DARPA program responded by organizing research around three strategic thrusts: deep explanation, which modifies neural network architectures to produce inherently more interpretable representations; interpretable models, which favor learning structured, causal relationships over opaque statistical patterns; and model induction, which infers post-hoc explanations by treating an existing model as a black box and approximating its behavior through more legible surrogates — an approach instantiated in influential methods such as LIME [84] and SHAP [85]. This tripartite taxonomy was influential not simply because it organized a funding program, but because it established a conceptual grammar — the distinction between intrinsic interpretability and post-hoc explainability — that subsequent surveys largely inherited and refined [83, 23, 22, 86].
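
The model-induction idea is concretely visible in a LIME-flavored sketch: sample perturbations around an input, weight them by proximity, and fit a weighted linear surrogate whose coefficients serve as the local explanation. The black-box function, kernel width, and sample count below are illustrative, and the sketch omits LIME's feature binarization and selection steps:

```python
import numpy as np

rng = np.random.default_rng(1)

def black_box(X):
    """Opaque model to be explained (stand-in for any trained model)."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def local_surrogate(x0, n_samples=500, width=0.3):
    """Model induction, LIME-style: fit a proximity-weighted linear
    surrogate around x0 and read off its coefficients as the local
    explanation."""
    X = x0 + rng.normal(0.0, width, size=(n_samples, x0.size))
    y = black_box(X)
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * width ** 2))
    A = np.hstack([X, np.ones((n_samples, 1))])   # intercept column
    Aw = A * w[:, None]
    beta = np.linalg.solve(Aw.T @ A, Aw.T @ y)    # weighted least squares
    return beta[:-1]

# Around (0, 1) the true local sensitivities are cos(0) = 1 and 2*1 = 2.
coef = local_surrogate(np.array([0.0, 1.0]))
```

The surrogate recovers the local sensitivities without any access to the black box's internals, which is both the appeal of model induction and the root of its fidelity limitations.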

Taxonomic Elaboration: Global, Local, and Post-Hoc Methods

Building on the program-level framing, subsequent years produced comprehensive surveys aimed at systematizing the growing proliferation of XAI techniques. [74] provided a broad review of methods for interpreting black-box models, organizing the design space around the axis of global versus local interpretability. Global methods seek holistic accounts of model behavior — which features a model relies on, what decision boundaries it has learned, how it partitions the input space across the entire dataset. Local methods, by contrast, explain individual predictions, tracing why a specific input received a specific output. [74] concluded that local methods such as LIME [84] and Shapley-value-based attribution [85] have achieved wider practical adoption, particularly for complex models, because they sidestep the computational and conceptual difficulty of characterizing full model behavior. This pragmatic observation carries an implicit tension: local explanations may satisfy immediate user queries without yielding the deeper structural understanding that safety-critical deployment arguably demands — a tension further sharpened by the contrast between post-hoc explanation and inherently interpretable model design [86].

The maturation of this taxonomic enterprise is further evidenced by the emergence of meta-level syntheses that survey the surveys themselves. Schwalbe and Finzel [22] synthesized over 70 prior XAI surveys published between 2010 and 2021, constructing a unified classification scheme organized around seven significant criteria: task type, form of interpretability, model-agnostic versus model-specific scope, local versus global explanation granularity, the object being explained, presentation form, and explanation type. This reflexive contribution exposed the inconsistencies and terminological ambiguities that had impeded cross-study comparison, providing a shared grammar that subsequent researchers can apply consistently [22]. A complementary framework by Yang and colleagues [80] approached the same landscape from a human-centric angle, proposing a tripartite structure organized by source orientation, representation orientation, and logic orientation, and mapping XAI approaches across six major application domains including medical, healthcare, cybersecurity, finance and law, education, and civil engineering. The implication of this domain-specific framing is important: explainability is not a domain-neutral property but a relational one, shaped by the epistemic needs, regulatory constraints, and user expectations of specific fields [80]. Broader surveys of supervised machine learning have similarly emphasized this context-dependence, noting that no single method dominates across model types and deployment settings [23]. Taken together, these taxonomic projects signal that by 2023, the XAI community had moved decisively from asking “what methods exist?” toward asking “how should methods be classified, evaluated, and matched to context?”

[27] extended this taxonomic work with a focused literature survey on interpretability research specifically within deep learning. This survey underscored that the black-box problem carries concrete downstream risks — adversarial vulnerabilities, misplaced trust in medical imaging systems, unpredictable failures in autonomous driving — and argued that uninterpretable models are not merely epistemically unsatisfying but operationally dangerous in high-stakes applications [27]. These concerns are corroborated by parallel work examining explanation methods for deep neural networks across vision and language domains, which highlights the gap between what post-hoc methods reveal and what robust operational oversight requires [1]. Importantly, the survey treated the interpretability challenge as genuinely new territory opened by deep learning’s architectural complexity [2], a framing that, as we shall see, has not gone uncontested.

Epistemological Challenge: The Historical Continuity Argument

A significant dissenting voice in this literature comes from [87], whose critical philosophical examination of machine learning’s foundations argues that the novelty narrative surrounding deep learning is largely overstated. Barbierato and Gatti contend that the vast majority of ML methods rest on statistical principles developed over centuries — Bayesian inference, regression, probabilistic reasoning — with genuine novelty confined to architectural advances in deep neural networks that emerged in the 2010s [87, 88]. More consequentially for the XAI discourse, they foreground that ML’s reliance on inductive reasoning introduces irreducible probabilistic uncertainty: any inference from training data to general rules is epistemically defeasible, subject to subjective weighting of evidence, and systematically liable to produce models that are unreliable or unfair in ways that post-hoc explanation methods cannot fully diagnose or remedy [87, 23]. This critique resonates with independent work observing that post-hoc explanation tools such as LIME [84] and SHAP [85] approximate complex decision logic after the fact, inevitably losing information through dimensionality reduction and compounding evaluation error by requiring separate assessment of both predictive performance and explanation fidelity [86]. Crucially, neither paradigm — post-hoc explainability nor inherently interpretable modeling — supports causal inference, as both reflect correlational learning incapable of disentangling causation from association [86, 47]. This fundamental gap between statistical association and causal mechanism has been independently theorized as a structural distinction separating explanatory from predictive modeling goals [49], and is further underscored by work on causability in medical AI that distinguishes the ability to produce explanations from the ability to ground them in causal understanding [89].
This critique cuts against surveys that frame explainability tools as adequate responses to the interpretability deficit, suggesting instead that explanation overlays cannot compensate for the fundamental epistemic fragility of inductively derived models. The tension between [87]’s skepticism and the constructive optimism of surveys like [74] and [27] remains unresolved in the literature.

The Paradigm Distinction: Post-Hoc Explanation Versus Inherent Interpretability

A related and increasingly consequential debate concerns whether post-hoc explainability and inherent interpretability represent positions on a single spectrum or constitute epistemically distinct paradigms. Zschech and colleagues [86] argue for the latter view, contending that inherently interpretable machine learning (IIML) — encompassing models such as decision trees, rule lists, and interpretable neural architectures — builds transparency into the model’s structure from the outset, rather than approximating it after the fact through surrogate models or attribution scores [84]. The practical weight of this distinction is considerable: post-hoc methods generate explanations that may not faithfully represent the model’s actual decision process [23], introducing a risk of explanatory artifacts that provide false assurance in high-stakes settings. This fidelity gap — the divergence between a surrogate explanation and the underlying model behavior — has been identified as a structural limitation of approaches such as LIME and SHAP, which locally approximate complex decision boundaries without guaranteeing global faithfulness [74]. Zschech et al. argue that IIML is particularly well-suited to structured tabular data and domains requiring full accountability, such as clinical decision support and credit scoring, where the cost of an incorrect explanation may be as severe as the cost of an incorrect prediction [86]. In clinical settings specifically, research on explainable AI in decision support systems confirms that practitioners require not merely accurate predictions but transparent reasoning they can interrogate, compare against clinical judgment, and use to justify decisions to patients and regulators [82]. This sharpens the earlier observation by Yang and colleagues [80] that healthcare and finance impose interpretability demands many post-hoc methods struggle to satisfy, elevating it from a practical caveat to a paradigmatic argument.
For the mechanistic interpretability program, this debate carries a specific implication: SAE-based decomposition and circuit analysis are post-hoc methods applied to opaque architectures, and the question of whether their outputs constitute genuine transparency or sophisticated approximation is one the field has not yet settled.

Extending Taxonomies to Large Language Models

The emergence of large language models (LLMs) as dominant AI systems required the existing taxonomic frameworks to be substantially extended. [90] provided a comprehensive taxonomy of explainability techniques specifically for Transformer-based language models, identifying a structural distinction between explanations applicable to fine-tuning-based paradigms — where a pre-trained model is adapted via supervised training — and those applicable to prompting-based paradigms, where behavior is steered through natural language inputs without weight modification. This distinction is consequential because the mechanisms underlying model behavior differ across these paradigms, and attribution methods designed for fine-tuned systems do not transfer straightforwardly to prompting contexts [90, 9]. The survey retained the global/local axis from earlier taxonomic work but showed that for LLMs, global explanations increasingly concern what factual and relational knowledge is encoded in model weights [45, 25], while local explanations address the contextual sensitivity of individual generations — a different set of questions than those posed by discriminative classifiers. The broader implication, consistent with the domain-specific framing of [80] and [22], is that the conceptual vocabulary developed for tabular and image-based models transfers only partially to the LLM setting, necessitating purpose-built taxonomies. This view is corroborated by [9], whose stakeholder-oriented analysis of over 14,000 interpretability papers finds that the rise of LLMs has caused a sharp increase in natural language explanation methods while exposing the limited reach of attribution techniques originally developed for discriminative architectures.

[91] pushed further, arguing that generative AI introduces explanation desiderata that fall outside the scope of classical XAI entirely. The GenXAI framework proposed by Schneider introduces new scope dimensions — output scope (whether an explanation covers a focused aspect or the holistic behavior of a generation) and interaction scope (whether explanation occurs in a single turn or through multi-turn dialogue) — alongside novel requirements for verifiability, security, and cost sensitivity in explanation generation [91]. The interactivity dimension is particularly notable as a desideratum that existing systems rarely implement in practice [9], signaling a substantial gap between the field’s stated aspirations and its current capabilities.

Machine Unlearning as Transparency Infrastructure

A further extension of the interpretability landscape, less anticipated in earlier surveys, concerns the capacity to selectively remove information from trained models. Machine unlearning has emerged as a transparency and compliance mechanism in response to regulatory frameworks such as the right to be forgotten — codified, for instance, in the EU’s General Data Protection Regulation (GDPR) — enabling organizations to demonstrate that specific training data or knowledge has been excised from a deployed model. A foundational concern motivating this regulatory pressure is the well-documented tendency of neural networks to memorize atypical or rare training examples [92], making verified removal of personal data a non-trivial technical challenge rather than a matter of simple deletion. A comprehensive survey of unlearning methods for LLMs identifies four main categories of approach: global weight modification, local weight modification, architecture modification, and input/output modification, each presenting different trade-offs among forgetting exactness, utility preservation, and computational cost [93]. Global weight modification approaches tend to be computationally expensive but offer the most thoroughgoing forgetting; input/output modification approaches are cheaper but less reliable, since the underlying knowledge remains encoded in model weights and may be accessed through alternative prompting strategies [93]. This concern is grounded in empirical evidence: studies of knowledge distribution in transformer models have demonstrated that factual knowledge is not confined to final layers but is distributed across intermediate representations, with 18–33% of relational knowledge residing in layers other than the last [94] — a finding which implies that surface-level forgetting interventions may leave substantial residual knowledge accessible through non-standard access paths. 
Complementary work on knowledge neurons in pretrained transformers further confirms that factual associations are encoded in specific, distributed MLP weights rather than any single architectural locus [45], reinforcing why input/output-level interventions cannot reliably guarantee complete forgetting. Framing unlearning as a form of targeted model transparency positions it within the broader XAI discourse, since the ability to verify what a model does and does not know is itself an interpretability question — and one with direct safety implications, as incomplete unlearning may leave harmful capabilities latent within model representations [10].
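
The simplest instance of the global-weight-modification category can be sketched as gradient ascent on a forget set interleaved with continued descent on a retain set. Everything below (the toy logistic model, data clusters, learning rates, and step counts) is an illustrative invention, not a method from the cited surveys; it merely shows the utility-versus-forgetting dynamic:

```python
import numpy as np

rng = np.random.default_rng(2)

def loss_grad(w, X, y):
    """Mean logistic loss and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, X.T @ (p - y) / len(y)

def with_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

# Retain data: two well-separated clusters; forget data: a third cluster
# whose (label, location) association the model should unlearn.
X_r = with_bias(np.vstack([rng.normal([+2, 0], 0.4, (50, 2)),
                           rng.normal([-2, 0], 0.4, (50, 2))]))
y_r = np.array([1.0] * 50 + [0.0] * 50)
X_f = with_bias(rng.normal([0, 2.5], 0.3, (20, 2)))
y_f = np.zeros(20)

# Standard training on retain + forget data.
w = np.zeros(3)
X_all, y_all = np.vstack([X_r, X_f]), np.concatenate([y_r, y_f])
for _ in range(300):
    w -= 0.5 * loss_grad(w, X_all, y_all)[1]
before = loss_grad(w, X_f, y_f)[0]

# 'Global weight modification' unlearning: ascend on the forget set
# while continuing to descend on the retain set.
for _ in range(20):
    w -= 0.2 * loss_grad(w, X_r, y_r)[1]   # preserve utility
    w += 0.2 * loss_grad(w, X_f, y_f)[1]   # forget
after = loss_grad(w, X_f, y_f)[0]
retain_acc = np.mean(((X_r @ w) > 0) == (y_r == 1))
```

In this toy setting the forget-set loss rises while retain accuracy is preserved; the surveys' point is that no such clean separation is guaranteed at scale, and that an apparently forgotten association may remain recoverable under other probes.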

Critically, the evaluation of unlearning methods suffers from the same fragmentation that has plagued XAI research more broadly: there is a lack of standardized benchmarks, inconsistent metrics, and methodological mismatches between forgetting target types and real-world objectives [93]. A model that appears to have forgotten sensitive information under one evaluation protocol may retain it under another. This diagnostic mirrors the concerns raised by the meta-survey of XAI methods [22] — which identifies evaluation inconsistency as a cross-cutting structural problem spanning the full landscape of explainability research [22] — suggesting that evaluation fragmentation is a systemic challenge for the field rather than a problem confined to any single subarea. Broader surveys of XAI approaches similarly flag the absence of shared evaluation standards as an impediment to cumulative scientific progress [80], underscoring that this is not merely a peripheral concern but a central obstacle to the maturation of both explainability and unlearning research.

Stakeholder Divergence and Cross-Domain Complexity

One of the most empirically grounded recent contributions to the meta-literature is the large-scale trend analysis by [9], which draws on over 14,000 papers to examine how explanation preferences vary across stakeholder communities. Calderon and Reichart document that within NLP, the paradigm has been relatively stable over a decade, with LLMs causing a sharp recent increase in the use of natural language explanations. More strikingly, they find that different fields have developed divergent and partly incompatible preferences: healthcare communities tend to prefer local explanations oriented toward individual clinical decisions, while neuroscience gravitates toward global, probing-based methods that characterize what representations a model has learned [9]. This preference for local explanations in clinical settings is well documented: reviews of XAI integration into clinical decision support systems find that nearly all deployed systems provide local explanations for specific predictions, most often via model-specific ante-hoc techniques such as rule-based systems and case-based reasoning, alongside visualization methods such as SHAP plots and attention mechanisms [82]. User studies in these settings further demonstrate that such explanations facilitate clinician trust and enable integration into clinical workflows, but that practitioners additionally require transparency about model limitations and capacities — needs that global explanatory approaches are poorly positioned to address [82].

This fragmentation suggests that no single explanatory framework will satisfy all stakeholder communities — a finding with direct implications for regulatory frameworks that implicitly assume explanations can be standardized [95, 22]. The challenge is compounded by the psychological dimension of explanation adequacy: research on the cognitive and motivational underpinnings of XAI acceptance finds that what constitutes a “good” explanation is deeply dependent on the recipient’s prior knowledge, goals, and mental models, meaning that technical fidelity to a model’s computations is neither necessary nor sufficient for explanation utility [78, 79]. Hoffman et al. further distinguish multiple measurable dimensions of explanation quality — including goodness, satisfaction, and the degree to which explanations support accurate mental model formation — none of which are straightforwardly captured by formal faithfulness metrics [79]. The medical domain, in particular, highlights the stakes of this fragmentation: Holzinger et al. have argued that “causability” — the extent to which an explanation achieves a specified level of causal understanding for a human expert — should be distinguished from mere explainability, because an explanation that is technically faithful to a model’s internals may nevertheless fail to support the causal reasoning that clinical decision-making requires [89]. This distinction is consequential for how the field evaluates interpretability tools: a method that satisfies formal faithfulness criteria may nonetheless be inadequate if its outputs do not translate into actionable causal understanding for domain experts [89, 78].

Persistent Gaps and Open Questions

Across this meta-literature, several gaps remain consistently unaddressed. No survey yet integrates mechanistic interpretability approaches — circuit-level analysis [3], sparse autoencoders [18, 42] — with the classical XAI taxonomies reviewed here within a unified framework; the two literatures coexist largely in parallel, despite growing awareness that post-hoc and inherent interpretability represent distinct paradigmatic commitments [86, 10]. The proliferation of XAI methods continues to outpace the development of unified, empirically validated evaluation benchmarks [22, 93]. Empirical validation of whether explanations actually improve trust, safety, or decision-making outcomes in deployed systems remains sparse across all surveys reviewed [74, 9, 91] — a deficiency underscored by Hoffman et al. [79], who find that even the measurement constructs for explanation goodness, user satisfaction, and trust remain poorly standardized across the field. The GenXAI agenda identifies interactivity and multi-turn explanation as critical desiderata, but its authors acknowledge that few concrete systems implement these features [91]. Cross-cultural and cross-domain studies of explanation preferences are largely absent [78], limiting the generalizability of stakeholder analyses like [9]; the psychological and cognitive foundations that should anchor such studies remain underdeveloped relative to the technical literature.

What this catalogue of gaps collectively reveals is a field whose conceptual infrastructure — its taxonomies, classification schemes, and programmatic frameworks — has matured substantially faster than the empirical validation apparatus needed to confirm that any of it works in practice. The surveys reviewed here have achieved impressive systematization: they have standardized terminology, mapped design spaces, and articulated normative desiderata with increasing precision. Yet the benchmarks, user studies, and cross-domain evaluations that would establish whether these frameworks translate into reliable epistemic tools remain conspicuously underdeveloped [79, 80]. Compounding this asymmetry, the rapid emergence of LLM-specific desiderata — the GenXAI framework’s interactivity and output-scope dimensions [91], the compliance demands driving machine unlearning [93], the prompting-versus-fine-tuning distinction that restructures the explanation problem for Transformers [90] — signals something more than incremental extension of prior work. Calderon & Reichart [9] corroborate this view, documenting how stakeholder needs in the LLM era diverge substantially from those addressed by frameworks designed for discriminative classifiers. Taken together, these developments indicate a paradigm shift in what explanation is being asked to do and for whom, one that existing frameworks developed for discriminative classifiers and tabular data accommodate only partially [39]. The Discussion that follows takes stock of this situation across the field as a whole, asking whether the convergence of mechanistic methods, classical XAI, and LLM-specific frameworks is producing genuine theoretical progress or an increasingly elaborate scaffold awaiting its empirical foundations.

9. Discussion

The past two to three years have produced a meaningful shift in what the interpretability community can credibly claim to know about the internal workings of neural networks. Where earlier work offered either coarse attribution heatmaps or narrow probing results, the convergence of sparse autoencoders, circuit-level dissection, direct algorithmic reverse-engineering, training dynamics analysis, and large-scale empirical probing has begun to yield something closer to a mechanistic account — one in which identifiable computational units can be tracked across layers and across training time, connected to specific behaviors, and increasingly subjected to causal intervention. The picture that emerges from synthesizing across all thematic clusters is both encouraging and sobering: the field has genuinely advanced its toolkit, but the distance between methodological capability and theoretical understanding remains large.

From Correlation to Mechanism — and the Gap That Remains

Perhaps the most consequential development reflected in the recent literature is the partial transition from correlational to causal interpretability. Classical gradient-based and attribution methods, which dominated the field through the early 2020s, were largely descriptive: they identified which inputs influenced outputs, but could say little about how that influence was mediated [1, 6]. Probing studies extended this by revealing that syntactic, semantic, and relational information is encoded in organized, layer-specific representations — with early work showing that models such as BERT recapitulate classical NLP pipeline stages across successive layers [12, 11] — but probing accuracy is itself a correlational measure, and the theoretical status of probe-discovered features has always been ambiguous. Recent work demonstrating that probing classifiers may exploit surface statistical regularities rather than genuine encoding [39] has sharpened this concern, as has the demonstration that feature visualization methods can be adversarially manipulated without degrading model accuracy [81].

SAE-based decomposition and direct circuit discovery through causal mediation represent genuine steps toward causation. Features extracted from sparse dictionaries can be selectively ablated or amplified and their downstream effects measured; causal mediation analysis can isolate the specific model components necessary for arithmetic computation [35] or factual recall [25]. Recent work has demonstrated that this approach can expose polysemanticity in MLP neurons, recover functionally coherent circuits for tasks such as indirect object identification [14] and modular arithmetic [16], and reveal that models spontaneously develop linearly decodable world models from sequence prediction alone [46]. Yet the causal sufficiency question — whether the circuits SAEs and mediation analyses identify are the circuits the model actually uses, or merely a plausible reconstruction of them — remains unresolved. This is not a minor technical footnote; it goes to the heart of whether mechanistic interpretability is discovering computational reality or producing interpretable fictions that happen to be predictively useful.
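The ablation logic described here can be made concrete with a minimal sketch. The toy example below (illustrative names and shapes throughout; no real model or published SAE implementation is assumed) shows how zeroing a single dictionary feature produces a measurable, localized shift in the reconstructed activation — the basic unit of SAE-based intervention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse-autoencoder dictionary: d_model-dim activations decomposed
# into n_features sparse codes. All names here are illustrative.
d_model, n_features = 8, 32
W_dec = rng.normal(size=(n_features, d_model))       # decoder directions
W_enc, b_enc = W_dec.T.copy(), np.zeros(n_features)  # tied encoder (a toy choice)

def sae_decompose(x):
    """Encode an activation into sparse feature coefficients (ReLU)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def reconstruct(coeffs):
    """Map sparse coefficients back to activation space."""
    return coeffs @ W_dec

x = rng.normal(size=d_model)
coeffs = sae_decompose(x)

# Causal-style intervention: ablate (zero) the most active feature and
# measure the shift this induces in the reconstructed activation.
feature_id = int(coeffs.argmax())
ablated = coeffs.copy()
ablated[feature_id] = 0.0

effect = np.linalg.norm(reconstruct(coeffs) - reconstruct(ablated))
# For a single ablation the effect is exactly |coefficient| times the
# norm of that feature's decoder direction — a localized, predictable shift.
assert np.isclose(effect, coeffs[feature_id] * np.linalg.norm(W_dec[feature_id]))
```

In a real setting the reconstruction would be written back into the model's residual stream and the downstream behavioral change measured; the sketch isolates only the decomposition-and-ablate step.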

The causal representation learning framework articulated by Schölkopf et al. provides the formal vocabulary needed to sharpen this concern [47]. Their Independent Causal Mechanism principle — that autonomous causal mechanisms in a system should be independently manipulable without influencing one another — offers a precise criterion against which circuit-discovery claims can be evaluated. A discovered circuit satisfies this criterion if intervening on it produces predictable, localized effects without cascading disruption to unrelated computations; a circuit that fails this test may be an artifact of the decomposition method rather than a genuine computational module. Crucially, this criterion also exposes the limits of activation patching as currently practiced: an intervention that produces the expected output change does not, by itself, establish that the pathway through which it operated is the one hypothesized. The structural causal model framework demands more — that interventions be evaluated across contexts and that the independence of identified mechanisms be explicitly tested. The broader distinction between explanatory and predictive adequacy reinforces this point: as Shmueli has argued in the statistical modeling literature, inferring that a model explains a phenomenon from its ability to predict outcomes is a category error that pervades quantitative science [49]. Applied to mechanistic interpretability, this means that high reconstruction fidelity, successful activation patching, or accurate probing classification may each constitute evidence of predictive utility without establishing explanatory sufficiency in the causal sense. The philosophical argument that mechanistic interpretability requires a “coordinated discovery” strategy — integrating multiple intervention and probing methods rather than relying on any single technique [4] — further underscores the inadequacy of individual methods in isolation for establishing genuine causal understanding. 
The field’s transition from correlation to mechanism is therefore best understood not as accomplished but as formally articulated — the criteria are now visible, even if they are not yet routinely met.
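The limits of single-intervention evidence can be illustrated with a toy activation-patching sketch (all weights and shapes here are invented for illustration, not any real model API). Patching one hidden unit from a clean run into a corrupted run measures that unit's contribution, but with a nonlinear readout the single-unit effects need not compose additively — precisely the interaction structure that context-spanning, independence-testing evaluation is meant to probe:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer model with a nonlinear readout; purely illustrative.
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 1))

def forward(x, patch=None):
    """Run the toy model; `patch` overwrites chosen hidden units
    (the core move of activation patching)."""
    h = np.tanh(x @ W1)
    if patch is not None:
        h = h.copy()
        for idx, value in patch:
            h[idx] = value
    return float(np.tanh(h @ W2))

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)
h_clean = np.tanh(x_clean @ W1)

# Patch each hidden unit from the clean run into the corrupted run and
# record how much it moves the output toward the clean value.
effects = [
    forward(x_corrupt, patch=[(i, h_clean[i])]) - forward(x_corrupt)
    for i in range(6)
]

# Patching the full hidden state recovers the clean output exactly, yet
# because the readout is nonlinear, the single-unit effects need not sum
# to that gap — interventions interact, which is why single patches
# cannot by themselves certify a hypothesized pathway.
full_patch = forward(x_corrupt, patch=[(i, h_clean[i]) for i in range(6)])
assert np.isclose(full_patch, forward(x_clean))
```

The sketch makes the structural causal point tangible: a successful patch establishes influence, while establishing an independent mechanism requires testing interventions jointly and across contexts.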

The Developmental Turn: Training Dynamics as Interpretability Evidence

Among the most significant intellectual developments synthesized in this review is the emergence of developmental interpretability as a complement to static circuit analysis. The traditional interpretability paradigm analyzes a trained model at a single point — the end of training — and asks what structures it contains. The developmental paradigm, by contrast, asks when and how those structures form, treating the training trajectory itself as a primary source of mechanistic evidence [10].

The grokking literature provides the clearest demonstration of this paradigm’s value. The finding that generalization in modular arithmetic corresponds not to a sudden behavioral shift but to the progressive formation of a specific Fourier multiplication circuit — through ordered phases of memorization, circuit formation, and cleanup — establishes that training dynamics carry interpretability information that final-model analysis alone cannot recover [16]. Crucially, quantitative progress measures tracking representational structure (such as restricted loss and Fourier dimensionality) diverge sharply from behavioral accuracy across these phases, demonstrating that internal circuit organization and task performance are dissociable signals during training [16]. A model examined only at its final checkpoint reveals a clean generalizing circuit; the training trajectory reveals that this circuit coexisted with memorization structures for an extended period, becoming behaviorally dominant only after competing representations were pruned. This temporal information is directly safety-relevant: if capabilities can be latent in circuit structure before manifesting behaviorally, then monitoring training dynamics may detect emerging capabilities — including potentially dangerous ones — before behavioral evaluations register them [10, 52].
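The kind of structural progress measure discussed above can be sketched in miniature. The toy example below is not the published implementation — the "embedding tables" are synthetic — but it contrasts an unstructured embedding with one built from a few key frequencies, of the sort found in modular-arithmetic grokking circuits, using a simple Fourier-concentration score:

```python
import numpy as np

def fourier_concentration(embed, k=3):
    """Fraction of an embedding table's power captured by its top-k Fourier
    frequencies — a toy analogue of grokking-style progress measures."""
    # rFFT over the token axis: rows are tokens 0..p-1, columns are dims.
    spectrum = np.abs(np.fft.rfft(embed, axis=0)) ** 2
    power_per_freq = spectrum.sum(axis=1)
    top_k = np.sort(power_per_freq)[::-1][:k]
    return float(top_k.sum() / power_per_freq.sum())

p, d = 97, 16
rng = np.random.default_rng(2)
tokens = np.arange(p)

# A "memorizing" embedding: unstructured noise, power spread across frequencies.
noisy = rng.normal(size=(p, d))

# A "generalizing" embedding: each dimension is a sinusoid at one of a few
# key frequencies, mimicking the structure of learned Fourier circuits.
freqs = rng.choice(np.arange(1, p // 2), size=3, replace=False)
structured = np.stack(
    [np.cos(2 * np.pi * freqs[i % 3] * tokens / p) for i in range(d)], axis=1
)

# The structural metric separates the two regimes sharply, even though a
# behavioral metric evaluated mid-training might not yet distinguish them.
assert fourier_concentration(structured) > fourier_concentration(noisy)
```

Tracked across checkpoints, a score of this kind is exactly the sort of signal that diverges from behavioral accuracy during the memorization-to-generalization transition.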

The scale-invariant developmental curriculum documented across OPT models [17] extends this insight to naturalistic training at frontier scale. The finding that perplexity, rather than model size, is the primary predictor of a model’s developmental stage suggests that capability emergence follows structured, predictable trajectories that could, in principle, be monitored as a safety diagnostic. Complementary work mapping learning curves across deep networks further supports the view that training progression is law-governed rather than arbitrary, with representational milestones appearing in consistent sequence [37]. The mathematical formalization of token dynamics as mean-field particle systems [51] provides theoretical grounding for why such structured trajectories arise, connecting gradient flow dynamics to the two-timescale clustering phenomena observed empirically. Taken together, these results suggest that the field should invest in longitudinal interpretability tools — methods capable of tracking representational change across training steps — as a standard component of both research methodology and safety evaluation infrastructure.

Cross-Theme Tensions and Productive Frictions

The juxtaposition of probing findings with SAE findings surfaces an important unresolved question: the two methodological traditions have rarely been applied to the same models with the goal of comparing or reconciling their outputs. Probing research has established, with considerable depth, what is encoded in encoder-based transformer representations — particularly in bidirectional models such as BERT, where probing studies have traced the layerwise emergence of syntactic and semantic structure [12, 11]. SAE research, by contrast, has largely proceeded on decoder-only autoregressive models [18, 42, 96, 26], where the architectural and training differences are substantial. Systematically mapping whether probe-discovered representations and SAE-discovered features converge on the same latent geometry would substantially clarify whether these methods triangulate on a shared ground truth or are each sensitive to different aspects of model computation. The finding that probing may exploit surface features rather than genuine encoding [39] makes this comparison all the more urgent.
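One simple form such a probe-versus-SAE comparison could take is geometric: checking whether a probe's learned direction aligns with any SAE decoder direction. The sketch below is hypothetical throughout (synthetic dictionary, planted probe direction) and illustrates only the measurement, not an empirical result:

```python
import numpy as np

def max_cosine_match(probe_dir, decoder):
    """Best cosine alignment between a probe direction and any SAE decoder row."""
    probe = probe_dir / np.linalg.norm(probe_dir)
    rows = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sims = rows @ probe
    best = int(np.argmax(np.abs(sims)))
    return best, float(sims[best])

rng = np.random.default_rng(3)
d_model, n_features = 16, 64
decoder = rng.normal(size=(n_features, d_model))

# Planted case: the probe recovers a noisy copy of one decoder direction,
# modeling the situation where both methods find the same latent feature.
probe_dir = decoder[10] + 0.1 * rng.normal(size=d_model)

feature_id, cosine = max_cosine_match(probe_dir, decoder)
assert feature_id == 10
```

Applied to real probe weights and real SAE dictionaries on the same model, the distribution of such alignment scores would give a first quantitative answer to whether the two paradigms triangulate on a shared latent geometry.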

The neuroscience cluster introduces a further productive tension. Brain-aligned representations in vision transformers and language models have been interpreted primarily as evidence of shared computational principles — an intriguing result that has attracted significant cross-disciplinary attention [63, 64]. Work establishing that language model representations predict brain activity during language processing [65, 62] demonstrates that this alignment is quantitatively substantial, not merely qualitative. But mechanistic interpretability tools have almost never been turned toward the question of why certain models produce brain-aligned representations. If SAE features in a language model that achieves high neural predictivity differ systematically from those in a model that does not, that comparison would carry theoretical weight for both cognitive science and model design. The biological design principles reviewed in Section 6 — including the E/I balance principle [68], the local-to-distributed developmental trajectory [69], and the emergence of global structure from local learning rules [70] — provide specific mechanistic hypotheses that could be tested using circuit discovery methods [3, 14], potentially grounding the brain-model correspondence in shared computational constraints rather than mere representational similarity.

A further tension — one that cuts across all thematic clusters — concerns the relationship between explanation and causal understanding. The distinction between “explainability” and “causability” drawn by Holzinger et al. [89] is directly relevant here: a method may produce outputs that are technically faithful to a model’s computations yet fail to support the kind of causal reasoning that would allow a practitioner to predict model behavior in novel circumstances, diagnose a failure, or design a targeted intervention. Mechanistic interpretability aspires to more than explainability in this narrow sense — it aspires to causability, to an understanding of model internals that is rich enough to support genuine causal inference about what the model will do and why [4, 10]. The causal frameworks reviewed in Section 3 supply the formal criteria for this aspiration; whether current methods can satisfy those criteria at scale remains the field’s defining open question.

Alignment Threat Models and the Limits of Behavioral Evaluation

The most practically consequential implication of the findings reviewed here concerns the relationship between mechanistic interpretability and AI safety — specifically, the adequacy of behavioral evaluation for detecting alignment-relevant failures. Three converging lines of evidence suggest that behavioral testing alone is fundamentally insufficient, and that internal inspection must become a co-equal component of safety assurance.

First, formal results from computability theory establish that the inner alignment problem — whether a trained model has genuinely adopted the intended objective — is undecidable for arbitrary AI systems [30]. This is not a conjecture about practical difficulty but a proven impossibility result, established via reduction to Rice’s Theorem. The implication is stark: no general-purpose behavioral evaluation procedure can certify that an arbitrary model is aligned. Safety guarantees can only be obtained for architecturally constrained systems where alignment properties are compositionally inherited [30], or — as the mechanistic interpretability program proposes — for systems whose internal computations have been made sufficiently transparent to support targeted inspection [10]. The broader AI safety research community has converged on this diagnosis: the Singapore Consensus on global AI safety research priorities explicitly identifies deceptive alignment and goal misgeneralization as among the most critical open problems, precisely because both failure modes are behaviorally indistinguishable from aligned behavior under standard evaluation conditions [15].

Second, philosophical analysis of the dominant alignment training paradigm reveals structural vulnerabilities that behavioral testing is poorly positioned to detect. Millière [29] argues that reinforcement learning from human feedback (RLHF) produces only shallow behavioral dispositions rather than genuine normative reasoning capacities: a model trained to satisfy norms of helpfulness, honesty, and harmlessness under standard distributional conditions has not internalized those norms in a way that generalizes robustly to adversarial or novel contexts. Because these alignment norms can be placed in systematic conflict with one another — helpfulness weaponized against harmlessness, or honesty against safety — models trained to satisfy them behaviorally lack the normative resources to navigate such conflicts coherently [29]. Adversarial jailbreaks exploit precisely these inter-norm tensions [34], and critically, more capable reasoning models may not improve this situation and may in fact worsen it by enabling novel attack surfaces that subvert deliberative processes before they reach a conclusion [29]. This analysis implies that the assumption that more capable models will be more reliably aligned is not merely unproven but potentially inverted in important respects.

Third, the empirical literature on hallucination and manipulation reveals behavioral failure modes whose mechanistic underpinnings are increasingly tractable to investigation. Comprehensive survey work has taxonomized hallucination into three analytically distinct categories — input-conflicting, context-conflicting, and fact-conflicting — each with distinct causes and remediation requirements [31]. Mitigation through RLHF with honesty-oriented reward functions can substantially improve factual accuracy, raising TruthfulQA performance from approximately 30% to 60%, while retrieval-augmented generation reduces reliance on unreliable parametric memory [31]. Yet benchmark fragility remains a persistent problem: improvements measured in controlled evaluations may not generalize to open-domain deployment. More fundamentally, hallucination taxonomies describe the outputs of failure; mechanistic interpretability aims to identify the internal structures responsible, enabling more targeted and robust remediation than behavioral patching alone can provide. The developmental interpretability findings reviewed in Section 4 add a temporal dimension to this analysis: the observation that smaller models become trapped in attractors favoring grammatical but hallucinated outputs [17] suggests that hallucination propensity may be diagnosable from a model’s position on its developmental trajectory, before it manifests in deployment.

A subtler and arguably more concerning class of risk involves the possibility that AI systems learn manipulative behaviors without explicit designer intent. Carroll and colleagues [97] provide a careful conceptual analysis identifying four dimensions along which manipulation must be characterized — incentives, intent, covertness, and harm — and demonstrate that current measurement frameworks cannot reliably detect whether deployed systems are engaging in manipulation. The covertness dimension raises particular concerns about epistemic autonomy: if users cannot detect that they are being influenced, they cannot exercise meaningful agency over the interaction [97]. Independent work on AI monitorability reinforces this concern, demonstrating that the behavioral outputs of complex AI systems are systematically difficult to interpret in ways that would support reliable detection of covert goal-pursuit [52]. This conceptual framework clarifies what would need to be measured to assess manipulation risk, but also makes explicit that no adequate measurement framework yet exists — a gap that mechanistic interpretability is uniquely positioned to address, since manipulation-relevant goals and strategies may be detectable in internal representations even when they are not expressed in observable outputs. The developmental interpretability perspective adds a further diagnostic possibility: if manipulative strategies emerge through identifiable circuit formation events during training, as the grokking literature demonstrates is the case for other capabilities [16], then monitoring training dynamics may provide early warning of such strategies before they are behaviorally refined.

These findings collectively motivate a reframing of mechanistic interpretability’s role in the safety ecosystem. Rather than a supplementary diagnostic tool applied after behavioral evaluation has been completed, internal inspection should be understood as a necessary complement that targets failure modes — deceptive alignment, goal misgeneralization, latent manipulation strategies — that behavioral testing is formally and practically unable to detect [15]. The development of operational safety infrastructure, including structured risk taxonomies and lightweight guardrail models trained on real-world interaction data [98], represents important progress in deployable safety measures. But guardrail models operate at the behavioral level and inherit the limitations of behavioral evaluation: they can intercept harmful outputs without diagnosing whether the model that produced them is pursuing misaligned objectives. A comprehensive safety pipeline must therefore integrate behavioral guardrails with both static mechanistic inspection and longitudinal training diagnostics, using circuit analysis [10], probing classifiers [39], activation steering [34], and developmental monitoring not as research curiosities but as required auditing instruments.

Theoretical Grounding: Why Representations Are Structured

A question that has remained largely implicit throughout the empirical literature — why neural networks develop structured, interpretable internal representations in the first place — has recently received formal theoretical treatment that bears directly on the coherence of the mechanistic interpretability program. Ruffini and colleagues [99] approach this question through a framework combining algorithmic information theory with Lie group theory, arguing that any agent that successfully tracks world states must internalize the symmetries of the world’s generative model. Formally, such agents exhibit dynamics confined to reduced hierarchical manifolds within the full state space, with stability guaranteed by a Lyapunov function and conserved quantities emerging as natural consequences of symmetry inheritance [99]. This provides a principled, rather than merely empirical, account of recurring observations in interpretability research — the manifold hypothesis [24], hierarchical feature organization [3], linear representational structure [40] — that had lacked deep theoretical grounding. If structured representations are mathematical necessities for effective world-modeling rather than incidental properties of particular training regimes, then the mechanistic interpretability program’s search for organized internal structure is not merely a hopeful heuristic but a theoretically well-motivated enterprise [99]. The practical implication is that failures to find interpretable structure in a model’s internals may constitute evidence that the model has not achieved genuine world-tracking — a prediction with testable consequences for the relationship between interpretability and capability.

The mathematical perspective on transformer dynamics [51] provides a complementary theoretical grounding, formalizing how the gradient flow structure of self-attention drives token representations through predictable clustering and convergence trajectories — a result that connects continuous dynamical systems theory to the discrete computational steps of trained transformers. Together with the agent-theoretic account of why structured representations are necessary [99] and the empirical demonstration that such representations emerge spontaneously from sequence prediction [46], these results supply the field with increasingly rigorous foundations for its central assumption: that there is interpretable structure to find, and that the search for it is not merely a matter of methodological optimism but of theoretical expectation.

Implications for Theory and Practice

For practitioners deploying large language models, the accumulating interpretability findings carry a specific cautionary implication: the representational organization revealed by recent work is far more distributed, context-sensitive, and layer-dependent than architectural descriptions alone would suggest. Failure modes documented in probing studies — including sensitivity to surface features [39], inconsistent coreference representations [100], and brittle compositional generalization [12] — now have mechanistic correlates that, in principle, support targeted interventions. The challenge for practitioners is that these interventions currently require interpretability expertise that does not scale to routine deployment workflows. Activation steering offers one path toward more accessible intervention [33], but its use as a safety measure must be tempered by the recognition that suppressing undesired behavior is not equivalent to verifying alignment [10, 29].
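Activation steering itself admits a compact illustration. The sketch below uses the common difference-of-means construction on synthetic "activations" (all data and dimensions are invented); in practice the two sets would be residual-stream activations collected from contrastive prompts:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model = 8

# Toy activation sets for two contrasting behaviors. In a real setting
# these would come from forward passes on paired prompts.
acts_a = rng.normal(loc=1.0, size=(50, d_model))
acts_b = rng.normal(loc=-1.0, size=(50, d_model))

# Difference-of-means steering vector, as used in activation-addition methods.
steer = acts_a.mean(axis=0) - acts_b.mean(axis=0)

def apply_steering(resid, alpha=1.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return resid + alpha * steer

x = rng.normal(size=d_model)
steered = apply_steering(x, alpha=2.0)

# Steering provably moves the activation along the behavior-"a" direction,
# but note: shifting behavior this way suppresses or amplifies a tendency
# without verifying what objective the underlying circuitry encodes.
unit = steer / np.linalg.norm(steer)
assert steered @ unit > x @ unit
```

The final comment restates the caution in the text: steering is an intervention on behavior, not a verification of alignment.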

The developmental interpretability findings carry additional practical implications. If models traverse a structured developmental curriculum during training [17], and if generalization corresponds to identifiable circuit formation events [16], then practitioners should treat training dynamics as a first-class diagnostic signal rather than merely monitoring aggregate loss curves. The finding that behavioral metrics systematically lag behind structural ones [16] is particularly consequential: a model may have already formed the internal circuitry for a capability — including, potentially, a hazardous one — well before that capability becomes reliably observable in behavioral outputs. This lag has direct implications for how training pipelines should be instrumented. Checkpointing strategies oriented around loss milestones or behavioral benchmarks alone may miss the window during which circuit formation occurs and intervention is most tractable [16, 38]; structural monitoring, by contrast, could in principle flag the consolidation of capability-relevant circuits as it happens, enabling earlier and more targeted responses. The practical barrier is that the methods currently available for such monitoring — checkpoint-by-checkpoint probing [39], activation trajectory analysis, and SAE decomposition at successive training steps — remain computationally intensive and require interpretability expertise that is not yet embedded in standard MLOps infrastructure. Closing this gap between research capability and operational practice is among the most pressing near-term engineering challenges the field faces.
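A minimal version of such checkpoint-by-checkpoint monitoring can be sketched with a simulated training run (everything below is synthetic; the growing "signal strength" stands in for circuit consolidation, and the probe is a simple difference-of-means classifier rather than any particular published method):

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, n_samples = 16, 200

# Simulated training run: a task-relevant direction strengthens across
# checkpoints, standing in for gradual circuit formation.
signal_dir = rng.normal(size=d_model)
signal_dir /= np.linalg.norm(signal_dir)
labels = rng.integers(0, 2, size=n_samples)

def checkpoint_activations(strength):
    """Activations at a simulated checkpoint with a given signal strength."""
    noise = rng.normal(size=(n_samples, d_model))
    return noise + strength * np.outer(2 * labels - 1, signal_dir)

def probe_accuracy(acts):
    """Fit a difference-of-means linear probe and score it in-sample."""
    mu1 = acts[labels == 1].mean(axis=0)
    mu0 = acts[labels == 0].mean(axis=0)
    w = mu1 - mu0
    threshold = 0.5 * (mu1 + mu0) @ w
    preds = (acts @ w > threshold).astype(int)
    return float((preds == labels).mean())

# Probe accuracy tracked checkpoint-by-checkpoint rises as the structure
# consolidates — a structural signal that can lead aggregate loss curves.
accuracies = [probe_accuracy(checkpoint_activations(s)) for s in (0.0, 0.5, 1.0, 2.0)]
assert accuracies[-1] > accuracies[0]
```

Even this stylized version shows why longitudinal probing belongs alongside loss monitoring: the structural signal is a direct function of representational organization, not of task performance.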

The Field’s Trajectory and Its Most Urgent Priorities

Stepping back from the individual findings and cross-theme tensions documented in this review, a coherent arc is visible in the field’s overall development. Mechanistic interpretability began as an attempt to move beyond the correlational character of classical attribution methods [6, 1], including gradient-based saliency techniques [7] and probing classifiers [39, 12]. The trajectory traced across the preceding sections — from probing and gradient-based attribution through circuit discovery [14, 36, 3], SAE decomposition [18, 42], causal representation learning, and developmental analysis — reflects genuine cumulative progress toward that goal. The field has moved from asking what information is encoded in neural representations to asking how it is computed, when it is formed, and whether identified computational structures satisfy the independence and manipulability criteria that genuine mechanistic accounts require [4, 10]. This is a meaningful transition, not merely a rhetorical one.

Yet the synthesis presented here also makes clear that the transition is incomplete, and that several specific gaps represent the highest-priority targets for the next phase of research. The causal sufficiency question (whether circuit-discovery methods identify the circuits models actually use, or merely plausible reconstructions of them) demands evaluation protocols grounded in the Independent Causal Mechanism principle [47] and the coordinated discovery strategy [4] rather than single-method interventions. The methodological divergence between probing and SAE research [18, 96], applied to different model families with limited cross-comparison, means that the field currently lacks a unified picture of what its two most productive paradigms jointly reveal; systematic comparative studies are needed. The theoretical account of why structured representations arise [99], combined with the empirical demonstration that they emerge from sequence prediction [46] and the mathematical formalization of the dynamics that produce them [51], provides a foundation for connecting interpretability findings to capability theory, a connection that remains underexplored. Finally, the safety-relevant implications that behavioral evaluation is formally insufficient [30] and that structural diagnostics can detect latent capabilities before they manifest behaviorally [16] demand longitudinal interpretability tools, such as those tracking feature formation across training checkpoints [17], that are computationally cheap enough to integrate into training pipelines at scale.

The distinction between explainability and causability [89], applied to mechanistic interpretability as a whole, offers a useful summary frame for these priorities: the field has achieved a level of explainability — it can produce increasingly detailed accounts of what structures exist and how they correlate with behavior [10] — but causability, in the sense of an understanding rich enough to support confident causal inference about model behavior in novel and adversarial circumstances, remains the horizon toward which the most consequential work is oriented. Reaching that horizon will require not only continued methodological innovation but sustained investment in the theoretical, comparative, and longitudinal research programs that this review identifies as structurally underserved. The stakes of that investment — for the scientific understanding of learned computation and for the safety of systems that increasingly mediate consequential decisions — are substantial enough to warrant treating mechanistic interpretability not as one subfield among many, but as a foundational enterprise for the responsible development of artificial intelligence [15, 10].


References

[1] Samek, W., Montavon, G., Lapuschkin, S., et al. (2021). Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications. Proceedings of the IEEE.
[2] Fan, F., Xiong, J., Li, M., et al. (2021). On Interpretability of Artificial Neural Networks: A Survey. IEEE Transactions on Radiation and Plasma Medical Sciences.
[3] Olah, C., Cammarata, N., Schubert, L., et al. (2020). Zoom In: An Introduction to Circuits. Distill.
[4] Kästner, L., Crook, B. (2024). Explaining AI through mechanistic interpretability. European Journal for Philosophy of Science.
[5] Yosinski, J., Clune, J., Nguyen, A., et al. (2015). Understanding Neural Networks Through Deep Visualization. arXiv (Cornell University).
[6] Montavon, G., Samek, W., Müller, K. (2017). Methods for Interpreting and Understanding Deep Neural Networks. Digital Signal Processing.
[7] Sundararajan, M., Taly, A., Yan, Q. (2017). Axiomatic Attribution for Deep Networks. arXiv (Cornell University).
[8] Samek, W., Binder, A., Montavon, G., et al. (2016). Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems.
[9] Calderon, N., Reichart, R. (2024). On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs.
[10] Bereska, L., Gavves, E. (2024). Mechanistic Interpretability for AI Safety: A Review. arXiv (Cornell University).
[11] Jawahar, G., Sagot, B., Seddah, D. (2019). What does BERT learn about the structure of language?
[12] Tenney, I., Das, D., Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline.
[13] Voita, E., Talbot, D., Moiseev, F., et al. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.
[14] Wang, K., Variengien, A., Conmy, A., et al. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. arXiv (Cornell University).
[15] Bengio, Y., Tegmark, M., Russell, S., et al. (2025). The Singapore Consensus on Global AI Safety Research Priorities: Building a Trustworthy, Reliable and Secure AI Ecosystem. SuperIntelligence - Robotics - Safety & Alignment.
[16] Nanda, N., Chan, L., Lieberum, T., et al. (2022). Progress Measures for Grokking via Mechanistic Interpretability. arXiv (Cornell University).
[17] Xia, M., Artetxe, M., Zhou, C., et al. (2023). Training Trajectories of Language Models Across Scales.
[18] Cunningham, H., Ewart, A., Riggs, L., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv (Cornell University).
[19] Dunefsky, J., Chlenski, P., Nanda, N. (2024). Transcoders Find Interpretable LLM Feature Circuits. arXiv (Cornell University).
[20] Lieberum, T., Rajamanoharan, S., Conmy, A., et al. (2024). Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.
[21] Peters, M., Neumann, M., Zettlemoyer, L., et al. (2018). Dissecting Contextual Word Embeddings: Architecture and Representation.
[22] Schwalbe, G., Finzel, B. (2023). A comprehensive taxonomy for explainable artificial intelligence: a systematic survey of surveys on methods and concepts. Data Mining and Knowledge Discovery.
[23] Burkart, N., Huber, M. (2020). A Survey on the Explainability of Supervised Machine Learning. Journal of Artificial Intelligence Research.
[24] Elhage, N., Hume, T., Olsson, C., et al. (2022). Toy Models of Superposition. arXiv (Cornell University).
[25] Geva, M., Bastings, J., Filippova, K., et al. (2023). Dissecting Recall of Factual Associations in Auto-Regressive Language Models.
[26] He, Z., Shu, W., Ge, X., et al. (2024). Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders. arXiv (Cornell University).
[27] Xu, B., Yang, G. (2025). Interpretability research of deep learning: A literature survey. Information Fusion.
[28] Lü, H., Setiono, R., Liu, H. (1995). NeuroRule: A Connectionist Approach to Data Mining. arXiv (Cornell University).
[29] Millière, R. (2025). Normative conflicts and shallow AI alignment. Philosophical Studies.
[30] de Melo, G., Máximo, M., Soma, N., et al. (2025). Machines that halt resolve the undecidability of artificial intelligence alignment. Scientific Reports.
[31] Zhang, Y., Li, Y., Cui, L., et al. (2025). Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Computational Linguistics.
[32] Kazlaris, I., Antoniou, E., Diamantaras, K., et al. (2025). From Illusion to Insight: A Taxonomic Survey of Hallucination Mitigation Techniques in LLMs. Preprints.org.
[33] Stolfo, A., Balachandran, V., Yousefi, S., et al. (2024). Improving Instruction-Following in Language Models through Activation Steering.
[34] Bhattacharjee, A., Ghosh, S., Rebedea, T., et al. (2024). Towards Inference-time Category-wise Safety Steering for Large Language Models. arXiv (Cornell University).
[35] Stolfo, A., Belinkov, Y., Sachan, M. (2023). A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis.
[36] Hsu, A., Cherapanamjeri, Y., Odisho, A., et al. (2024). Efficient Automated Circuit Discovery in Transformers Using Contextual Decomposition. arXiv (Cornell University).
[37] Jiang, Y., Dale, R. (2025). Mapping the learning curves of deep learning networks. PLoS Computational Biology.
[38] Liu, Z., Gan, E., Tegmark, M. (2023). Seeing Is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability. Entropy.
[39] Kunz, J. (2024). Understanding Large Language Models: Towards Rigorous and Targeted Interpretability Using Probing Classifiers and Self-Rationalisation. Linköping Studies in Science and Technology, Dissertations.
[40] Li, Y., Michaud, E., Baek, D., et al. (2025). The Geometry of Concepts: Sparse Autoencoder Feature Structure. Entropy.
[41] Makelov, A., Lange, G., Nanda, N. (2024). Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control. arXiv (Cornell University).
[42] Gao, L., la Tour, T., Tillman, H., et al. (2024). Scaling and evaluating sparse autoencoders. arXiv (Cornell University).
[43] Kissane, C., Krzyzanowski, R., Bloom, J., et al. (2024). Interpreting Attention Layer Outputs with Sparse Autoencoders. arXiv (Cornell University).
[44] Gujral, O., Bafna, M., Alm, E., et al. (2025). Sparse autoencoders uncover biologically interpretable features in protein language model representations. Proceedings of the National Academy of Sciences.
[45] Dai, D., Dong, L., Hao, Y., et al. (2021). Knowledge Neurons in Pretrained Transformers. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[46] Nanda, N., Lee, A., Wattenberg, M. (2023). Emergent Linear Representations in World Models of Self-Supervised Sequence Models.
[47] Schölkopf, B., Locatello, F., Bauer, S., et al. (2021). Toward Causal Representation Learning. Proceedings of the IEEE.
[48] Harradon, M., Druce, J., Ruttenberg, B. (2018). Causal Learning and Explanation of Deep Neural Networks via Autoencoded Activations. arXiv (Cornell University).
[49] Shmueli, G. (2010). To Explain or to Predict? Statistical Science.
[50] Deng, Z., Tian, H., Zheng, X., et al. (2025). Deep Causal Learning: Representation, Discovery and Inference. ACM Computing Surveys.
[51] Geshkovski, B., Letrouit, C., Polyanskiy, Y., et al. (2024). A Mathematical Perspective on Transformers. Bulletin of the American Mathematical Society.
[52] Yampolskiy, R. (2024). On monitorability of AI. AI and Ethics.
[53] Hou, Y., Li, J., Fei, Y., et al. (2023). Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models.
[54] Ostrow, M., Eisen, A., Kozachkov, L., et al. (2023). Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis. arXiv (Cornell University).
[55] Lusch, B., Kutz, J., Brunton, S. (2018). Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications.
[56] Kriegeskorte, N. (2015). Deep Neural Networks: A New Framework for Modeling Biological Vision and Brain Information Processing. Annual Review of Vision Science.
[57] Mehrer, J., Spoerer, C., Kriegeskorte, N., et al. (2020). Individual differences among deep neural network models. Nature Communications.
[58] Yang, G., Wang, X. (2020). Artificial Neural Networks for Neuroscientists: A Primer. Neuron.
[59] Marblestone, A., Wayne, G., Körding, K. (2016). Towards an integration of deep learning and neuroscience. arXiv (Cornell University).
[60] Saxe, A., Nelli, S., Summerfield, C. (2020). If deep learning is the answer, what is the question? Nature Reviews Neuroscience.
[61] Schrimpf, M., Blank, I., Tuckute, G., et al. (2021). The neural architecture of language: Integrative modeling converges on predictive processing.
[62] Kumar, S., Sumers, T., Yamakoshi, T., et al. (2022). Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model.
[63] Schrimpf, M., Blank, I., Tuckute, G., et al. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences.
[64] Goldstein, A., Zada, Z., Buchnik, E., et al. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience.
[65] Antonello, R., Turek, J., Vo, V., et al. (2021). Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses. arXiv (Cornell University).
[66] Vyas, S., Golub, M., Sussillo, D., et al. (2020). Computation Through Neural Population Dynamics. Annual Review of Neuroscience.
[67] Hasson, U., Nastase, S., Goldstein, A. (2019). Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks. Neuron.
[68] Zhou, S., Yu, Y. (2018). Synaptic E-I Balance Underlies Efficient Neural Coding. Frontiers in Neuroscience.
[69] Fair, D., Cohen, A., Power, J., et al. (2009). Functional Brain Networks Develop from a “Local to Distributed” Organization. PLoS Computational Biology.
[70] George, T., Barry, C., Stachenfeld, K., et al. (2024). A generative model of the hippocampal formation trained with theta driven local learning rules.
[71] Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences.
[72] Unknown (2018). Representation in Cognitive Science.
[73] Craven, M., Shavlik, J. (1994). Using Sampling and Queries to Extract Rules from Trained Neural Networks. Elsevier eBooks.
[74] Hassija, V., Chamola, V., Mahapatra, A., et al. (2023). Interpreting Black-Box Models: A Review on Explainable Artificial Intelligence. Cognitive Computation.
[75] Arras, L., Horn, F., Montavon, G., et al. (2017). “What is relevant in a text document?”: An interpretable machine learning approach. PLoS ONE.
[76] Bach, S., Binder, A., Montavon, G., et al. (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE.
[77] Hoffman, R., Miller, T., Mueller, S., et al. (2018). Explaining Explanation, Part 4: A Deep Dive on Deep Nets. IEEE Intelligent Systems.
[78] Broniatowski, D. (2021). Psychological Foundations of Explainability and Interpretability in Artificial Intelligence.
[79] Hoffman, R., Mueller, S., Klein, G., et al. (2023). Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance. Frontiers in Computer Science.
[80] Yang, W., Wei, Y., Wei, H., et al. (2023). Survey on Explainable AI: From Approaches, Limitations and Applications Aspects. Human-Centric Intelligent Systems.
[81] Nanfack, G., Fulleringer, A., Marty, J., et al. (2024). Adversarial Attacks on the Interpretation of Neuron Activation Maximization. Proceedings of the AAAI Conference on Artificial Intelligence.
[82] Antoniadi, A., Du, Y., Guendouz, Y., et al. (2021). Current Challenges and Future Opportunities for XAI in Machine Learning-Based Clinical Decision Support Systems: A Systematic Review. Applied Sciences.
[83] Gunning, D., Aha, D. (2019). DARPA’s Explainable Artificial Intelligence Program. AI Magazine.
[84] Ribeiro, M., Singh, S., Guestrin, C. (2016). “Why Should I Trust You?” Explaining the Predictions of Any Classifier.
[85] Lundberg, S., Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. arXiv (Cornell University).
[86] Zschech, P., Weinzierl, S., Kraus, M. (2025). Inherently Interpretable Machine Learning: A Contrasting Paradigm to Post-hoc Explainable AI. Business & Information Systems Engineering.
[87] Barbierato, E., Gatti, A. (2024). The Challenges of Machine Learning: A Critical Review. Electronics.
[88] Alzubaidi, L., Zhang, J., Humaidi, A., et al. (2021). Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data.
[89] Holzinger, A., Langs, G., Denk, H., et al. (2019). Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
[90] Zhao, H., Chen, H., Yang, F., et al. (2024). Explainability for Large Language Models: A Survey. ACM Transactions on Intelligent Systems and Technology.
[91] Schneider, J. (2024). Explainable Generative AI (GenXAI): a survey, conceptualization, and research agenda. Artificial Intelligence Review.
[92] Feldman, V., Zhang, C. (2020). What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation. arXiv (Cornell University).
[93] Blanco-Justicia, A., Jebreel, N., Manzanares-Salor, B., et al. (2025). Digital forgetting in large language models: a survey of unlearning methods. Artificial Intelligence Review.
[94] Singh, J., Wallat, J., Anand, A. (2020). BERTnesia: Investigating the capture and forgetting of knowledge in BERT.
[95] Muñoz-Ordóñez, J., Cobos, C., Vidal-Rojas, J., et al. (2025). A Maturity Model for Practical Explainability in Artificial Intelligence-Based Applications: Integrating Analysis and Evaluation (MM4XAI-AE) Models. International Journal of Intelligent Systems.
[96] Lieberum, T., Rajamanoharan, S., Conmy, A., et al. (2024). Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv (Cornell University).
[97] Carroll, M., Chan, A., Ashton, H., et al. (2023). Characterizing Manipulation from AI Systems.
[98] Ghosh, S., Varshney, P., Sreedhar, M., et al. (2024). AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails.
[99] Ruffini, G., Castaldo, F., Vohryzek, J. (2025). Structured Dynamics in the Algorithmic Agent. Entropy.
[100] Elazar, Y., Kassner, N., Ravfogel, S., et al. (2021). Measuring and Improving Consistency in Pretrained Language Models. Transactions of the Association for Computational Linguistics.