Literature Review: Multimodal Grounding: Vision, Language, and the Symbol Grounding Problem

This systematic review synthesizes evidence from 58 studies (1990-2025) on whether visual grounding enables genuine linguistic understanding in AI, addressing the symbol grounding problem. It finds that while visual-linguistic alignment is learnable at scale, it remains statistically fragile and uneven, benefiting concrete concepts most while failing to support abstract reasoning. Critically, no evidence confirms that multimodal training yields understanding categorically superior to that achieved by unimodal language models. Instead, systematic grounding failures, particularly in compositional generalization, are reframed as key diagnostics revealing the gap between associative learning and structured conceptual knowledge. Mechanistic analyses trace these failures to specific architectural bottlenecks, such as text encoder compression and perceptual entanglement, rather than undifferentiated shortcomings. The review concludes that the field urgently needs evaluation frameworks combining behavioral diagnostics with mechanistic interpretability to distinguish genuine comprehension from pattern matching, advocating that grounding failures be treated as primary research signals.

1. Introduction

The question of whether computational systems can genuinely understand the words they process has occupied philosophers, cognitive scientists, and artificial intelligence researchers for decades. This issue gained formal articulation in the early 1990s through the symbol grounding problem [1, 2], which asked how the internal symbols of a cognitive or computational system could acquire meaning rather than remaining an unanchored formal calculus. At that time, the problem was largely theoretical, posing a challenge to purely symbolic approaches to artificial intelligence and cognition. Today, the same question returns with renewed urgency and considerably higher practical stakes. The emergence of large-scale vision-language models — systems trained on paired images and text across hundreds of millions or billions of examples [3, 4] — has produced capabilities that superficially suggest genuine semantic understanding, yet has simultaneously generated empirical anomalies that resist easy interpretation. Whether visual experience, even in the attenuated statistical form available to these systems, is sufficient to ground linguistic meaning remains one of the most contested and consequential open questions at the intersection of cognitive science and machine learning [5, 6]. This systematic review examines fifty-eight studies spanning 1990 to 2025 to synthesize what is currently known, what remains unresolved, and what the field’s own failures reveal about the nature of machine understanding.

The backdrop against which this review is written is distinctive. For most of the period under review, vision and language were treated as largely separate research domains, each with its own benchmarks, architectures, and theoretical commitments. Within the past several years, that separation has collapsed. Transformer-based architectures have enabled a new class of models that process images and text within unified representational frameworks [7, 8], and the benchmarks designed to evaluate them have proliferated correspondingly [9, 10]. The resulting literature is simultaneously technically advanced and theoretically underspecified. Impressive performance on visual question answering [9], image captioning, and cross-modal retrieval coexists with systematic failures on tasks that would seem trivially easy for any agent with genuine perceptual understanding [11, 12]. This tension — between capability and comprehension, between benchmark success and conceptual opacity — is the organizing problem of the present review. This tension is further sharpened by a parallel body of work on compositional generalization, which demonstrates that neural models routinely achieve near-perfect in-distribution accuracy while failing catastrophically on novel structural combinations [13, 14]. This pattern recurs with striking regularity across both linguistic and multimodal settings [15], providing formal diagnostic tools for understanding the specific nature of grounding failures. A convergent strand of mechanistic interpretability research, applying probing classifiers [16], modality contribution analysis [17], and representational diagnostics to vision-language models, has begun to reveal the specific architectural loci at which grounding breaks down — transforming the interpretation of failure from behavioural observation to causal explanation.

Four research questions guide this synthesis. First, how do multimodal models learn to align visual and linguistic representations, and what computational and architectural choices govern the quality of that alignment [4]? Second, does visual grounding change how models process language, and does any such change differ between concrete and abstract vocabulary [5, 18]? Third, what evidence, taken across paradigms and methods, bears on whether multimodal training produces understanding that is qualitatively different from that achieved through language alone? Fourth, what do grounding failures in multimodal systems reveal about the nature and limits of model knowledge [11, 12]?

The review is organized around six thematic areas. The first addresses the theoretical and cognitive science foundations of the symbol grounding problem [1, 19], including embodied and enactivist critiques that have complicated the original formulation [6, 20]. The second examines robotic and embodied AI implementations [2], where grounding has been pursued not through statistical association but through sensorimotor interaction with physical environments. The third theme turns to multimodal distributional semantics [18], analyzing how the incorporation of visual and acoustic signals into linguistic representations changes the geometry and interpretability of semantic space. The fourth covers the technical infrastructure of the vision-language field — architectures, pretraining strategies, datasets, and evaluation benchmarks [3, 4] — as a necessary foundation for interpreting the empirical findings that follow. The fifth and most critically pointed theme examines grounding failures: spurious correlations, compositional errors [15], systematic misalignments between model behavior and putative understanding [10], and — drawing on mechanistic interpretability research [17, 16] — the representational and architectural bottlenecks that underlie these failures. The sixth addresses domain-specific applications in high-stakes settings where the difference between surface-level competence and genuine grounding carries practical consequences.

The scope of this review is deliberately cross-disciplinary. Philosophy of mind, cognitive linguistics, computational neuroscience, robotics, and modern deep learning all bear on the central questions, and no single disciplinary vantage point is sufficient [19, 21]. The review does not adjudicate between competing philosophical positions on consciousness or intentionality, but it does take seriously the empirical constraints that recent machine learning research places on theoretical claims. What this moment in the literature demands is not another empirical contribution but a synthetic account of where the field stands: what has been established, what has been assumed without adequate examination, and what the accumulating evidence of model failure obliges researchers to reconsider.

2. Methodology

The methodological choices that shape a synthetic review are not merely procedural: they determine which evidence is brought to bear on which questions, and therefore which conclusions are available. For a topic as conceptually contested as multimodal grounding — where the same experimental result can be read as evidence for or against genuine understanding depending on the theoretical framework applied — the integrity of the corpus matters as much as the quality of individual studies. The decisions documented here were accordingly made to serve analytical coherence rather than retrieval volume.

Search and Candidate Filtering

The literature was identified through a structured search of the OpenAlex database, guided by five targeted queries designed to capture the principal conceptual threads of the topic. These queries addressed multimodal grounding as a survey domain [18, 4], visual-linguistic representation alignment in neural architectures [8, 7], cross-modal learning and grounding failures [11, 10], multimodal transformer approaches to visual and abstract language processing [3, 4], and the symbol grounding problem as situated within broader artificial intelligence and multimodal systems research [1, 19]. The queries were constructed to triangulate the intersection of philosophical and computational concerns that define this field, rather than to exhaustively retrieve all adjacent literature — a deliberate prioritisation of conceptual precision over retrieval breadth.

The initial search returned 197 candidate papers, from which 30 were retained following relevance scoring against a threshold of 0.6. This threshold was chosen to ensure that retained papers bore substantive rather than incidental relationships to the core review topic. Citation network expansion was also attempted but yielded no additional papers beyond those already identified through keyword search; the final corpus therefore reflects keyword-based discovery exclusively.

Quality filtering balanced scholarly weight with currency. Older papers were required to have accumulated a minimum of five citations, providing a baseline indicator of community reception; a recency quota simultaneously stipulated that at least 35% of the final corpus derive from within a two-year window, ensuring responsiveness to the rapid pace of development in multimodal learning research [3, 4]. The date range of retained papers spans 1990 to 2025 — a breadth that reflects both the historical depth of the symbol grounding problem as a theoretical concern [1, 2] and its contemporary reframing in neural and transformer-based architectures [4, 8].
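As a concrete illustration, the stated filters (relevance at or above 0.6, a five-citation floor for older papers, and a 35% two-year recency quota) can be expressed as a short procedure. The `Paper` fields, the exemption of recent papers from the citation floor, and the relevance scores themselves are illustrative assumptions, not the actual pipeline used for this review:

```python
from dataclasses import dataclass

@dataclass
class Paper:
    year: int
    citations: int
    relevance: float  # hypothetical relevance score in [0, 1]

def filter_corpus(papers, current_year=2025,
                  relevance_threshold=0.6,
                  min_citations_older=5,
                  recency_window=2,
                  recency_quota=0.35):
    """Apply the review's stated filters: relevance >= 0.6, and a
    five-citation floor for papers outside the recency window. Then
    check whether >= 35% of the retained corpus is recent."""
    retained = [
        p for p in papers
        if p.relevance >= relevance_threshold
        and (p.citations >= min_citations_older
             or current_year - p.year <= recency_window)
    ]
    recent = [p for p in retained
              if current_year - p.year <= recency_window]
    quota_met = bool(retained) and len(recent) / len(retained) >= recency_quota
    return retained, quota_met
```

The quota here is checked after filtering rather than enforced during selection; a real pipeline would likely iterate between the two.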

Supplementary Corpus Expansions

Two targeted supplementary expansions were undertaken to address interpretive gaps that emerged during full-text analysis, with each expansion motivated by a specific analytical need.

The first expansion was undertaken when it became apparent that grounding failures documented across the initial corpus could not be adequately interpreted without reference to the parallel literature on compositional and systematic generalization in neural networks. This body of work — spanning controlled linguistic benchmarks [13, 14], formal analyses of transformer expressivity [22], meta-learning approaches to systematicity [23], and neuro-symbolic methods [24] — provides the formal framework for diagnosing which specific compositional capacities multimodal models lack, rather than treating grounding failure as an undifferentiated phenomenon. For instance, benchmarks such as COGS [13] expose systematic gaps between in-distribution and out-of-distribution compositional performance, while analyses of transformer architectures reveal that structural biases — such as attention constraints and positional encodings — must often be explicitly engineered to support generalization [22]. Complementary work on algebraic recombination further clarifies how systematic compositionality can be scaffolded through learned structural operations [25]. A targeted search yielded 12 additional papers addressing compositional generalization benchmarks, architectural analyses of systematic generalization, in-context learning [26], neuro-symbolic methods, and the cognitive science foundations of compositionality [19, 5].

The second expansion targeted the emerging literature on mechanistic interpretability and representational probing of vision-language models. While the compositional generalization literature clarifies what models fail to do, this body of work — encompassing probing classifiers [16], representational similarity analysis [27, 18], modality contribution metrics [17], and controlled diagnostic benchmarks [10] — addresses why failures occur and where in model architectures they originate. Mechanistic interpretability methods have proven particularly valuable for localizing representational failures to specific layers and attention heads [28]. Recent work has further shown that text encoders themselves can bottleneck compositional alignment in contrastive vision-language models [15], and that large vision-language models exhibit systematic robustness failures tied to fundamental visual variations [11]. A targeted search yielded 15 additional papers spanning mechanistic probing of vision-language model representations, theoretical foundations of meaning and embodied cognition in the neural era, vision-language pre-training architectures and contrastive learning frameworks [7, 8], and cross-modal alignment in downstream applications. Both supplementary corpora were subjected to the same full-text analysis and thematic assignment procedures as the primary corpus. The combined corpus comprises 58 studies.
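The probing-classifier methodology referenced above can be sketched in a few lines: a simple linear readout is trained on frozen representations to test whether a property is linearly decodable. Everything here — the embedding dimensionality, the synthetic vectors standing in for model activations, and the binary property being decoded — is an illustrative assumption, not a reproduction of any cited study's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen vision-language embeddings: 64-d vectors in
# which one direction hypothetically encodes a binary property (e.g. whether
# an image contains a named object). Invented data, not real model output.
n, d = 400, 64
concept_axis = rng.normal(size=d)
concept_axis /= np.linalg.norm(concept_axis)
labels = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, concept_axis) * 1.5

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe on frozen features: if a simple linear
    readout recovers the property, it is linearly decodable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on weights
        b -= lr * float(np.mean(p - y))         # gradient step on bias
    return w, b

w, b = train_linear_probe(X[:300], labels[:300])
test_acc = np.mean(((X[300:] @ w + b) > 0) == labels[300:].astype(bool))
```

In the probing literature, accuracy on held-out data is typically compared against a shuffled-label or random-embedding baseline to distinguish genuine encoding from probe memorization.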

Retrieval and Processing

All papers selected for inclusion underwent full-text analysis; no paper was processed from metadata or abstract alone. Three papers from the initial pool proved unretrievable, and in each case a replacement was drawn from the remaining candidates, preserving the intended corpus size and ensuring that analytical coverage was not reduced.

Thematic Organisation and Relationship to the Research Questions

The combined corpus was organised into six thematic clusters following full-text analysis, ensuring that thematic assignment reflected each paper’s substantive contribution rather than surface-level keyword overlap. This structure maps deliberately onto the review’s four research questions.

The clusters addressing theoretical foundations and embodied implementations speak primarily to the question of what genuine grounding requires. These clusters draw on philosophical and cognitive science criteria spanning grounded cognition [6], the symbol grounding problem [1], and symbol emergence in developmental and robotic systems [19].

The clusters covering multimodal distributional semantics and the technical infrastructure of vision-language systems address how representation alignment is achieved and what architectural choices govern its quality [4, 15].

The cluster on grounding failures — the most analytically central — bears directly on both the qualitative comparison between multimodal and unimodal training and the question of what failure reveals about the limits of model knowledge. This includes systematic benchmarking of where vision-language models fall short [10] and whether such models sustain coherent internal world representations [12].

Finally, the domain-applications cluster examines what is at stake when these limits go unacknowledged in practice [11].

Section 3 begins with the theoretical foundations, establishing the philosophical and cognitive science criteria against which subsequent empirical findings will be assessed.

3. The Symbol Grounding Problem: Theoretical Foundations and Cognitive Science Perspectives

The question of how symbols acquire meaning stands as one of the most enduring puzzles at the intersection of cognitive science, philosophy of mind, and artificial intelligence. At its core, this theme interrogates whether semantic content can emerge from syntactic manipulation alone, or whether meaning is irreducibly anchored to the embodied, sensorimotor experience of agents inhabiting a physical world. The papers surveyed here span a remarkable intellectual arc — from classical philosophical critique to computational robotics — yet remain bound by shared stakes: what must a system possess in order to genuinely understand, rather than merely process, the symbols it manipulates?

Classical Grounding and the Insufficiency of Formal Computation

The foundational provocation for this entire literature was articulated by [29], who formalized what has since become known as the symbol grounding problem. Harnad’s central argument was deceptively simple but philosophically devastating: in a physical symbol system, symbols are defined only in terms of other symbols, producing what he characterised as an ungrounded “merry-go-round” of circular reference. No amount of syntactic manipulation, however sophisticated, could on its own establish a genuine relationship between an abstract token and the real-world entity it purportedly denotes. The implications for classical AI were stark — a system that passes a Turing test by producing the right string outputs has not thereby demonstrated understanding [30]; meaning must be grounded somewhere outside the symbol system itself. Harnad proposed that grounding could be achieved through iconic and categorical representations derived from sensorimotor transduction [19], effectively insisting that perception must precede and underwrite symbol use. This proposal has since been elaborated in robotic and developmental systems [1, 2], where the challenge of anchoring symbols to perceptual signals has proven considerably more difficult in practice than Harnad’s outline suggested.

Harnad’s critique addressed the referential dimension of the symbol grounding problem — how individual symbols acquire denotational content. A parallel and equally influential challenge, articulated by Fodor and Pylyshyn in the late 1980s and recently engaged with renewed empirical and formal precision [22, 23, 31], targeted the compositional dimension: whether the distributed representations of connectionist systems could support the rule-governed recombination that characterises human linguistic productivity. A system that can process “John loves Mary” must, by virtue of genuine compositional capacity, also process “Mary loves John” — and this systematicity, they argued, requires structured representations rather than holistic associative patterns. Contemporary benchmarks designed to probe exactly this capacity — such as compositional generalisation tasks [13, 14] — confirm that even state-of-the-art neural architectures exhibit systematic failures when required to generalise to novel combinations of familiar constituents. The two critiques are logically independent but converge on a shared conclusion: that the appearance of linguistic competence in a computational system is insufficient evidence for genuine understanding [30], whether the deficit concerns how individual symbols acquire meaning or how meaningful symbols combine. Addressing the symbol grounding problem therefore requires satisfying both referential and compositional constraints — a dual requirement that, as subsequent sections will demonstrate, no existing multimodal architecture has fully met [15, 12].
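The held-out-combination logic behind such compositional generalisation benchmarks can be illustrated directly. The toy vocabulary and held-out pairs below are invented for exposition, not drawn from COGS or any cited benchmark:

```python
from itertools import product

# Toy compositional split: every word the model will be tested on appears
# during training, but two (adjective, noun) combinations are held out,
# so test success requires recombining familiar constituents.
adjectives = ["red", "blue", "small"]
nouns = ["cube", "ball", "cone"]
all_pairs = list(product(adjectives, nouns))

held_out = {("red", "cone"), ("small", "ball")}  # novel combinations only
train = [p for p in all_pairs if p not in held_out]
test = [p for p in all_pairs if p in held_out]

# Sanity check: no test word is out-of-vocabulary; only the pairing is new.
train_vocab = {word for pair in train for word in pair}
assert all(word in train_vocab for pair in test for word in pair)
```

Human-like systematicity predicts near-identical accuracy on `train` and `test` pairs; the benchmarks cited above repeatedly find that neural models instead collapse on the held-out combinations.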

This combined critique resonated broadly, but the mechanisms sketched by both Harnad and his interlocutors remained underspecified. How, precisely, does sensorimotor experience get recruited into higher-level conceptual thought [32], and how does compositional structure emerge in representations grounded in continuous perceptual signals? These questions set the stage for subsequent embodied cognition frameworks to provide more detailed mechanistic answers.

Contemporary Restatements: The Form-Meaning Distinction in the Neural Era

The classical grounding critiques originally targeted symbolic AI systems whose internal representations consisted of discrete tokens with designer-assigned meanings. The rise of large neural language models, trained on vast corpora through self-supervised objectives, might seem to sidestep these concerns by learning representations from data rather than receiving them by stipulation. However, [30] argued forcefully that this shift does not resolve the fundamental problem. Drawing a principled distinction between form — the observable, distributional patterns of linguistic tokens — and meaning, which they characterize as the relation between form and communicative intent anchored in a shared external world, Bender and Koller contended that a system trained exclusively on the statistical co-occurrence of linguistic symbols operates entirely within the formal domain. Such a system, they argued, has no principled mechanism for reaching beyond form to the referential and intentional content that constitutes genuine meaning. Their octopus thought experiment — in which an agent intercepts and learns to mimic human linguistic exchanges without any access to the communicative situations those exchanges describe — was designed to demonstrate that apparent communicative competence can be entirely attributed to the interpretive charity of human interlocutors rather than to genuine comprehension by the system [30]. This argument extends Harnad’s merry-go-round diagnosis into the specific empirical context of contemporary neural models: where Harnad showed that symbols defined only in terms of other symbols remain ungrounded [1, 19], Bender and Koller demonstrated that statistical patterns over symbol distributions remain equally ungrounded, however sophisticated the pattern extraction may be.

These arguments have direct implications for the present review. If meaning is not recoverable from form alone, then behavioral benchmarks measuring linguistic fluency or task performance cannot, by themselves, serve as evidence of understanding. Instead, such benchmarks may reflect the capacity to exploit statistical regularities in ways that satisfy human evaluators operating under default interpretive generosity [30]. This concern is borne out by systematic benchmark analyses: task-independent evaluations of vision-and-language models centered on core linguistic phenomena have found that even high-performing systems fail reliably on grounding-sensitive probes [10]. Furthermore, text encoders in contrastive multimodal architectures impose a systematic bottleneck on compositional grounding [15]. This reframing converts many apparently positive empirical results into ambiguous findings whose significance depends on deeper theoretical commitments about the grounding conditions for meaning. Crucially, the form-meaning distinction does not preclude the possibility that multimodal training — which pairs linguistic form with perceptual input — might supply the missing referential anchor. However, it does establish that the burden of proof falls on multimodal systems to demonstrate that visual co-training produces something more than richer statistical form [5]. This challenge, as the grounding failures documented in Section 7 suggest, has not yet been met.

Embodied Simulation as Cognitive Mechanism

By the early 2010s, embodied cognition research had matured sufficiently to offer not just philosophical critiques of classical AI but positive theoretical and computational proposals. [20] provided one of the clearest programmatic articulations of this agenda. Through a structured dialogue format, Pezzulo, Barsalou, and Cangelosi argued that embodied computational models must employ modal rather than amodal representations — representational formats that retain structural features of the perceptual and motor states from which they derive [6]. Crucially, these authors proposed simulation — the partial re-enactment of prior sensorimotor states — as the central mechanism bridging immediate perception and higher cognitive functions such as language comprehension, reasoning, and imagination. According to this view, understanding the word “kick” does not merely activate an abstract node in a semantic network; it recruits motor cortex representations associated with leg movement. This claim is supported by substantial neuroimaging and behavioral evidence demonstrating that action-word comprehension reliably engages sensorimotor cortices in a somatotopically organized manner [32, 33]. More recent consensus work confirms that embodied processing extends across multiple levels of linguistic granularity, from single words to discourse [34]. This framework thus deepened and operationalized Harnad’s original intuition, offering a specific mechanism rather than only a diagnosis.

A more radical extension of the embodied program was developed by [35], who integrated the Free Energy Principle and Active Inference framework with hierarchical predictive processing. This approach argues that genuine cognition requires not merely modal representations but embodied self-models — physically instantiated free energy gradients constitutively tied to an organism’s bodily states and agentic capacities. The broader Active Inference framework positions the brain as a prediction machine that minimizes surprise through cycles of action and perception [36, 37]. [35] pushes this account furthest by grounding consciousness and intentionality in self-referential somatic maps. On this account, beliefs and desires are not abstract representational states but patterns of probabilistic prediction and prediction-error that are scaffolded by somatic and motor processes. Consciousness itself emerges from self-referential body maps functioning as cybernetic controllers [35].

The consequences for assessing artificial systems are demanding: a vision-language model that processes perceptual and linguistic inputs without any mechanism analogous to sensorimotor grounding through cycles of action and perception would, on this framework, lack the substrates necessary for genuine intentional states. Empirical evaluations of such models bear this out — benchmarks probing compositional and physically grounded understanding reveal systematic failures consistent with the absence of active, embodied interaction [12, 10]. While multimodal training brings such systems closer to world-anchored representations than purely linguistic models, whether static perceptual inputs — absent the temporal dynamics of action, proprioception, and environmental feedback — are sufficient to replicate the grounding conditions that active inference requires remains an open and pressing question. This perspective resonates with the enactivist commitments of the robotics literature examined in Section 4, while establishing a more precise mechanistic account of why passive statistical association over image-text pairs may be structurally insufficient for grounding.

The Neuroimaging Challenge to Pure Embodiment

The embodied consensus faced significant complication from neuroimaging evidence marshalled in defence of a more abstractionist position. [38] conducted a meta-analysis of 87 neuroimaging studies and identified a consistent, left-lateralised cortical network activated by semantically meaningful stimuli regardless of sensory modality. Binder argued that this network — concentrated in heteromodal association areas including the angular gyrus, posterior middle temporal gyrus, and medial prefrontal cortex — instantiates what he termed crossmodal conjunctive representations (CCRs): abstract representational structures that encode conceptual similarity and support thematic, associative relationships across domains. According to this account, the brain does not simply replay sensorimotor simulations when processing concepts; rather, it accesses genuinely amodal conjunctive codes that have been abstracted away from any particular modality. Binder’s findings placed him in direct tension with [20], whose modal simulation framework resists the idea that meaning can be captured by representations detached from perceptual format. Broader consensus reviews have likewise acknowledged this tension, noting that embodied and disembodied processing likely co-occur across distinct cortical systems [33, 34].

This disagreement extends beyond mere terminology. If Binder is correct, then grounding in sensorimotor experience may be a developmental precondition for acquiring abstract representations rather than a constitutive feature of those representations once formed. This position has been developed most explicitly by [39], who argues that amodal representations become computationally necessary as cognition scales beyond perceptual particulars. Conversely, if Pezzulo and colleagues are correct, then the brain’s heteromodal regions are better understood as sites of multimodal convergence preserving modal structure, not as repositories of amodal abstraction [21]. The debate remains unresolved and marks perhaps the sharpest theoretical fault line in the field. More recently, [40] has proposed a mechanistic neural architecture in which cell assembly computation, coordinated through neural oscillations at multiple timescales, provides the biological substrate for converting distributed perceptual representations into hierarchical compositional structures. This proposal offers a potential bridge between the modal simulation framework of [20] and the formal requirements of compositional language processing, while highlighting that the neural implementation of compositionality involves continuous dynamical processes rather than the discrete operations assumed by classical symbolic accounts [41].

Reconciliation and the Distributional Bridge

Recognising that both embodied and distributional accounts had captured genuine phenomena, [5] proposed a synthesis that reframed the opposition itself. Andrews, Frank, and Vigliocco argued that embodied and distributional approaches to meaning are fundamentally complementary rather than genuinely competing paradigms. Their key conceptual move was to treat language as a physical environment that agents perceive and act upon — an entity in the world rather than merely a transparent window onto concepts. On this construal, distributional processing of linguistic co-occurrence patterns is itself a form of embodied cognition, subject to the same principles of sensorimotor engagement that Pezzulo and colleagues emphasise [20, 21]. This reframing has proven influential, but it arguably sidesteps rather than resolves the deeper question of whether abstract meaning ultimately reduces to, or merely correlates with, sensorimotor grounding. Dove [39], for instance, argues that language constitutes an independent “dis-embodied” representational system — embodied neurophysiologically but accruing meaning through symbolic association rather than sensorimotor content directly — a challenge that no purely reconciliationist account has fully answered. The tension between [38]’s amodal CCRs and [20]’s modal simulations persists even within a reconciliationist framework.

Symbol Emergence: Developmental and Semiotic Perspectives

A distinct but complementary line of critique targets the designer-assigned character of symbols in classical AI systems — the assumption that symbols are static tokens whose reference is fixed by external stipulation. Taniguchi et al. [2] challenged this assumption directly, arguing on the basis of a comprehensive survey of robotic symbol emergence research that symbol systems are better understood as emergent, self-organizing phenomena arising through interactions among cognitive agents and their environments. Rather than inheriting fixed meaning from a designer, genuine symbols — in humans and potentially in robots — crystallise through processes of perceptual categorisation, social negotiation, and adaptive development over time [1]. This framing extended Harnad’s critique by targeting not just the grounding deficit in formal computation but the static, pre-given character of classical symbol assignment — a deficit that Harnad himself identified as arising from the purely syntactic manipulation of tokens that lack any intrinsic connection to the world they purport to represent.

Building on this perspective, the semiotic dimension was elaborated further in a subsequent survey by Taniguchi et al. [19], which situated symbol emergence within the framework of semiosis — C.S. Peirce’s account of active, interpretive sign processes. On this view, symbols do not merely denote; they emerge through triadic sign relations involving agent, sign vehicle, and interpretant, evolving over developmental time through social interaction. Concept formation and the social negotiation of shared symbolic conventions are identified as more fundamental challenges than the mere attachment of signs to perceptual categories [1]. Together, the Taniguchi et al. surveys [2, 19] thus shift the emphasis from the static question Harnad posed — how does a symbol come to refer? — to the dynamic, developmental question of how symbol systems self-organise across interacting agents.

Gaps and Outstanding Questions

Despite this rich theoretical foundation, several significant gaps remain in the accounts surveyed thus far. Most concretely, the frameworks of [29], [20], and [38] are heavily weighted toward perceptual or action-related concepts — such as colour, shape, and motion — where grounding mechanisms are intuitively tractable. However, these accounts do not adequately formalize how abstract concepts like justice, freedom, or negation become grounded [39].

Crucially, abstract concepts of this kind — which are less bound by space, time, and direct sensory experience — may rely on partially distinct cognitive mechanisms from concrete ones, with language, social metacognition, and inner experience playing differentially prominent roles [42]. Recent neuroimaging work further complicates the picture: negation, for instance, has been shown to modulate rather than simply invert neural representations of the concepts it operates over [43], a finding that no current computational model adequately captures.

Additionally, the neuroimaging evidence assembled by [38] and the computational modelling traditions represented by [20] and [2] have developed largely in parallel, with insufficient integration between them [5, 33]. Furthermore, the developmental trajectory from pre-linguistic sensorimotor engagement to adult-level abstract thought — gestured at by [19] — remains poorly captured in any single computational model, representing perhaps the most significant frontier for future work.

However, the abstract grounding challenge is not a theoretical vacuum. Cognitive science has produced several substantive accounts of how abstract concepts may be grounded through embodied experience, metaphorical structure [44], and social interaction [42]. Although these frameworks remain largely unoperationalised in computational systems, they carry specific predictions that bear directly on the design and evaluation of multimodal AI.

Grounding Abstract Concepts: Metaphor, Simulation, and Social Interaction

The most influential theoretical response to the abstract concept challenge has been Conceptual Metaphor Theory (CMT), first articulated by [45] and subsequently elaborated across linguistics, cognitive psychology, and neuroscience. CMT’s central claim is that abstract conceptual domains are not understood on their own terms but are systematically structured by metaphorical mappings from concrete, experientially grounded source domains. On this account, the concept of TIME is not grasped directly but through the metaphor TIME IS MONEY (yielding entailments such as “spending time” and “wasting time”), while LOVE is structured by the mapping LOVE IS A JOURNEY, generating inferences about relationships reaching “dead ends” or being “on the right track” [45]. Crucially, these are not decorative linguistic devices but constitutive cognitive structures: abstract concepts are, on this view, partially constituted by their metaphorical mappings to embodied experience, and the logical entailments carried by each metaphor produce coherent clusters of related inferences that structure reasoning within the target domain. A recent comprehensive synthesis of the Lakoff-Johnson programme [46] has reinforced these claims while emphasising their biological plausibility. On that synthesis, image schemas — recurrent patterns abstracted from sensorimotor experience, such as CONTAINER, PATH, and FORCE — serve as the foundational bridging structures connecting prelinguistic bodily experience to the structured conceptual domains that language encodes. The resulting account offers a biologically grounded alternative to objectivist semantics, treating truth and meaning as relative to embodied understanding rather than to mind-independent features of the external world.

[44] extended CMT into a neurocognitive framework, proposing that primary metaphors — the elemental building blocks of metaphorical thought — arise through Hebbian learning when concrete sensorimotor schemas are regularly co-activated with subjective experiential states during development. On this account, the metaphor AFFECTION IS WARMTH is not learned through explicit instruction but crystallises from the repeated co-occurrence of physical warmth and affective comfort in early caregiving contexts. Once established, these neural circuits link concrete sensorimotor brain regions to abstract reasoning areas, enabling the extension of embodied knowledge into domains that have no direct perceptual counterpart. The neural exploitation hypothesis of [32] provides a complementary mechanistic account, arguing that sensory-motor circuits evolved for action and perception are co-opted — exploited — for conceptual and linguistic functions, with conceptual metaphor serving as the primary vehicle through which this exploitation extends into abstract thought. On this view, simulation — the same mechanism that [20] proposed for concrete concept understanding — operates in abstract domains as well, but through metaphorically mediated re-use of sensorimotor neural substrates rather than through direct perceptual re-enactment.

Empirical evidence for the embodied processing of metaphorical language has accumulated across behavioural and neuroimaging paradigms. [47] demonstrated that performing or merely imagining a physical action (such as pushing) facilitates comprehension of metaphorical phrases that recruit the same action schema (such as “push an argument”), with reading times significantly faster when body movement matched the metaphor’s source domain. This finding is difficult to explain on a purely amodal account of metaphor comprehension; instead, it supports the claim that abstract language understanding involves partial sensorimotor simulation of the concrete actions from which metaphors derive. Neuroimaging work by [48] extended this picture by showing that idiomatic sentences containing action words (e.g., “grasp an idea”) elicit somatotopic activation patterns along the motor strip — arm-related idioms activating lateral motor regions, leg-related idioms activating dorsal motor regions — paralleling the activation patterns observed for literal action sentences. That somatotopic organisation persists even for highly conventionalised figurative expressions suggests that the sensorimotor grounding of metaphor is not merely a developmental scaffold but remains constitutively involved in online language processing.

However, CMT alone cannot account for the full range of abstract concepts. Many abstract terms — democracy, epistemology, obligation — resist straightforward metaphorical decomposition into concrete source domains, and the theory has been criticised for offering post hoc metaphorical analyses rather than predictive accounts of how novel abstract concepts emerge. Recognising these limitations, [6] situated metaphorical grounding within the broader framework of grounded cognition, arguing that conceptual processing involves the partial re-enactment of multimodal experiential states — including not only sensorimotor but also introspective and affective components — that are contextually appropriate to the current situation. On Barsalou’s account, abstract concepts are grounded not solely through metaphorical mapping but through the situated simulation of the complex experiential contexts in which they were acquired and used, drawing on modal systems for perception, action, and introspection rather than amodal data structures. Recent empirical work provides convergent evidence for this broader grounding substrate: [49] demonstrated that abstract concepts such as “success” can modulate low-level perceptual phenomena, with a mini meta-analysis across four experiments (N = 271) revealing a small but statistically significant effect (d = 0.27, BF₁₀ = 13.25) whereby targets described as successful were mislocalized further forward in a representational momentum paradigm than unsuccessful ones — suggesting that abstract meaning interfaces with physical invariants in ways that extend grounding beyond the body-action mappings emphasised by CMT.

A fundamentally different route to abstract grounding is proposed by the Words As social Tools (WAT) theory [42], which argues that abstract concepts are grounded not primarily in sensorimotor experience but in linguistic, social, interoceptive, and metacognitive processes. WAT’s core claim is that abstract concepts are acquired predominantly through social and linguistic interaction rather than through direct perceptual encounter, and that their processing correspondingly recruits brain areas associated with language production and social cognition to a greater degree than concrete concepts do. The theory proposes four tenets: abstract concepts rely more heavily on social and linguistic input during acquisition; they engage language-related and social-cognitive neural systems; they involve greater activation of the oral motor system (reflecting the role of inner speech and verbal mediation); and their representation varies across languages and cultures [42]. More recently, [50] have elaborated the social dimension further, distinguishing socialness — the degree to which a concept evokes social scenarios — from social metacognition — the need to negotiate and validate meaning through interaction with others. On this account, the characteristic difficulty of abstract concepts arises not from their distance from sensory experience per se but from their semantic indeterminacy, which makes social calibration essential for their acquisition and stable use. This perspective resonates with the developmental emphasis of [19] on social negotiation as a driver of symbol emergence, while providing a more specific account of why abstract concepts in particular depend on social scaffolding.

A pluralistic synthesis of these positions was proposed by [39], who argued that conceptual representations are encoded through two complementary systems: embodied sensorimotor simulations and what he termed dis-embodied linguistic representations. On Dove’s account, natural language serves as a cognitive tool that enables abstraction and generalisation precisely because linguistic symbols derive meaning partly from their associations within a symbolic system rather than exclusively from sensorimotor grounding. Abstract concepts, on this view, rely more heavily on linguistic representations, while concrete concepts draw more directly on sensorimotor simulation — a division of labour that resonates with the concreteness effects observed across behavioural and neuroimaging literatures. This pluralistic framework accommodates both [38]’s evidence for amodal conceptual representations and the embodied simulation findings of [20] and [48], treating them as descriptions of different components of a heterogeneous conceptual system rather than as competing accounts of a unitary representational format.

Taken together, these frameworks — CMT, situated simulation, WAT, and pluralistic accounts — demonstrate that the grounding of abstract concepts is a domain with substantive, empirically supported theoretical proposals rather than an uncharted frontier. Each framework generates specific predictions for multimodal AI: CMT predicts that abstract concepts structured by embodied metaphors should benefit from multimodal training that instantiates the relevant concrete source domains; WAT predicts that abstract concept acquisition should benefit from social and dialogic training contexts rather than passive image-text pairing; the pluralistic account predicts a graded asymmetry in which visual grounding improves concrete representations more reliably than abstract ones. As subsequent sections will demonstrate, these predictions remain almost entirely unoperationalised in the computational literature — a gap that is not merely an engineering oversight but a missed opportunity to leverage decades of cognitive science theory in the design and evaluation of multimodal systems.

Collectively, CMT, situated simulation, WAT, and pluralistic frameworks converge on a shared implication: that genuine grounding of abstract concepts requires training regimes, architectural affordances, or interactive contexts that go substantially beyond the passive pairing of images and text on which current multimodal systems are built. The theoretical disagreements among these accounts — over whether the primary grounding substrate is sensorimotor, social, metaphorical, or heterogeneous — do not dissolve this shared implication but rather specify different dimensions along which existing architectures are likely to fall short. The sections that follow test these theoretical expectations against the computational and empirical record, moving from embodied robotic systems that attempt to operationalise physical grounding, through vision-language models that approximate it through large-scale perceptual co-training, to the systematic failure modes that reveal where the gap between theoretical adequacy and current implementation remains most acute.

4. Symbol Grounding in Robotics and Embodied AI Systems

Symbol grounding in robotics and embodied AI systems concerns the central engineering challenge of connecting linguistic symbols to perceptual and motor representations so that artificial agents can understand and act on language about the physical world. While the theoretical problem was articulated by Harnad in the early 1990s, its computational operationalisation has followed a distinct trajectory — from probabilistic graphical frameworks grounding individual route instructions to neuro-symbolic architectures capable of compositional concept learning, and more recently to embodied agents navigating real and simulated environments using open-vocabulary visual grounding. The field has not evolved linearly but through a series of competing commitments: between interpretability and performance, between symbolic structure that must be provided versus structure that can be learned, and between benchmark accuracy and generalisation to the open world.

Probabilistic Foundations: Structure, Interpretability, and Their Costs

Early computational work on robotic symbol grounding was distinguished by its commitment to explicit probabilistic structure. The Generalized Grounding Graphs (G³) framework [51] represented natural language instructions as factor graphs whose variables corresponded to groundings of noun phrases, verbs, and spatial relations to perceptual referents in a robot’s environment. By factoring the joint probability of a grounding into aligned sub-problems, G³ could be trained from instruction-action pairings and could learn subtle distinctions between spatial prepositions such as to, past, and toward — distinctions that require integrating trajectory and endpoint semantics. The framework achieved 67% accuracy on route-following tasks against a human baseline of 85% [51]. This performance gap is instructive: it marks not merely a performance ceiling but the cost of the transparency built into the approach.
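The factored-scoring idea behind G³ can be illustrated with a toy sketch: the probability of a grounding is approximated as a product of per-constituent factors, here one factor for the noun phrase and one for the spatial relation. Every landmark, coordinate, and factor function below is invented for illustration and bears no relation to the actual G³ implementation or its learned parameters.

```python
# Toy world: candidate landmark groundings for noun phrases.
LANDMARKS = {"the door": (4.0, 0.0), "the hallway": (2.0, 3.0)}

def factor_noun(phrase, obj):
    """Toy factor: high score when the noun phrase names the candidate."""
    return 1.0 if phrase in LANDMARKS and obj == phrase else 0.1

def factor_spatial(prep, path_end, landmark_pos):
    """Toy factor scoring how well a path endpoint satisfies 'to'/'past'."""
    dx = path_end[0] - landmark_pos[0]
    dy = path_end[1] - landmark_pos[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if prep == "to":        # endpoint should be near the landmark
        return 1.0 / (1.0 + dist)
    if prep == "past":      # endpoint should lie beyond the landmark
        return dist / (1.0 + dist)
    return 0.5

def grounding_score(prep, phrase, path_end):
    """Product of factors, mimicking a factored joint over groundings."""
    return max(
        factor_noun(phrase, obj) * factor_spatial(prep, path_end, pos)
        for obj, pos in LANDMARKS.items()
    )

# 'go to the door' prefers a path ending at the door over one ending far away.
near = grounding_score("to", "the door", (4.0, 0.1))
far = grounding_score("to", "the door", (9.0, 9.0))
print(near > far)  # True
```

The point of the factorisation is transparency: each factor can be inspected and trained from instruction-action pairings independently, which is exactly the interpretability-for-performance trade the section describes.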

Contemporaneously, [1] provided the field with a taxonomic clarification that has since proved durable. Their review distinguished physical symbol grounding — the anchoring of symbols to sensorimotor categories through mechanisms such as self-organizing maps and supervised classifiers — from social symbol grounding, wherein shared symbolic meanings emerge through inter-agent negotiation and communicative use. This distinction matters because, as [1] observed, the bulk of computational implementation had concentrated on the physical branch, with social grounding remaining largely a theoretical aspiration.

The Symbol Emergence in Robotics (SER) framework [2] later formalized this critique, arguing that treating symbol systems as static and individually grounded — rather than as emergent, self-organizing phenomena shaped by social interaction — reflects a fundamental flaw inherited from the physical symbol system hypothesis. On the SER account, symbols are inherently dynamic triadic processes coupling individual sensorimotor experience with socially negotiated convention. This view underscores why engineering social grounding demands more than scaling up physical grounding methods. That asymmetry persists: the engineering infrastructure for embodied physical grounding has advanced dramatically since 2013 [2], while social grounding remains comparatively underimplemented.

Neuro-Symbolic Concept Learning: What Structure Must Be Provided?

The most sustained debate within this theme concerns the degree to which symbolic structure must be engineered into a grounding system or can instead emerge from supervision. The Neuro-Symbolic Concept Learner (NS-CL) [52] staked out a strong position: by jointly training a neural perception front-end (Mask R-CNN object segmentation combined with ResNet attribute encoding) and a symbolic reasoning module operating over disentangled concept representations, NS-CL achieved 98.9% accuracy on the CLEVR visual question answering benchmark without requiring annotated program traces. Critically, NS-CL maintained this accuracy when trained on only 5,000 images — 10% of the available data — while comparable purely neural baselines fell to 51–68% accuracy [52]. This data efficiency is the empirical signature of the symbolic component: structured representations do more with less evidence.

Yet NS-CL still requires a curriculum of increasingly complex questions and relies on the availability of scene-parse annotations during training. LEFT [53] challenged this dependency directly. Rather than hardcoding a symbolic executor, LEFT uses a large language model to translate natural language queries into executable logical programs, enabling zero-shot transfer to novel and compositionally complex tasks across seven benchmarks without task-specific training [53]. The shift is substantive: where NS-CL learns a symbolic executor jointly with visual representations, LEFT borrows symbolic structure from a pre-trained language model’s implicit knowledge of logical form. This distinction maps onto the broader question of whether symbol grounding is fundamentally a learning problem or a translation problem — and LEFT’s strong generalisation performance suggests that foundation model priors can substitute for much of what structured training previously provided.
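The translation-versus-learning contrast can be made concrete with a toy sketch of the execution side: a hand-written logical program stands in for what an LLM translator might emit for a query, and is run over a symbolic scene representation such as a neural front-end might produce. The scene format, primitive names, and program encoding here are hypothetical simplifications, not LEFT's actual interface.

```python
# Toy scene: objects with symbolic attributes, as a perception module might emit.
scene = [
    {"id": 0, "shape": "cube",   "color": "red"},
    {"id": 1, "shape": "sphere", "color": "red"},
    {"id": 2, "shape": "cube",   "color": "blue"},
]

# Executable primitives a symbolic executor might expose.
def filter_attr(objs, attr, value):
    return [o for o in objs if o[attr] == value]

# Program an LLM translator might emit for "how many red cubes are there?"
# (hand-written here purely for illustration).
program = [("filter", "color", "red"), ("filter", "shape", "cube"), ("count",)]

def execute(program, objs):
    """Run a sequence of (op, *args) steps over the scene."""
    result = objs
    for step in program:
        if step[0] == "filter":
            result = filter_attr(result, step[1], step[2])
        elif step[0] == "count":
            result = len(result)
    return result

print(execute(program, scene))  # 1
```

Under this framing, "grounding as translation" means only the perception module and the primitives must be learned; the compositional structure of the query arrives for free in the emitted program.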

NeSyCoCo [24] complicates this picture further by demonstrating that neither approach is uniformly superior. On CLEVR-CoGenT Split B — a compositional generalisation benchmark requiring held-out attribute-object combinations — NeSyCoCo outperforms both MDETR (an end-to-end neural model) and LEFT, with its soft symbolic reasoner identified as the single most statistically significant contributor to performance gains [24]. On ReaSCAN, a spatial reasoning benchmark with complex multi-relational instructions, NeSyCoCo similarly surpasses prior methods on the hardest test splits [24]. The implication is that explicit symbolic structure — particularly a differentiable reasoner that can propagate uncertainty across compositional steps — remains important for systematic generalisation even when foundation models are available as components. The debate, then, is not resolved in favour of either pure learnability or pure provision; rather, the 2025 evidence suggests that where symbolic structure is introduced in the processing pipeline, and how it interacts with neural uncertainty, matters as much as its presence or absence.
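The idea of a soft symbolic reasoner can be sketched minimally: predicates return scores in [0, 1] rather than hard booleans, so perceptual uncertainty propagates through compositional steps instead of being discarded at each filter. The predicate scores below are hypothetical stand-ins for neural-module outputs, and product-style conjunction is just one of several soft-logic choices.

```python
def soft_and(*scores):
    """Product t-norm: one soft choice of fuzzy conjunction."""
    p = 1.0
    for s in scores:
        p *= s
    return p

# Hypothetical per-object predicate scores from a perception module.
objects = {
    "obj0": {"red": 0.95, "cube": 0.90},
    "obj1": {"red": 0.93, "cube": 0.15},   # a red sphere: low 'cube' score
}

# Query 'red cube': compose scores instead of hard-filtering, so a
# borderline detection degrades the answer rather than vanishing from it.
composite = {k: soft_and(v["red"], v["cube"]) for k, v in objects.items()}
best = max(composite, key=composite.get)
print(best)  # obj0
```

A hard filter at threshold 0.5 would have produced the same answer here, but on compositionally novel attribute-object combinations, where individual predicate scores are less confident, the soft composition preserves a ranking that hard thresholding destroys.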

The significance of NeSyCoCo’s results is sharpened when situated within the broader literature on compositional generalization in neural networks. Controlled benchmarks such as COGS [13] have established that standard sequence-to-sequence models routinely achieve near-perfect in-distribution accuracy while collapsing to 16–35% on compositionally novel structures — a gap that synthetic robotic environments partially obscure because their compositional complexity is constrained by design. Complementary work on structural modification of Transformers has demonstrated that targeted architectural and data augmentation interventions can substantially recover compositional performance on such benchmarks [22], further sharpening the diagnostic value of the accuracy gap as an indicator of genuine symbolic deficiency rather than mere training insufficiency. The five-dimensional decomposition of compositional competence proposed by [14] — systematicity, productivity, substitutivity, localism, and overgeneralization — provides a framework for interpreting why neuro-symbolic architectures like NeSyCoCo outperform end-to-end neural alternatives on specific test splits: the symbolic reasoning component enforces systematicity and substitutivity along compositional dimensions that purely neural models leave to chance. At the same time, the meta-learning results of [23], which demonstrate that standard Transformer architectures can approach human-level systematicity when trained on distributions of compositional tasks, raise the question of whether NeSyCoCo’s symbolic scaffolding is a permanent architectural necessity or a compensatory mechanism for training regimes that do not adequately incentivize compositional generalisation. 

Additionally, [54] address the reliability of the neural components within neuro-symbolic pipelines, introducing a conformal-prediction-based active learning framework that achieves 98.4% success in recovering ground-truth symbolic programs through targeted human interaction — a result that highlights how errors at the neural-symbolic interface, if left unmanaged, propagate into systematic grounding failures downstream.

Embodied Navigation: Grounding at the Action Level

A parallel operationalisation of symbol grounding concerns instruction-following in navigation tasks, where grounding must connect not just words to objects but sequences of instructions to extended physical trajectories [51]. UAV-VLN [55] addresses this problem for unmanned aerial vehicles through a staged pipeline in which a fine-tuned TinyLlama-1.1B model parses natural language instructions into structured task representations, an automated planner generates waypoint sequences, and open-vocabulary detectors execute visual grounding in the scene. The central empirical finding is that open-vocabulary detectors — Grounding DINO and CLIPSeg — substantially outperform closed-vocabulary alternatives such as YOLO, with Grounding DINO yielding the highest overall success rates [55]. This result connects directly to the concern raised by [1] about the brittleness of symbol-to-percept mappings: closed-vocabulary detection encodes the grounding problem as a fixed lookup, while open-vocabulary detection operationalises something closer to genuine concept generalisation, matching the flexibility that language users expect of their interlocutors. Complementary evidence for the importance of flexible, open-ended grounding in navigation comes from VELMA [56], which demonstrates that verbalising visual observations via CLIP-based landmark scoring — rather than relying on fixed visual categories — enables LLM agents to navigate real-world urban street networks from natural language instructions, achieving over 25% relative improvement over prior state-of-the-art on the Touchdown and Map2seq benchmarks.
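The open- versus closed-vocabulary distinction reduces to a difference in scoring: closed-vocabulary detection looks a phrase up in a fixed label set, while open-vocabulary detection scores every image region against a free-form text embedding. The sketch below uses random placeholder vectors in place of learned CLIP-style encoders; real systems such as Grounding DINO and CLIPSeg learn this shared embedding space from data.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings standing in for CLIP-style text/region encoders.
rng = np.random.default_rng(0)
query_vec = rng.normal(size=64)                        # free-form phrase embedding
regions = {
    "region_a": query_vec + 0.1 * rng.normal(size=64), # region matching the phrase
    "region_b": rng.normal(size=64),                   # unrelated region
}

# Open-vocabulary grounding: rank regions by similarity to arbitrary text,
# with no fixed label inventory mediating the match.
scores = {name: cosine(vec, query_vec) for name, vec in regions.items()}
best = max(scores, key=scores.get)
print(best)  # region_a
```

Because the query is an embedding rather than a category index, a phrase never seen during detector training can still be grounded — which is exactly the generalisation property the navigation results reward.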

Limitations and Open Questions

Despite this trajectory of progress, the field carries several persistent limitations. Most neuro-symbolic architectures — including NS-CL, NeSyCoCo, and LEFT — are evaluated on synthetic or semi-synthetic environments (CLEVR, ReaSCAN, CLEVR-CoGenT) where scene complexity, lighting variation, and object diversity are tightly controlled [52, 24, 53]. Transfer to real-world visual complexity remains largely untested, and the gap between synthetic benchmark performance and genuine robotic deployment is substantial. Recent work on large vision-language model robustness confirms that even state-of-the-art models remain brittle to fundamental visual variations — including illumination shifts, viewpoint changes, and texture perturbations — that are systematically absent from controlled benchmarks [11, 12]. The earlier probabilistic approach of G³ [51] was at least evaluated on physical robot data, a methodological advantage that the neuro-symbolic literature has not systematically recovered. Furthermore, as [1] identified, social symbol grounding — grounding through communicative interaction with other agents rather than through individual sensorimotor experience — has attracted almost no corresponding engineering effort in the subsequent decade [19]. Long-horizon task grounding, where an agent must integrate multiple sequential instructions over extended timeframes, remains challenging; UAV-VLN begins to address this but in simplified simulated environments [55], while embodied navigation approaches such as VELMA [56] suggest that verbalized environment representations may offer one path forward. Finally, the integration of multimodal foundation models with structured symbolic planners — the architectural synthesis implied by both LEFT and UAV-VLN — is still exploratory, and principled methods for deciding how much structure to delegate to foundation model priors versus to task-specific learned or engineered components remain an open design question [4].

5. Multimodal Distributional Semantics and Visually Grounded Language Learning

Multimodal distributional semantics sits at the intersection of cognitive science, computational linguistics, and machine perception, addressing a fundamental question: does exposure to perceptual signals during language learning produce qualitatively better semantic representations than text co-occurrence alone? The work reviewed here traces a trajectory from early fusion models combining text and image statistics, through deep-learning systems that learn language directly from raw speech and visual input, to architectures capable of discovering word-like units from audio-visual co-occurrence without any transcription — revealing both the power and the persistent limitations of grounding language in perception.

Multimodal Fusion and the Complementarity of Modalities

Early foundational work established the core empirical case for visual grounding in semantic models. [57] demonstrated that multimodal distributional models combining text co-occurrence statistics with visual features extracted from images consistently outperform text-only baselines on semantic relatedness benchmarks. A key theoretical contribution of that work is the observation that the two modalities carry systematically different information: image-derived vectors excel at capturing perceptual attributes — colour above all — while text-based vectors encode encyclopedic and relational properties [5]. The best-performing configuration, a feature-level fusion model with equal modality weighting (TunedFL), exploits this complementarity rather than subordinating one signal to the other [57]. More recent work on large-scale multimodal integration has continued to confirm this division of labour, with visual and linguistic channels making distinct, non-redundant contributions to shared semantic representations [58].
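Feature-level fusion of the kind the TunedFL configuration exemplifies can be sketched in a few lines: each modality's vector is normalised, weighted, and concatenated, and similarity is computed in the fused space. The 4-dimensional vectors and concept names below are invented toy data; real distributional models use vectors of hundreds of dimensions learned from corpora and images.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

def fuse(text_vec, image_vec, alpha=0.5):
    """Feature-level fusion: normalise each modality, weight, concatenate.
    alpha=0.5 gives the equal modality weighting described in the text."""
    return np.concatenate([alpha * l2norm(text_vec),
                           (1 - alpha) * l2norm(image_vec)])

def similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors for three concepts.
dog_text, dog_img = np.array([1.0, 0.2, 0.0, 0.1]), np.array([0.9, 0.1, 0.0, 0.0])
cat_text, cat_img = np.array([0.9, 0.3, 0.1, 0.1]), np.array([0.8, 0.2, 0.1, 0.0])
car_text, car_img = np.array([0.0, 0.1, 1.0, 0.9]), np.array([0.1, 0.0, 1.0, 0.8])

dog, cat, car = (fuse(dog_text, dog_img), fuse(cat_text, cat_img),
                 fuse(car_text, car_img))
print(similarity(dog, cat) > similarity(dog, car))  # True
```

Because the two halves of the fused vector are normalised separately, neither modality can dominate the similarity by sheer magnitude — the concatenation exploits complementarity rather than subordinating one signal to the other.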

This finding immediately raised a question that subsequent research has revisited repeatedly: are the gains uniform across the lexicon, or do they cluster around certain concept types? The answer from [57] was already suggestive — benefits are most reliable for concrete, imageable nouns, the very concepts for which visual input provides rich, stable signal. For abstract vocabulary the picture remained murky, and this asymmetry has since become one of the field’s central unresolved tensions [38, 42, 39]. The problem is not merely technical but theoretical: if visual grounding primarily scaffolds concrete concepts, then multimodal distributional models cannot claim to offer a general account of meaning [5, 30].

From Text-Paired Images to Raw Speech: The Deep Learning Turn

The shift from bag-of-words distributional models to deep neural architectures opened a substantially different research paradigm, one in which the goal is not to augment a text-based model with visual features but to learn language representations from scratch using audio and vision alone. [18] provides a systematic map of this transition, identifying two historical phases: a cognitive-science-oriented wave running roughly from 1999 to 2005, concerned with modelling infant cross-situational word learning, and a deep-learning-driven phase from approximately 2015 onward, motivated more directly by engineering goals. The survey is valuable precisely because it foregrounds continuities across these phases — the underlying hypothesis that linguistic structure can emerge from audio-visual co-occurrence has remained constant even as the methods have transformed entirely [18].

A consistent architectural pattern has emerged in the more recent phase: visual encoders are typically initialised with supervised pre-training on large image classification tasks [59], while audio encoders vary considerably — convolutional networks, recurrent networks, and hybrid architectures have all been explored [27, 59]. A notable example is the multi-layer recurrent highway network proposed by [27], trained on synthetically spoken MS COCO captions, which was specifically motivated as a more faithful alternative to prior convolutional approaches for capturing the temporal structure of continuous speech. Separately, [59] demonstrated that a convolutional audio-visual model trained on over 500 hours of speech could discover word-like acoustic units and their visual groundings without any text supervision — with the resulting embedding space organising words both acoustically and semantically. Crucially, [18] notes that no clear winner has emerged between convolutional and recurrent audio encoders, a finding with implications for theoretical claims about what inductive biases are necessary for speech-grounded learning. If the architecture does not determine what linguistic structure emerges, then the structure is likely driven more by the statistics of the audio-visual pairing than by the encoder’s temporal processing assumptions — a point that invites more controlled ablation studies than the field has so far conducted.

Layerwise Organisation of Grounded Speech Representations

A more fine-grained empirical picture of what grounded speech models actually learn comes from [27], which probed the internal representations of a multi-layer recurrent highway network trained to associate spoken utterances with images. The central finding is a layerwise dissociation: lower layers predominantly encode form-related properties of the speech signal — phonetic detail, acoustic surface structure — while higher layers progressively encode semantic content, with semantic richness peaking at the uppermost layers. Form-related encoding, by contrast, peaks at intermediate layers and then decays as network depth increases [27]. A broader survey of architectures and evaluation techniques in this area confirms that this representational hierarchy — lower layers for form, higher layers for meaning — is a recurring finding across visually grounded speech models [18].
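The probing methodology behind these layerwise findings is simple to sketch: fit a linear probe on each layer's activations and compare how well a property of interest can be decoded. The sketch below simulates layers whose semantic signal strengthens with depth; the signal-to-noise mixture is an invented assumption standing in for real network activations, and the closed-form ridge probe is one of several standard probe choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_accuracy(X, y):
    """Fit a closed-form ridge probe and report its decoding accuracy."""
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)
    return float(np.mean((X @ w > 0.5) == (y > 0.5)))

# Simulated activations: a binary semantic property embedded along a fixed
# direction, with the signal fraction growing at deeper "layers".
n, d = 200, 16
labels = rng.integers(0, 2, size=n).astype(float)
signal = np.outer(labels, rng.normal(size=d))
accuracies = []
for mix in [0.1, 0.4, 0.8]:          # shallow -> deep layer
    layer = mix * signal + (1 - mix) * rng.normal(size=(n, d))
    accuracies.append(probe_accuracy(layer, labels))

print(accuracies[0] < accuracies[-1])  # deeper layers decode the property better
```

Real probing studies add held-out evaluation and probe-complexity controls, but the logic is the same: a monotone rise in decoding accuracy across depth is the empirical signature behind claims that "semantic richness peaks at the uppermost layers".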

This layerwise emergence of semantics closely parallels findings from probing studies of text-only neural language models, suggesting that the organisational principle may be a general feature of deep representation learning rather than specific to the visual grounding setup. Probing studies of models such as BERT have demonstrated a consistent hierarchical ordering, in which surface-level properties such as part-of-speech are resolved in early layers, while deeper semantic and coreference information distributes across higher layers [60, 61]. However, [27] also identify capabilities unique to the grounded speech model: its attention mechanism allows it to outperform text-based models on inputs involving misspellings or unusually long utterances, a robustness that arises from operating over continuous acoustic representations rather than discrete orthographic tokens. This finding complicates simple comparisons between grounded and text-only systems, since the two architectures are exposed to fundamentally different input spaces, making direct performance comparisons difficult to interpret without careful experimental control.

Unsupervised Discovery of Word-Like Units

Perhaps the most ambitious instantiation of visually grounded learning is the attempt to recover word-like structure directly from unlabelled audio-visual data — bypassing not only manual transcription but any pre-existing lexical categories. [59] trained a multimodal neural network on over 200,000 image-caption pairs in which captions were spoken aloud, and showed that the resulting system discovers audio segments that function as word-like units: semantically coherent clusters emerge in the acoustic embedding space, with cluster variance serving as a reliable predictor of cluster purity. Low-variance clusters tend to be large and lexically coherent, while high-variance clusters are acoustically diverse and semantically incoherent — what the authors term “garbage” clusters [59]. The system achieved image retrieval performance (R@10 of 0.431) that substantially exceeds an unsupervised baseline and approaches a text-supervised variant (0.525), a notable result given the absence of any transcription during training.
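Retrieval figures such as R@10 are worth unpacking: given a similarity matrix between queries and candidates in which the true pair sits on the diagonal, recall@k is the fraction of queries whose true match ranks in the top k retrieved items. A minimal sketch on random synthetic similarities (not the actual model scores):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match (the diagonal entry of
    `sim`, rows = queries, columns = candidates) ranks in the top k."""
    order = np.argsort(-sim, axis=1)          # candidates sorted best-first
    ranks = np.array([int(np.where(order[i] == i)[0][0])
                      for i in range(sim.shape[0])])
    return float((ranks < k).mean())

rng = np.random.default_rng(1)
n = 50
sim = rng.normal(size=(n, n))
sim[np.arange(n), np.arange(n)] += 2.0        # boost the true pairs
r1, r10 = recall_at_k(sim, 1), recall_at_k(sim, 10)
```

By construction r1 can never exceed r10, which is why papers typically report R@1, R@5, and R@10 together as progressively more forgiving views of the same ranking.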

These results are consistent with the layerwise stratification that [27] document in grounded speech models (surface, form-related properties in lower layers, progressively richer semantics in higher layers, as discussed above), a division of labour that mirrors patterns observed in human language processing. The emergence of word-level structure without explicit segmentation signals is further corroborated across the broader literature reviewed in [18].

The results of [59] extend the empirical claims of [57] into an entirely unsupervised setting and raise the possibility that lexical segmentation is not a prerequisite for semantic learning but may instead be a consequence of it. Where [57] assumed a pre-existing vocabulary and asked whether visual features improve its semantic representations, [59] asked whether vocabulary-like structure emerges at all, and found a qualified affirmative answer.

Persistent Gaps and Open Debates

Despite these advances, several significant limitations persist. The concreteness asymmetry identified in [57] has not been resolved by the subsequent shift to grounded speech architectures; abstract concepts remain poorly served by visually grounded methods across the literature [5, 38]. As discussed in Section 3, this asymmetry is not theoretically unexpected — pluralistic accounts of conceptual representation predict that abstract concepts rely more heavily on linguistic and social grounding than on perceptual input [39, 42], while Conceptual Metaphor Theory suggests that visual grounding should benefit abstract concepts only insofar as it instantiates the concrete source domains from which metaphorical structure derives [44]. Yet no existing multimodal model has been designed to operationalise these predictions — for instance, by incorporating metaphorical cross-domain mappings as explicit training signals, by leveraging social and dialogic contexts for abstract concept acquisition, or by architecturally separating the grounding pathways for concrete perceptual content and abstract linguistically mediated content. The result is that the concrete-abstract asymmetry, though theoretically explicable, remains computationally unaddressed.

Evaluation practice raises further concerns. The evaluation environments across most studies — curated image-caption datasets, controlled audio conditions — are far removed from the naturalistic audio-visual streams in which human language acquisition occurs, a gap that [18] explicitly flags as limiting the ecological validity of existing benchmarks. The lack of systematic comparison between the linguistic knowledge that emerges in visually grounded versus text-only models at equivalent scale means that grounding's specific contribution remains difficult to isolate [18, 27]. Finally, the entire research programme remains heavily centred on English and a small number of well-resourced languages [18], a bias that the broader survey literature on multimodal pre-training [3] confirms extends to dataset construction and evaluation, while the extension to non-visual sensory modalities (haptic, olfactory, proprioceptive), which plausibly ground important semantic distinctions [33, 34], has attracted almost no computational attention.

6. Vision-Language Architectures, Pretraining Frameworks, Datasets, and Evaluation Benchmarks

The technical infrastructure underlying vision-language research — spanning model architectures, pretraining strategies, data resources, and evaluation protocols — has undergone fundamental transformation over the past decade. Early work established a conceptual vocabulary for the core challenges of cross-modal learning, while subsequent architectural innovations and large-scale datasets enabled empirical breakthroughs. The most recent advances have pushed evaluation methodology beyond simplistic reference matching, though important gaps in how the field measures genuine visual grounding persist. Understanding this infrastructure is essential because the quality of architectures, data, and metrics directly conditions all downstream claims about multimodal understanding.

Conceptual Foundations and Challenge Taxonomy

A principled framework for understanding the difficulties inherent to multimodal learning was laid out early by [58], whose taxonomy identified five core technical challenges: representation, translation, alignment, fusion, and co-learning. The heterogeneity problem — the fact that language is fundamentally symbolic while audio and visual signals are continuous and high-dimensional — was foregrounded as the root difficulty from which others derive [58]. This symbolic–perceptual divide connects directly to the classical symbol grounding problem [1, 2], which asks how discrete linguistic symbols acquire meaning through connection to continuous sensory experience. The taxonomy proved durable, providing subsequent architectural work with a shared vocabulary for locating specific design choices within a broader technical landscape [4]. Representation concerns how individual modalities are encoded; fusion concerns how signals are combined; alignment concerns the discovery of cross-modal correspondences at sub-element granularity. Each challenge maps onto distinct architectural decisions that researchers have since operationalized in concrete ways.

The taxonomy was subsequently expanded and deepened by [62], who grounded the field in three organising principles — modality heterogeneity, cross-modal connections, and multimodal interactions — and proposed a six-challenge framework that refines the original five into representation, alignment, reasoning, generation, transference, and quantification. The addition of quantification as a first-class challenge is particularly consequential: it addresses how to measure, attribute, and benchmark the specific contributions of each modality to a multimodal prediction, a problem with direct implications for interpretability and robustness evaluation that the earlier taxonomy left implicit [62, 17]. The reasoning challenge, similarly, makes explicit a capacity that was folded into fusion in the earlier framework but that — as subsequent probing [18, 10] and compositional generalization work [15, 24] has demonstrated — involves representational demands qualitatively distinct from feature combination. Together, the two frameworks provide the conceptual scaffolding against which specific architectural choices can be situated and evaluated.

Dense Annotation Infrastructure: Visual Genome

Addressing the alignment challenge empirically required datasets with far richer structure than image-level labels. Visual Genome [63] was a pivotal contribution in this regard, providing densely crowdsourced annotations that connected language directly to image regions through object instances, attributes, and relationships. With 3.84 million object instances across 33,877 unique categories and an average of 35 objects per image, the dataset dramatically exceeded the category coverage of contemporaries such as ImageNet, PASCAL, and MS-COCO [63]. Critically, joint training of object and attribute classifiers on Visual Genome improved top-one accuracy from 18.97% to 43.17%, demonstrating that co-occurrence statistics between objects and their attributes carry genuine discriminative signal [63]. This resource became a standard component in pretraining pipelines for region-grounded architectures, supplying the fine-grained supervision that image-caption pairs alone could not provide. Concrete examples include LXMERT [7], which aggregated 9.18 million image-sentence pairs and relied on detected bounding-region embeddings whose object-level features were drawn from Visual Genome-trained detectors, and Unicoder-VL [8], which similarly employed Faster R-CNN region features pre-trained on Visual Genome to ground cross-modal representations before fine-tuning on downstream retrieval and reasoning tasks.

Cross-Modal Transformer Architectures and Pretraining

The architectural landscape shifted decisively with the rise of transformer-based pretraining frameworks. Among the earliest to demonstrate the viability of deep cross-modal fusion was Unicoder-VL [8], which adopted a single unified Transformer encoder processing image region features and language tokens jointly from the first layer onward, with three pretraining objectives — Masked Language Modeling, Masked Object Classification, and Visual-Linguistic Matching. On Flickr30K, Unicoder-VL achieved 86.2% and 71.5% R@1 for sentence and image retrieval respectively, with gains from pretraining proportionally largest on lower-resource benchmarks — an empirical signature suggesting that the learned representations capture generalisable structure rather than task-specific surface patterns [8]. The unified encoder design, however, carries an architectural cost: both modalities must be simultaneously present at inference, limiting scalability in asymmetric retrieval scenarios.

LXMERT [7] introduced a more modular alternative through three interacting transformer modules — a language encoder, an object-relationship encoder, and a cross-modality encoder — that supported bidirectional cross-attention between the two modalities. This design allowed masked tokens in one modality to be reconstructed using information from the other, enabling genuinely cross-modal pretraining objectives rather than independent unimodal pretraining followed by fusion [7]. The practical consequences were substantial: LXMERT achieved state-of-the-art results on VQA [9] and GQA and produced particularly striking improvements on NLVR², with gains of 22 percentage points in accuracy and 30 points in consistency over prior methods [7]. The strength of these results on VQA — a benchmark requiring joint reasoning over image content [9] and natural language questions — is especially notable given that such tasks demand fine-grained cross-modal alignment rather than coarse retrieval-style matching. These results established bidirectional cross-attention as a powerful mechanism for multimodal interaction and validated pretraining at scale as a viable paradigm for vision-language understanding.
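The bidirectional cross-attention at the heart of such designs is simple to state: each modality's tokens serve as queries while the other modality supplies the keys and values. The numpy sketch below is a deliberately weight-free, single-head version on random toy features (real models add learned query/key/value projections and multiple heads); it shows only the core computation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention without learned projections:
    each query position attends over the other modality's positions."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(2)
text = rng.normal(size=(4, 8))     # 4 text tokens, dim 8 (toy sizes)
vision = rng.normal(size=(6, 8))   # 6 region features, dim 8
# Bidirectional: each modality queries the other.
text_ctx = cross_attention(text, vision)
vision_ctx = cross_attention(vision, text)
```

Because the attention weights are a convex combination, each contextualised text token lies inside the range of the visual features it attended over, which is what lets masked tokens in one modality be reconstructed from the other.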

The subsequent survey by [4] systematized the architectural diversity that had emerged in the years following these foundational models, providing a geometrically motivated account of transformer-based multimodal learning in which tokenized inputs are treated as nodes in fully-connected graphs. Six distinct self-attention variants for multimodal interaction were catalogued: early summation, early concatenation, hierarchical attention in multi-to-one-stream and one-to-multi-stream configurations, cross-attention, and cross-attention to concatenation [4]. The survey’s central architectural tension — whether early fusion, late fusion, or cross-attention designs are optimal — remains unresolved; [4] explicitly identify no definitive winner across tasks, a finding that reflects genuine context-sensitivity in how modalities should be combined rather than an absence of progress. A complementary survey by [3] documents this tension from a different angle, distinguishing single-stream architectures — in which all modalities share attention layers from the first layer — from cross-stream (dual-encoder) designs that maintain modality-specific representational pathways with targeted interaction points. The single-stream approach, exemplified by Unicoder-VL, achieves tight cross-modal fusion at every layer but sacrifices modularity; the cross-stream approach, exemplified by contrastive dual-encoder models [64], preserves the reusability of unimodal pretrained encoders and reduces data requirements for multimodal alignment, but may sacrifice the fine-grained cross-modal interactions needed for complex reasoning [3]. This taxonomic work connects directly to the co-learning and fusion challenges described by [58] and refined by [62], making visible how conceptual categories have been operationalized across a diverse architectural literature.

Extending Beyond Static Images: Video-Language Pretraining

The extension of vision-language pretraining to video introduced challenges that flat image-text architectures were ill-equipped to handle. Video is inherently temporal and multi-granular: understanding a video requires simultaneously aligning local frame-subtitle correspondences and integrating these local signals into a coherent global narrative. HERO [65] addressed this through a hierarchical two-stage architecture: a Cross-modal Transformer first aligns each video frame with its surrounding subtitle tokens, producing locally fused representations, which are then processed by a Temporal Transformer that captures global temporal dependencies across the full video sequence. Critically, HERO introduced Frame Order Modeling — a pretraining objective requiring the model to predict the correct temporal ordering of shuffled video frames — forcing the development of an internal representation of temporal structure rather than treating video as an unordered bag of frames [65]. Ablations confirmed that both hierarchical stages were necessary, with flat BERT-like encoders proving insufficient across six downstream benchmarks spanning retrieval, question answering, inference, and captioning [65].

A contrasting philosophy was introduced by VideoCLIP [64], which adopted the dual-encoder contrastive framework — separate Transformer encoders for video and text aligned through symmetric InfoNCE loss [64] — sacrificing deep cross-modal attention for scalability and modality-independent encoding at inference. This design aligns with the broader dual-encoder paradigm surveyed across multimodal transformer architectures, wherein representational efficiency at deployment is prioritised over interaction depth during encoding [4]. Two methodological contributions distinguished VideoCLIP: overlapping video-text clip construction, which improved retrieval performance by approximately four percentage points by allowing looser temporal correspondence between paired segments, and retrieval-augmented hard negative mining that sharpened representations by preventing reliance on trivially dissimilar negatives [64]. The result was a model capable of zero-shot transfer across video retrieval, question answering, action segmentation, and localisation, outperforming prior supervised baselines in several settings. The progression from Unicoder-VL and LXMERT through HERO to VideoCLIP thus reflects not merely a modality shift from images to video but a sustained architectural negotiation between representational richness and deployment efficiency — a trade-off that surveys of large-scale multimodal pretraining identify as a defining tension in the field [3] and that each new application context reweights.
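The symmetric contrastive objective used by such dual-encoder models can be sketched in a few lines: normalise both batches of embeddings, form the pairwise similarity logits, and average the video-to-text and text-to-video cross-entropies with the matching pairs on the diagonal as positives. A toy numpy version (random embeddings, illustrative only; the temperature value is a common default, not taken from the paper):

```python
import numpy as np

def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: the mean
    of video->text and text->video cross-entropy, with the matching
    pair on the diagonal treated as the positive."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    def xent_diag(l):
        l = l - l.max(axis=1, keepdims=True)              # stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(3)
t = rng.normal(size=(8, 32))
# Well-aligned pairs yield a low loss; mismatched pairs a high one.
aligned = symmetric_info_nce(t + 0.01 * rng.normal(size=t.shape), t)
shuffled = symmetric_info_nce(t[::-1].copy(), t)
```

Because the loss depends only on the final pooled embeddings, the two encoders can be run independently at inference time, which is precisely the deployment efficiency the dual-encoder design buys.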

Transformer architectures have also been applied beyond the canonical image-text and video-text settings. [66] survey the deployment of transformers and adversarial learning frameworks in image-to-audio conversion, image captioning, and speech synthesis, demonstrating that the architectural principles developed for vision-language tasks transfer into multimodal accessibility contexts. Image-to-speech conversion, in particular, integrates computer vision and natural language processing pipelines to generate auditory representations of visual scenes, with transformers increasingly supplanting earlier recurrent and convolutional approaches [66]. This extension broadens the scope of the vision-language infrastructure beyond static image-text alignment into temporal and acoustic domains, highlighting underexplored dimensions of multimodal translation as defined by [58].

Evaluation Metrics: From Reference-Based to Multimodal Human-Feedback Approaches

The question of how to measure vision-language alignment quality has undergone perhaps the most significant recent evolution. For many years, reference-based metrics such as CIDEr, METEOR, and SPICE dominated captioning evaluation, operating under the assumption that multiple ground-truth captions were necessary to provide a reliable signal. These metrics each operationalize caption quality differently — CIDEr via TF-IDF-weighted n-gram consensus, METEOR via stemmed unigram alignment with synonym matching, and SPICE via semantic propositional content parsed from scene graphs — yet all share the fundamental dependency on reference text availability [67]. CLIPScore [67] challenged this assumption directly by demonstrating that cosine similarity between CLIP image and caption embeddings, computed without any reference text, consistently outperformed reference-based metrics in correlation with human judgments across standard captioning benchmarks. The reference-augmented variant, RefCLIPScore, achieved even higher correlations when references were available, but the headline finding — that reference-free evaluation is not only viable but superior — represented a meaningful shift in methodological consensus [67].
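The CLIPScore computation itself is lightweight: embed the image and the candidate caption with CLIP, then take a rescaled, zero-clipped cosine similarity (the paper rescales by w = 2.5). The sketch below substitutes random vectors for real CLIP embeddings, so only the scoring logic, not the numbers, is meaningful.

```python
import numpy as np

def clipscore_like(image_emb, caption_emb, w=2.5):
    """Reference-free score in the style of CLIPScore: a rescaled,
    zero-clipped cosine similarity between the two embeddings.
    (Embeddings here are placeholders, not real CLIP outputs.)"""
    cos = float(image_emb @ caption_emb /
                (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))
    return w * max(cos, 0.0)

rng = np.random.default_rng(4)
img = rng.normal(size=64)
good_cap = img + 0.1 * rng.normal(size=64)   # closely aligned "caption"
bad_cap = rng.normal(size=64)                # unrelated "caption"
s_good = clipscore_like(img, good_cap)
s_bad = clipscore_like(img, bad_cap)
```

The zero-clipping matters: negative cosine similarities, which are common for unrelated pairs, are all mapped to a score of zero rather than being allowed to rank below one another.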

More recent work has pushed evaluation further by incorporating large-scale human feedback directly into metric training. Polos [68] extends the multimodal evaluation paradigm by learning from 131,000 human judgments through a supervised framework that uses parallel contrastive learning embeddings over image-caption pairs. The M2LHF framework underlying Polos specifically targets two limitations of CLIPScore-style metrics: sensitivity to hallucinations and difficulty generalizing across diverse image and text distributions [68, 69]. By achieving state-of-the-art performance across multiple captioning evaluation benchmarks, Polos demonstrates that the direction established by CLIPScore — leveraging multimodal representations directly — is extensible through human feedback rather than having reached a ceiling. The progression from reference-based metrics through CLIPScore to Polos thus traces a coherent methodological arc: decoupling evaluation from reference texts, then grounding it in human preference signals.

Persistent Gaps and Outstanding Questions

Despite these advances, important limitations remain. Neither CLIPScore [67] nor Polos [68] is designed to assess grounding fidelity — whether a generated caption accurately reflects the spatial, relational, and attributional structure of an image — as opposed to surface-level semantic alignment. Benchmarks such as VALSE [10], which probe linguistic phenomena including existence, plurality, and spatial relations, and Visual Spatial Reasoning [70], which systematically tests relational grounding across a controlled set of spatial predicates, reveal that standard metrics routinely fail to penalise models that capture global semantics while misrepresenting fine-grained structure. Datasets with fine-grained region-to-word links, of the type partially supplied by Visual Genome [63], remain sparse relative to the scale of image-caption training data [18]. The architectural variants catalogued by [4] and [3] cannot yet be compared fairly because standardized evaluation protocols that control for pretraining data scale are absent. The quantification challenge highlighted by [62] — developing principled methods for measuring each modality’s contribution to a multimodal prediction — remains largely unaddressed in standard evaluation practice, though recent modality contribution metrics discussed in Section 7 offer a partial remedy. Finally, benchmarks for temporal and causal multimodal understanding — extending beyond the static image-text pairs that anchor most existing infrastructure — remain underdeveloped [66, 4, 65], with video-language models such as those surveyed in [12] still lacking rigorous protocols for evaluating causal and world-model reasoning. These gaps define the active frontier where architectural, data, and evaluation innovations must develop in concert.

7. Grounding Failures, Spurious Correlations, and the Limits of Multimodal Understanding

Early work in visual question answering tended to evaluate model performance through aggregate accuracy metrics, a practice that obscured a troubling possibility: that models were answering correctly for fundamentally wrong reasons. The literature on grounding failures, spurious correlations, and the limits of multimodal understanding has grown into a sustained interrogation of whether vision-language systems genuinely perceive and reason about visual content, or whether they exploit statistical regularities in training data that happen to correlate with correct answers. The papers gathered under this theme collectively expose the fragility of apparent competence — across spatial biases, temporal reasoning, compositional generalization, and cross-modal alignment — while also proposing causal, contrastive, and mechanistic methods to move toward more reliable grounding. Crucially, the most recent work extends the diagnosis beyond behavioural observation: probing classifiers, modality contribution metrics, and representational diagnostics now identify the specific architectural loci at which grounding breaks down, transforming failure analysis from description to causal explanation.

Fundamental Visual Robustness Failures

A foundational concern is whether large vision-language models (LVLMs) respond consistently to the same semantic content when it appears under different visual conditions. The answer, emerging clearly from recent diagnostic work, is that they do not. [11] introduce V²R-Bench, a systematic evaluation framework that subjects LVLMs to controlled variations in object position, scale, and orientation. Their findings are striking in two respects. First, models exhibit a counterintuitive position bias favouring peripheral regions — objects presented near image edges are identified more reliably than those at the image centre, inverting any expectation based on training image statistics where salient objects typically occupy central positions [63]. Second, the models display what the authors describe as a human-like acuity threshold: performance degrades sharply when object size falls below a critical visual angle, paralleling biological acuity limits but arising from entirely different mechanisms rooted in patch tokenisation and training data distributions [4] rather than retinal architecture.

These findings matter because they reveal that LVLM responses are conditioned on low-level visual properties — spatial location and scale — that should, in principle, be irrelevant to semantic recognition tasks. When accuracy shifts as a function of where or how large an object appears rather than what the object is, the model is responding to spurious co-variates rather than stable visual concepts. This pattern is consistent with broader evidence that vision-language models exploit dataset-specific statistical regularities rather than acquiring genuine perceptual invariance [71, 12]. [11] frame this as a robustness deficit requiring architectural and training-level remedies, but the findings equally suggest that standard benchmark accuracy, which averages over such biases, is an unreliable proxy for genuine visual understanding [10, 17].

Spurious Cross-Modal Correlations in Video Question Answering

The problem of spurious correlation becomes considerably more acute in the temporal domain, where models must integrate visual evidence across frames rather than processing a single image. [72] provide a foundational diagnosis of this problem in the context of event-level video question answering, arguing that existing VQA systems rely on cross-modal co-occurrence statistics — superficial associations between linguistic question patterns and visual feature distributions — rather than genuinely modelling the causal, temporal, and dynamic structure of video events. Their analysis identifies a two-fold failure: first, models collapse temporal event sequences into static feature aggregations, losing the causal ordering that distinguishes, say, a fall causing an injury from an injury preceding a fall; second, the cross-modal binding between language and vision is mediated by correlational shortcuts rather than causal dependencies. This tendency to exploit statistical regularities rather than ground responses in visual content is consistent with the broader finding that VQA models can achieve high accuracy by leveraging linguistic priors alone, even when visual input is absent or uninformative [9, 17].

This diagnostic picture was extended and operationalised in [73], which introduces NExT-GQA — a dataset of 10.5K temporally grounded question-answer pairs designed to assess whether models can not only answer correctly but identify where in the video their answer is evidenced. The results are sobering: models that achieve strong performance on standard VideoQA benchmarks collapse on the grounding-required variant, revealing an enormous gap between surface accuracy and verifiable visual justification. Post-hoc attention analysis further demonstrates that correct answers are frequently supported by attention weights distributed over temporally irrelevant frames — the model is, in effect, guessing correctly while looking in the wrong place. This constitutes a direct empirical demonstration that benchmark success does not imply genuine multimodal understanding, a conclusion reinforced by parallel evaluations showing that vision-language models exhibit systematic robustness failures under even fundamental visual perturbations [11], with significant implications for how evaluation validity is assessed across the field.

Representational Diagnostics: Mechanistic Probing of Grounding Deficits

The behavioural failures documented above — positional biases, scale sensitivity, reliance on correlational shortcuts — describe what vision-language models get wrong. A complementary and rapidly maturing body of work has turned inward, using probing classifiers [16], modality contribution metrics, and controlled diagnostic benchmarks to ask why these failures occur and where in the model architecture they originate. The resulting picture reveals that grounding deficits are not undifferentiated — they are traceable to specific architectural bottlenecks, modality imbalances, and representational compression failures that standard evaluation obscures.

The earliest systematic effort to decompose end-task performance into contributions from specific linguistic and perceptual phenomena was the VALSE benchmark [10], which constructed minimal-pair image-caption contrasts — foils — across six constructs: existence, plurality, counting, spatial relations, actions, and coreference. The results were unambiguous: no tested model, including the strongest overall performer, reliably handled the full range of phenomena. Models performed reasonably on existence tasks involving named object detection but failed substantially on plurality, counting, and coreference, suggesting that these capacities depend on representational resources that contemporary architectures either lack or exploit inconsistently [10]. VALSE’s contribution was as much methodological as empirical: by targeting phenomena rather than tasks, it established a principled language for discussing representational gaps that aggregate benchmarks systematically conceal.
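Operationally, the foil methodology reduces to a pairwise test: an instance counts as solved only when the model scores the correct caption strictly above its minimally edited foil, so chance performance sits at 50%. A minimal sketch with hypothetical image-text alignment scores:

```python
def pairwise_foil_accuracy(scores):
    """VALSE-style pairwise evaluation: a model passes an instance only
    if it scores the true caption above its minimal-pair foil.
    `scores` is a list of (caption_score, foil_score) pairs."""
    return sum(c > f for c, f in scores) / len(scores)

# Three hypothetical minimal pairs; the model fails the second one.
scores = [(0.82, 0.41), (0.55, 0.61), (0.70, 0.69)]
acc = pairwise_foil_accuracy(scores)
```

Note that the third pair passes by a margin of only 0.01: pairwise accuracy records the comparison's direction but not its confidence, which is one reason foil benchmarks are often paired with score-margin analyses.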

The Visual Spatial Reasoning (VSR) benchmark deepened this agenda by focusing exclusively on spatial relations — a capacity that presupposes both object localisation and the encoding of inter-object geometry [70]. Using contrastive captions derived from MS COCO images, VSR documented a striking performance gap: human accuracy exceeded 95%, whereas all tested VLMs failed to reach 70%, a deficit that proved robust across model families. Importantly, VSR identified a structural factor contributing to this gap: models equipped with explicit positional encodings (LXMERT [7] and ViLT) outperformed VisualBERT by more than ten percentage points, confirming that spatial reasoning is not an emergent free-rider on general multimodal training but instead depends on architectural commitments that encode positional structure [70]. This finding foreshadows a broader pattern: representational capacities track specific inductive biases rather than emerging automatically from scale [4].

Turning from what is represented to how much each modality contributes, MM-SHAP [17] introduced Shapley-value decomposition [74] from cooperative game theory to quantify each modality’s marginal contribution to a model’s predictions. The metric revealed that unimodal collapse — over-reliance on a single modality — is both common and asymmetric in direction. CLIP displayed the most balanced integration, with text contribution values consistently near 50% across tasks, while LXMERT exhibited a visual bias and ALBEF variants leaned textual [17]. These findings undermine the assumption that multimodal models are uniformly biased in one direction and underscore that architectural and training choices shape modality reliance in task- and model-specific ways. By decoupling contribution measurement from task performance, MM-SHAP also demonstrated that a model can achieve acceptable accuracy while relying almost entirely on one modality — a pathological regime that surface-level benchmarks would miss entirely. This directly instantiates the quantification challenge identified by [62] as one of the field’s most undertheorised problems, and demonstrates that principled modality attribution is not merely a theoretical desideratum but a practical diagnostic tool.
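The underlying Shapley computation averages each input's marginal contribution to the prediction over all coalitions of the remaining inputs. The toy sketch below enumerates coalitions exactly for three hypothetical "players" (two text tokens, one image patch), with a hand-written additive value function standing in for the masking-based model evaluations that MM-SHAP actually performs.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating all coalitions
    (tractable only for small player sets)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(len(others) + 1):
            for coal in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += weight * (value_fn(set(coal) | {p})
                                    - value_fn(set(coal)))
    return phi

def confidence(coalition):
    """Hypothetical model confidence given which inputs are unmasked:
    this toy model leans heavily on the text tokens."""
    score = 0.0
    if "t1" in coalition: score += 0.4
    if "t2" in coalition: score += 0.4
    if "v1" in coalition: score += 0.2
    return score

phi = shapley_values(["t1", "t2", "v1"], confidence)
text_share = (phi["t1"] + phi["t2"]) / sum(phi.values())
```

Aggregating the per-token values by modality, as in the last line, yields the kind of modality-contribution percentage the metric reports; Shapley's efficiency property guarantees the values sum exactly to the full-input prediction.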

A complementary mechanistic account located a different failure point: the text encoder itself. Research on text encoder bottlenecks [15] introduced diagnostic datasets comprising 18,100 increasingly compositional text prompts to test whether CLIP-style single-vector text encoders can even represent compositional information before any vision-language integration occurs. The answer was largely negative. Text encoders were found to act as representational chokepoints, systematically suppressing information about spatial relations, temporal relations, negations, and numerical quantifiers when compressing sentences into fixed-length vectors [15]. A key asymmetric regularity emerged: when text reconstruction failed, multimodal failure followed in 96% of cases, but successful reconstruction did not guarantee multimodal success — indicating that text encoding is necessary but insufficient for compositional grounding. This result reframes the locus of explanation away from training data quantity or pretraining scale and toward architectural choices that precede multimodal fusion, suggesting that no amount of additional contrastive training on existing CLIP-style designs will resolve compositionality failures rooted in the encoder’s representational capacity. The finding connects directly to the visual robustness failures documented by [11]: if the text encoder cannot preserve the compositional distinction between “a red cube on a blue sphere” and “a blue cube on a red sphere,” then no amount of visual processing downstream can recover the lost information. This bottleneck is further consistent with mechanistic interpretability analyses [28] showing that fixed-dimensional sentence embeddings compress away relational and order-sensitive information under contrastive training pressure.

The most recent probing work has pushed the frontier toward evaluating whether VLMs possess anything resembling an internal world model — a structured representation of objects, their properties, causal relations, and spatiotemporal dynamics. The WM-ABench evaluation framework [12] decomposed world modelling into perception and prediction sub-components, testing capacities ranging from basic colour and shape recognition through 3D spatial reasoning and motion trajectory prediction. VLMs fell substantially below human performance across all dimensions, with particularly acute failures in 3D spatial reasoning and temporal/motion understanding, where some models approached chance-level performance [12]. A diagnostically important finding was the systematic entanglement of perceptual dimensions: models failed to represent colour and shape independently, with these properties confounded in ways that propagated errors across tasks. This entanglement suggests that VLMs do not maintain the factored, compositional internal representations that would be necessary to support genuine causal or predictive reasoning about the visual world — an absence traceable to the same representational compression that [15] identified at the text encoder level, but here manifesting on the visual side as well.

Parallel work on holistic shape processing approached representational structure from a different angle, exploiting visual anagrams — image pairs that share identical local patch statistics but differ in global configural organisation — to isolate sensitivity to long-range spatial structure [71]. The resulting Configural Shape Score (CSS) revealed a pronounced divergence among architectures: self-supervised and language-aligned Vision Transformers (DINOv2, SigLIP, EVA-CLIP) approached human-level configural sensitivity, while convolutional networks and some supervised ViTs lagged considerably. Mechanistic analysis identified long-range cross-patch attention — spanning more than two patches — as necessary and sufficient for high CSS, with a characteristic processing pivot occurring at intermediate network depths [71]. This attention-based account integrates with the spatial reasoning findings from VSR [70]: models that lack structural mechanisms for binding spatially separated information will predictably fail tasks requiring relational or configural processing, regardless of scale.
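
The role of long-range attention can be illustrated with a simple attention-distance statistic. The function below is an assumed, simplified proxy rather than the CSS metric itself: it computes the attention-weighted mean spatial distance between patches, which cleanly separates purely local from global attention patterns on a toy 2x2 patch grid.

```python
import math

def mean_attention_distance(attn, grid_w):
    """Attention-weighted mean spatial distance (in patch units) between each
    query patch and the patches it attends to. Rows of `attn` sum to 1."""
    n = len(attn)
    total = 0.0
    for q in range(n):
        qx, qy = q % grid_w, q // grid_w
        for k in range(n):
            kx, ky = k % grid_w, k // grid_w
            total += attn[q][k] * math.hypot(qx - kx, qy - ky)
    return total / n

# A 2x2 patch grid: self-only (local) attention vs uniform global attention.
n = 4
local = [[1.0 if q == k else 0.0 for k in range(n)] for q in range(n)]
uniform = [[1.0 / n] * n for _ in range(n)]
d_local = mean_attention_distance(local, grid_w=2)    # 0.0
d_uniform = mean_attention_distance(uniform, grid_w=2)
```

A model whose heads never push this statistic beyond a couple of patches lacks, by construction, the long-range binding that the configural-shape analysis identifies as necessary.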

Taken together, these representational diagnostics establish a crucial analytical point: grounding failures documented at the behavioural level are not monolithic but decompose into identifiable architectural pathologies. Text encoder compression eliminates compositional distinctions before fusion occurs [15]; modality imbalance means that one signal dominates prediction even when both are nominally available [17]; perceptual dimension entanglement prevents factored attribute representation [12]; and insufficient long-range attention blocks configural and relational processing [71]. Each pathology suggests a distinct remedial intervention — structured or multi-vector text encoding, modality-balanced training objectives [4], disentangled visual representations, and attention architectures with explicit long-range connectivity — rather than the generic prescription of more data or larger models. The mechanistic evidence thus transforms the grounding failure literature from a catalogue of shortcomings into a diagnostic map with actionable engineering implications.

Compositional Generalization Failures as Diagnostic Grounding Probes

As established in Sections 3 and 4, standard neural architectures face a fundamental difficulty in achieving systematic compositional generalization — recombining known elements in novel configurations according to productive rules. What the multimodal grounding literature adds is a direct mapping from those formally documented failures to the specific ways VLMs break down in vision-language settings, and a set of mechanistic linkages that connect architectural bottlenecks to particular compositional dimensions.

The multimodal relevance of the COGS findings [13] and the five-dimensional compositionality framework [14] — already examined in detail in earlier sections — becomes sharpest when read alongside the representational diagnostics above. The distinction between lexical and structural generalization established by [13] maps directly onto VLM failure modes: recognising known objects in novel spatial configurations is the visual analogue of structural generalization [70], and it is precisely here that multimodal models fail most acutely. The text encoder bottleneck documented by [15] provides the mechanistic explanation: if compositional distinctions are lost during text encoding, the downstream model cannot even attempt structural generalization over linguistic input, regardless of its visual processing capacity. This links a formally characterised compositional failure mode to a specific architectural locus within the multimodal pipeline.

The five-dimensional decomposition of compositionality [14] similarly illuminates the pattern of VLM failures with greater precision than aggregate accuracy allows. A model may succeed on substitutivity — correctly binding known attributes to known objects — while failing on productivity, the capacity to generalise that binding to novel compositions. The perceptual dimension entanglement documented by [12], in which colour and shape are not independently represented, instantiates exactly this failure at the representational level: if these properties are confounded in the model’s visual encoding, recombination across novel colour-shape pairings becomes impossible in principle, not merely difficult in practice. Architectural analyses [22] confirm that no single modification to Transformer design resolves all compositional dimensions simultaneously, and the cross-dataset dissociation reported by [75] — where gains on one compositional split fail to transfer to analogous splits elsewhere — suggests that benchmark-specific grounding improvements in multimodal settings may be similarly fragile.
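
The productivity probe described here amounts to a particular way of splitting data: every primitive attribute and object is seen in training, but specific combinations are withheld for test. A minimal sketch, with hypothetical colour and shape vocabularies:

```python
from itertools import product

def compositional_split(colours, shapes, held_out):
    """Hold out specific colour-shape *combinations* while keeping every
    individual colour and shape observed in training: a productivity probe."""
    all_pairs = set(product(colours, shapes))
    test = set(held_out)
    train = all_pairs - test
    # If a primitive never appears in training, the probe is ill-posed.
    assert {c for c, _ in train} == set(colours)
    assert {s for _, s in train} == set(shapes)
    return sorted(train), sorted(test)

train_pairs, test_pairs = compositional_split(
    ["red", "blue"], ["cube", "sphere"], held_out=[("red", "sphere")])
```

A model with entangled colour-shape representations, as documented by [12], can fit the training pairs yet has no factored basis from which to recombine "red" and "sphere" at test time.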

At the instance level, [76] identified the mechanistic driver of compositional difficulty as the presence of unobserved local structures in the target representation, with different architectures agreeing on instance-level difficulty at rates of 93–97%. This has direct implications for multimodal grounding: a VLM’s failure to interpret “the small red cube behind the large blue sphere” may reflect not a perceptual deficit per se, but the absence of that specific relational configuration from the training distribution — a compositional gap that scale alone cannot bridge. Spatial relation benchmarks corroborate this account, revealing systematic failures precisely on novel relational combinations absent from pretraining data [70].
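
The "unobserved local structures" signal can be approximated, for illustration, with contiguous n-grams; the sketch below is a simplification of the instance-level analysis in [76], not its implementation.

```python
def local_structures(tokens, n=2):
    """Contiguous n-grams as a stand-in for the 'local structures' of a target."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def unobserved_structures(train_sentences, test_sentence, n=2):
    """Local structures of a test instance that never occur anywhere in
    training: the instance-level difficulty signal discussed above."""
    seen = set()
    for s in train_sentences:
        seen |= local_structures(s.split(), n)
    return local_structures(test_sentence.split(), n) - seen

# Every word in the test sentence is familiar, but one bigram is novel.
gaps = unobserved_structures(
    ["the red cube", "the blue sphere"], "the red sphere")
```

An instance with a non-empty gap set is predicted to be hard regardless of architecture, which is consistent with the 93-97% cross-architecture agreement on instance difficulty.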

The most theoretically consequential contribution comes from [23], who demonstrated that a meta-learning approach — training on dynamically generated few-shot compositional tasks — enabled standard architectures to achieve human-level compositional accuracy while reproducing characteristic human error patterns. For the multimodal grounding debate, this finding is significant: it suggests that persistent compositional failures in VLMs may reflect the passive, associative character of image-caption pretraining more than architectural limitations. This conclusion resonates with the Words As Social Tools hypothesis [42] and its emphasis on active, interactive acquisition as a prerequisite for robust concept formation — a view corroborated by developmental and robotic grounding research showing that symbol emergence depends critically on embodied, interactive experience [19] — and it finds formal expression in the algebraic standards articulated by [25] and [31], whose structure-preserving formalisms describe a level of compositional systematicity that current vision-language models do not approach.

The compositional generalization literature thus reframes the grounding failures documented throughout this section. Where visual robustness failures [11] reveal sensitivity to irrelevant perceptual variables, video grounding failures [73] reveal the substitution of correlational shortcuts for causal reasoning, and mechanistic probing reveals the specific architectural loci at which compositional information is lost [15, 17, 12], compositional benchmarks supply the formal vocabulary to diagnose these failures with precision: as deficits in specific dimensions — systematicity, productivity, substitutivity — that require distinct interventions rather than generic increases in model scale or training data.

Causal Intervention as a Path Forward

If spurious correlations are the diagnosis, causal intervention methods represent the field’s primary proposed therapy. [72] respond to their own diagnostic findings by proposing CMCIR, a framework that applies both front-door and back-door causal adjustment operations — formalisms rooted in Pearl’s do-calculus for non-parametric identification of causal effects [77] — across three interacting modules designed to discover latent causal structures between visual and linguistic modalities. The front-door criterion, in particular, is used to estimate the effect of cross-modal causal pathways while blocking confounding routes — a principled attempt to distinguish genuine visual evidence from correlated noise. Empirical results show improved performance on event-level QA tasks relative to correlation-based baselines, though the improvement necessarily raises a deeper question: does statistical debiasing via causal graph surgery constitute genuine grounding, or merely a more sophisticated form of statistical fitting that happens to be better calibrated?
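
The back-door adjustment underlying such frameworks has a compact closed form, P(Y | do(X=x)) = Σ_z P(Y | x, z) P(z). The numbers below are hypothetical and the variables binary; the point is only to show how adjustment removes the confounding that a raw conditional absorbs, not to reproduce CMCIR's neural modules.

```python
# Toy joint distribution: binary Z (confounder), X (cue used), Y (answer correct).
P_z = {0: 0.5, 1: 0.5}
P_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}             # P(X=x | Z=z)
P_y_given_xz = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.7}  # P(Y=1 | x, z)

def p_y_obs(x):
    """Observational P(Y=1 | X=x): Z is weighted by P(Z | X=x), so the
    confounder's influence leaks into the estimate."""
    px = sum(P_z[z] * P_x_given_z[z][x] for z in P_z)
    return sum(P_z[z] * P_x_given_z[z][x] * P_y_given_xz[(x, z)] for z in P_z) / px

def p_y_do(x):
    """Back-door adjusted P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z)."""
    return sum(P_y_given_xz[(x, z)] * P_z[z] for z in P_z)

obs_effect = p_y_obs(1) - p_y_obs(0)   # inflated by confounding (~0.48)
causal_effect = p_y_do(1) - p_y_do(0)  # the adjusted effect (0.20)
```

The gap between the two effect estimates is exactly the spurious component that correlation-based training exploits; the adjustment assumes, of course, that the confounder Z is correctly identified.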

This philosophical tension remains unresolved, and it surfaces again in [78], which proposes the Cross-modal Causal Relation Alignment (CRA) framework for video question grounding. CRA extends the causal intervention paradigm by incorporating bidirectional contrastive learning to align cross-modal representations [4]: the framework explicitly pushes apart spurious cross-modal pairings while drawing together causally relevant ones. By operating at the representational level rather than purely at the output distribution level, CRA offers a more architecturally integrated approach to spurious correlation suppression than post-hoc debiasing — an important distinction given that output-level debiasing has been shown to leave representational shortcuts intact [11]. [78] report improvements in robustness and generalisation on VideoQG benchmarks, and the bidirectional contrastive design is notable for its potential to catch asymmetric alignment failures that unidirectional contrastive methods would miss [64].
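
A bidirectional contrastive objective of the general kind CRA builds on can be sketched as a symmetric InfoNCE loss. This is the generic CLIP-style formulation, not CRA's specific objective, and it assumes unit-normalised embeddings.

```python
import math

def _mean_xent(logits):
    """Mean cross-entropy where row i's correct column is i (the matched pair)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable logsumexp
        total += log_z - row[i]
    return total / len(logits)

def symmetric_infonce(video, text, tau=0.07):
    """Bidirectional contrastive loss over matched (video_i, text_i) pairs;
    embeddings are plain lists of floats, assumed unit-normalised."""
    n = len(video)
    logits = [[sum(a * b for a, b in zip(video[i], text[j])) / tau
               for j in range(n)] for i in range(n)]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]  # transpose
    # Averaging both directions penalises alignment failure either way,
    # which a unidirectional objective would not.
    return 0.5 * (_mean_xent(logits) + _mean_xent(logits_t))

aligned = symmetric_infonce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_infonce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Perfectly matched pairs drive the loss toward zero while swapped pairings are heavily penalised in both retrieval directions, which is the property that lets a symmetric design catch asymmetric alignment failures.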

Gaps, Tensions, and Open Questions

Taken together, the works reviewed in this section establish that grounding failures are neither marginal nor benchmark-specific — they appear across image and video modalities, affect both spatial and temporal reasoning, manifest as failures of compositional systematicity in controlled linguistic settings, and persist in models large enough to achieve state-of-the-art accuracy. The mechanistic probing literature adds a further dimension: these failures are not merely observable at the output level but are traceable to identifiable representational bottlenecks — text encoder compression [15], modality imbalance [17], perceptual entanglement [12], and insufficient long-range attention [71] — that suggest specific remedial interventions rather than undirected scaling. Nevertheless, significant gaps remain. The diagnostic benchmarks introduced by [73] and [11] are confined to VQA and VideoQA settings; whether analogous grounding failures manifest in dialogue, instruction following, or long-form reasoning remains largely uninvestigated. The probing methodologies developed in [10], [70], and [17] have been applied predominantly to encoder-based architectures (CLIP, LXMERT [7], ViLBERT); their extension to the autoregressive generative VLMs that now dominate deployed systems [3, 4] remains a critical open frontier. The interaction between model scale and grounding robustness is also poorly understood: it is an open empirical question whether larger models exhibit systematically fewer spurious correlations or simply different ones that are harder to detect — a concern reinforced by findings that compositional generalisation failures persist even as model capacity increases [14, 23].

The finding that benchmark-specific performance gains in compositional settings do not transfer across datasets [75] suggests that apparent progress on specific grounding benchmarks may similarly fail to generalise — a possibility the field has not adequately tested. Work on unobserved local structures further demonstrates that such failures are structural rather than incidental, arising from systematic gaps between training distributions and evaluation conditions [76]. The algebraic formalisation of compositional requirements — exemplified by [25]’s treatment of semantic parsing as a homomorphism between syntactic and semantic algebras, and by [31]’s demonstration that syntactic Merge possesses the mathematical structure of a free magma and combinatorial Hopf algebra — establishes a formal standard against which the compositional behaviour of multimodal systems can be measured. Current vision-language models do not approach the kind of structure-preserving mappings that these formalisms describe, a gap that aggregate accuracy metrics systematically conceal.
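
What a structure-preserving mapping demands can be shown in a deliberately small algebra. This toy example is not the formalism of [25] or [31], only an illustration of evaluation as a homomorphism from trees built by a binary merge to integers under addition.

```python
def merge(a, b):
    """Binary syntactic combination: a toy stand-in for a free-magma operation."""
    return ("merge", a, b)

def evaluate(tree):
    """Semantic interpretation. Compositionality as a homomorphism means
    evaluate(merge(a, b)) == evaluate(a) + evaluate(b) for ALL a, b,
    with no exceptions driven by which trees happened to be seen before."""
    if isinstance(tree, int):
        return tree
    _, a, b = tree
    return evaluate(a) + evaluate(b)

expr = merge(merge(1, 2), merge(3, 4))  # evaluates to 10
```

The universally quantified equation is the standard: a learned model meets it only if its outputs on novel trees are fully determined by its outputs on the parts, which is precisely what the transfer failures above show current VLMs do not achieve.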

Perhaps the deepest unresolved tension is whether the causal intervention methods proposed by [72] and [78] constitute genuine advances in grounding or more sophisticated debiasing. Causal graph frameworks assume that the relevant causal structure can be correctly specified — an assumption that the broader causal representation learning literature acknowledges as non-trivial even under controlled conditions [77] — and one that is itself untested against the messy contingency of real-world vision-language data. Finally, all current interventions operate at training time; methods capable of detecting and correcting grounding failures at inference time — without retraining — remain conspicuously absent from the literature, representing a practical frontier whose development will determine whether these theoretical advances translate into reliable deployed systems.

8. Domain-Specific Applications of Vision-Language Models

The deployment of vision-language models in specialized professional domains represents one of the most consequential frontiers in multimodal AI research, raising fundamental questions about whether general-purpose pretraining can substitute for expert domain knowledge, or whether domain-specific adaptation remains indispensable. Medical imaging, remote sensing, clinical neurocognitive assessment, and streaming video analysis have emerged as intensively studied application areas, offering complementary perspectives on how fine-grained visual perception, specialized vocabulary, and high reliability requirements reshape the architecture, training, and evaluation of VLMs. Across these domains, the central tension is the same: the gravitational pull toward ever-larger generalist models collides with substantial empirical evidence that targeted domain adaptation continues to yield decisive performance advantages.

Medical Image Analysis: From Fusion Architectures to General Large Models

Early work in applying language supervision to medical images drew primarily on relatively simple cross-modal fusion strategies, pairing visual encoders with textual descriptions from radiology reports or clinical notes [79]. This foundational phase established that paired image-text data drawn from clinical workflows could serve as a supervisory signal, but it also surfaced an enduring limitation: the scarcity and privacy constraints governing medical imaging datasets create a chronic bottleneck that no architectural innovation can fully circumvent [79]. The practical scale of this bottleneck is substantial — radiological services in major healthcare systems perform tens of millions of procedures annually [80], yet the corresponding imaging data remains largely siloed behind institutional access controls and consent frameworks, limiting the corpora available for model training. The survey by [79], encompassing more than 130 papers, provides a systematic mapping of how the field evolved from these modest origins toward the deployment of general large models, identifying seven principal characteristics of VLMs and tracing their alignment with downstream tasks including classification, segmentation, and automated report generation. Critically, this survey frames the trajectory not as a linear progression but as a negotiation between the expressive power unlocked by scale and the irreducible demands of clinical specificity.

The question of whether scale alone suffices received a pointed empirical answer with the development of CXR-LLaVA [81], a model built upon the LLaVA architecture and equipped with a custom vision transformer trained explicitly on chest radiograph data. On the MIMIC internal test set, CXR-LLaVA achieved an average F1 score of 0.81, substantially outperforming GPT-4-vision (F1: 0.62) and Gemini-Pro-Vision (F1: 0.68) [81]. This performance gap is not marginal — it represents a differential that would be clinically meaningful in real-world screening or triage contexts. Beyond classification metrics, [81] evaluated autonomous reporting success, defined as the proportion of generated reports requiring no revision or only minor changes, finding that CXR-LLaVA achieved a 72.7% autonomous success rate against a ground truth benchmark of 84.0%. The gap between these figures illuminates how much clinical deployment headroom remains even after domain-specific fine-tuning — the model is useful but not yet a drop-in replacement for radiologist oversight. Complementary work on interpretable report generation has further documented that VLMs are prone to hallucinations and clinically significant disagreements with domain experts, and that generated reports may not faithfully reflect the underlying computations of the image encoder — a concern that extends beyond accuracy into the trustworthiness required for clinical deployment [80]. Expert evaluation frameworks for clinical AI similarly underscore that models may achieve surface-level correctness while failing on the deeper reasoning strategies — such as Bayesian inference and heuristic judgment — that characterize expert diagnostic practice [82].

These findings from [81] sit in productive tension with the broader survey landscape documented by [79]. If frontier generalist models like GPT-4-vision underperform a domain-adapted open-weight model by nearly 20 F1 points on chest X-ray interpretation, this challenges the assumption that parameter scale and broader pretraining corpora will eventually render specialization unnecessary. The medical imaging survey contextualizes this by noting that domain-specific image-text datasets remain scarce due to institutional privacy constraints and the high cost of expert annotation, meaning that the data regimes available to domain-specific models are often far smaller than those available to commercial generalists — making the performance inversion even more striking [79].

Clinical Neurocognitive Assessment: Cross-Modal Alignment as Diagnostic Signal

A distinct clinical application of vision-language alignment targets Alzheimer’s disease detection through the analysis of patients’ verbal descriptions of the Cookie Theft picture, a standard neuropsychological elicitation instrument from the Boston Diagnostic Aphasia Examination widely used to probe spontaneous speech and discourse organisation. This task operationalises the well-established clinical observation that the quality, coherence, and semantic content of a patient’s speech about a visual scene serves as a reliable biomarker for cognitive decline — a principle increasingly validated through automated speech-language screening tools that extract lexical, syntactic, and fluency-related features to discriminate between healthy ageing and mild cognitive impairment [83] — and translates it into a computational alignment problem between image regions and transcribed speech segments. [84] leveraged BLIP, a pre-trained vision-language model, to generate image-text similarity features capturing the degree to which a patient’s verbal output semantically corresponds to the visual content of the stimulus picture. These similarity features were then integrated within a bipartite graph neural network that explicitly constructs links between image regions and transcribed sentence segments, making the alignment relationship itself the primary object of downstream classification rather than treating it as an implicit intermediate representation [84].

A diagnostically revealing methodological finding concerns the role of image colorisation: the standard grayscale Cookie Theft picture, when colorised as a preprocessing step, improved classification accuracy by 9.56% — a substantial margin attributable to the distributional mismatch between grayscale inputs and BLIP’s colour-dominated pretraining corpus [84]. This finding carries implications well beyond the specific clinical application: it foregrounds the sensitivity of pre-trained VLMs to low-level visual statistics, connecting directly to the visual robustness concerns documented in Section 7. The lightweight graph convolutional network ultimately achieved 88.73% accuracy on the ADReSSo Challenge dataset — a benchmark specifically designed for Alzheimer’s dementia recognition through spontaneous speech — surpassing the previous state-of-the-art and doing so with a model that prioritises the quality of the cross-modal alignment representation over parameter scale [84]. The approach is notable for its architectural philosophy: by externalising the vision-language correspondence into an inspectable graph topology, it creates a representation that can, in principle, be audited to identify which visual-linguistic misalignments are most diagnostically informative — an interpretability advantage that monolithic end-to-end models do not afford.
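
The structurally explicit alignment idea can be sketched as follows; the embeddings and threshold are hypothetical, and the real system derives its similarity features from BLIP and classifies over a graph neural network rather than inspecting a raw edge list.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bipartite_edges(region_embs, segment_embs, threshold=0.5):
    """Link each image region to each transcript segment whose embedding
    similarity clears a threshold, so the cross-modal alignment becomes an
    explicit, inspectable graph rather than a hidden intermediate."""
    edges = []
    for i, r in enumerate(region_embs):
        for j, s in enumerate(segment_embs):
            sim = cosine(r, s)
            if sim >= threshold:
                edges.append((i, j, round(sim, 3)))
    return edges

# Hypothetical 2-D embeddings: two picture regions, three transcript segments.
regions = [[1.0, 0.0], [0.0, 1.0]]
segments = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
graph = bipartite_edges(regions, segments)
```

Because every edge carries its own similarity score, a clinician or researcher can, in principle, inspect which region-segment misalignments drive a given classification, the auditability advantage the text attributes to this design.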

Remote Sensing: Contrastive Learning, Instruction Tuning, and Geospatial Reasoning

The remote sensing domain presents a structurally parallel set of challenges but within a context where some of the data scarcity constraints are less severe, enabling a different set of experimental investigations. The comprehensive survey by [85] identifies three principal VLM paradigms operating in this space: contrastive learning approaches that align satellite or aerial imagery with textual descriptions [3], instruction-tuned models adapted for geospatial question answering, and text-conditioned image generation systems capable of producing synthetic remote sensing imagery from natural language prompts. Each paradigm reflects distinct architectural assumptions and application targets, ranging from scene classification and object detection to change captioning and spatial reasoning over multi-temporal imagery [85].

A central empirical finding from [85] is that models trained on large-scale remote sensing image-text datasets — specifically datasets such as RS5M and VersaD — substantially outperform those relying on fine-tuning alone. RS5M, for instance, comprises approximately five million remote sensing image-text pairs assembled through semi-automated filtering of publicly available satellite imagery [85], providing a scale of paired supervision rarely achievable in privacy-constrained domains. The contrastive pretraining strategy employed by such datasets mirrors the approach of general-purpose frameworks like CLIP [3], but anchors alignment within the distinctive visual vocabulary of overhead imagery — where spatial scale, viewing angle, and spectral characteristics differ substantially from natural image distributions. This finding directly echoes the medical imaging literature’s emphasis on domain-specific data [79], and together the two surveys converge on a consistent principle: architectural transfer from general pretraining provides a useful initialization, but domain-specific visual knowledge acquisition through large-scale paired data is the primary driver of downstream task performance [85, 79]. The divergence between the two domains lies in the feasibility of constructing such datasets: remote sensing imagery, while requiring expert interpretation, does not carry the same privacy encumbrances as clinical data [79], allowing larger corpora to be assembled and publicly released. Large-scale multimodal pretraining frameworks [3] further contextualise this finding, confirming that data scale and domain alignment jointly determine the ceiling of transferable visual grounding.

Streaming Video Analysis: Temporal Grounding Under Real-Time Constraints

Dense video captioning presents a demanding instantiation of cross-modal alignment that extends beyond the static image-text pairs dominating most VLM research: it requires precise temporal localisation of event boundaries within continuous video streams alongside semantic correspondence between visual frames and natural language descriptions [4]. The CMSTR-ODE framework [86] addresses this through two architectural innovations. First, a Neural ODE-based Temporal Localisation module models event boundary detection as continuous feature evolution rather than discrete frame classification — a choice reflecting growing recognition that temporal structure is fundamentally continuous and benefits from representations that respect this property [65]. Second, a cross-modal memory bank containing 10,000 entries dynamically enriches video feature representations with stored textual context during inference, enabling the system to draw on previously observed event-language associations without re-processing prior video segments and directly enabling streaming throughput of 15 frames per second [86].
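
A cross-modal memory bank of this general shape can be sketched in a few lines; the FIFO eviction, dot-product retrieval, and averaging fusion below are assumptions chosen for illustration, not details reported for CMSTR-ODE.

```python
class CrossModalMemory:
    """Fixed-capacity store of (video_key, text_context) vector pairs. At
    inference, a query video feature is enriched with the stored context of
    its best-matching keys, without re-processing earlier video segments."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = []  # (key, context) pairs, oldest first

    def write(self, key, context):
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # FIFO eviction (an assumption, for illustration)
        self.entries.append((key, context))

    def read(self, query, k=1):
        # Rank stored keys by dot-product similarity to the query.
        ranked = sorted(self.entries,
                        key=lambda e: -sum(a * b for a, b in zip(query, e[0])))
        top = ranked[:k]
        # Fuse by averaging retrieved contexts into the query (also an assumption).
        ctx = [sum(c[d] for _, c in top) / len(top) for d in range(len(query))]
        return [0.5 * (q + c) for q, c in zip(query, ctx)]

mem = CrossModalMemory(capacity=2)
mem.write([1.0, 0.0], [0.2, 0.8])
mem.write([0.0, 1.0], [0.9, 0.1])
enriched = mem.read([1.0, 0.0], k=1)  # pulls in the context stored for [1.0, 0.0]
```

The design choice worth noting is that retrieval cost depends only on memory size, not on video length, which is what makes this kind of externalised alignment compatible with streaming throughput constraints.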

On the ActivityNet Captions benchmark, CMSTR-ODE achieves improvements over prior state-of-the-art across CIDEr, BLEU-4, ROUGE, and METEOR simultaneously — a multi-metric superiority that is meaningful because each metric captures distinct aspects of caption quality: CIDEr rewards consensus with human-generated references, BLEU-4 measures n-gram precision, ROUGE emphasises recall of key content, and METEOR additionally accounts for stemming and synonym overlap [67, 18, 86]. The architectural logic of the cross-modal memory bank deserves particular scrutiny: rather than relying solely on within-sequence attention, CMSTR-ODE externalises part of the alignment computation into a persistent, updatable memory structure, a strategy that resonates with broader findings on the value of explicit cross-modal representations in video-language pre-training [64, 3]. Together with the bipartite graph architecture of [84], this design reflects a broader trend toward structurally explicit alignment mechanisms — representations in which the cross-modal correspondence is encoded in an inspectable, retrievable form rather than diffused across the weights of a monolithic network. The implication for applied systems is substantial: structurally explicit alignment affords clearer intervention points for debugging, domain adaptation, and interpretability auditing [87, 88] — qualities increasingly demanded as multimodal systems move into high-stakes deployment environments [4].

Domain Gap and Transfer Limitations

Both the medical and remote sensing surveys foreground the domain gap problem as the central structural challenge for deploying general-purpose VLMs in expert contexts. In medical imaging, this gap manifests as failures in fine-grained perception — distinguishing subtle radiographic findings that require not only visual acuity but domain-specific visual priors that general pretraining data rarely instantiates [79, 80]. Domain-adapted architectures such as CXR-LLaVA demonstrate that targeted pretraining on clinical imagery can substantially recover this lost capacity, but at the cost of generality [81]. In remote sensing, the gap appears as spatial reasoning failures, particularly in tasks requiring understanding of object scale, geospatial relationships, and the interpretation of multi-spectral or synthetic aperture radar (SAR) imagery that bears little visual resemblance to natural photographs [85]. The broader visual robustness literature confirms that even fundamental low-level visual variations — changes in lighting, orientation, or imaging modality — can substantially degrade VLM performance, suggesting the domain gap is not merely semantic but deeply perceptual [11]. The clinical colorisation finding from [84] provides a particularly granular illustration: even a low-level statistical mismatch between test-time visual input and pretraining distribution can produce nearly 10% accuracy degradation, a sensitivity that connects directly to the visual robustness failures documented across general-purpose VLMs in Section 7. Expert evaluation of LLMs in diagnostic pathology further confirms that standard accuracy metrics systematically overestimate clinical utility [82]. The practical implication across all domains is that benchmark performance on general VLM evaluations is a poor predictor of performance on specialist tasks [12], reinforcing the need for domain-calibrated evaluation suites rather than reliance on general-purpose leaderboards.

Reliability, Validation, and Outstanding Challenges

The autonomous reporting success rate reported by [81] — 72.7% versus 84.0% for ground truth — exemplifies the kind of clinically grounded metric that the field needs more of, yet such longitudinal and deployment-oriented evaluations remain rare. Neither domain has yet produced systematic assessments of VLM hallucination rates under real-world operational conditions, a gap that is particularly concerning given the high-stakes consequences of confabulated findings in radiology or misidentified geospatial objects in intelligence applications [79, 85]. Hallucination in large vision-language models — spanning object confabulation, attribute misattribution, and relational errors — is increasingly documented as a function of distributional mismatch between pretraining corpora and deployment contexts [69, 89], yet controlled-benchmark evaluation protocols systematically underestimate the severity of these failures under real operational noise [11]. The modality contribution diagnostics developed by [17] could be productively deployed in these applied settings to determine whether domain-specific VLMs achieve their performance through genuine cross-modal integration or through unimodal shortcuts — a question whose answer carries direct implications for clinical trust. The absence of methods for incorporating continuous expert feedback into VLM training pipelines further limits the ability of deployed systems to self-correct over time [89]. Spatial reasoning, an additional and well-documented axis of failure [70], is particularly consequential in geospatial analysis, where relational misidentification of object positions, orientations, or proximities cannot be disambiguated post hoc from textual outputs alone. 

Finally, the near-exclusive focus on medical imaging and remote sensing as the two paradigmatic application domains leaves substantial intellectual territory unexplored — including materials science, agricultural monitoring, and legal document analysis — where the same tensions between general capability and domain specificity are likely to recur in analogous forms [79, 85].

Taken together, the applied evidence reviewed in this section demonstrates that the grounding failures systematically documented in general-purpose VLMs are not confined to controlled benchmarks but recur — and carry amplified consequences — in high-stakes professional settings. The near-10% accuracy penalty for a low-level distributional mismatch in a clinical tool, the persistent gap between domain-adapted and generalist model performance in radiology, and the spatial reasoning failures in geospatial analysis [70] all constitute applied instantiations of the representational bottlenecks and spurious correlation patterns identified in Section 7. At the same time, the structurally explicit alignment mechanisms emerging across these domains — bipartite graph representations, cross-modal memory banks, and interpretability-oriented architectures — point toward a research direction in which cross-modal correspondence is made auditable rather than opaque [11]. These applied findings thus contribute more than domain-specific performance numbers to the broader synthesis: they establish that the theoretical stakes of the grounding problem are fully realized in deployment, and that interpretability and robustness are not optional engineering refinements but necessary preconditions for the reliable operation of multimodal systems where errors carry real-world costs.

9. Discussion

The body of work synthesized in this review marks a genuine inflection point in how researchers conceptualize the relationship between language, perception, and meaning in artificial systems. Taken together, the findings across theoretical, computational, architectural, and applied domains reveal both remarkable progress and a set of persistent tensions that the field has not yet resolved — and in some cases has only recently begun to articulate with precision.

Perhaps the most significant shift in collective understanding over the past two to three years concerns what failure reveals. Earlier debates about grounding in AI systems tended to center on capability benchmarks: can a model match an image to a caption, answer a visual question, or localize a referred object? Recent work has moved decisively toward interrogating how these capabilities are achieved, and the picture that emerges is considerably more complex. Evidence of systematic grounding failures — models succeeding through spurious statistical associations, dataset artifacts, and linguistic shortcuts rather than genuine cross-modal integration — has reframed success on standard benchmarks from evidence of understanding to evidence of pattern exploitation. This is not merely a technical finding; it bears directly on the symbol grounding problem as classically formulated. The Harnadean intuition that symbols unmoored from perceptual contact remain meaningless finds a contemporary empirical echo in demonstrations that high benchmark performance can coexist with striking failures on minimally modified inputs designed to require actual visual reasoning. Bender and Koller’s more recent articulation of the form-meaning distinction [30] sharpens this point for the neural era: if meaning cannot be recovered from distributional form alone, then the interpretive burden falls on researchers to demonstrate that multimodal training adds something beyond richer statistical form — a demonstration that the grounding failures documented across this review suggest has not yet been achieved. The field now understands, in a way it arguably did not three years ago, that architectural access to visual signals does not automatically produce visual grounding in any semantically substantive sense.

From Behavioural Observation to Mechanistic Diagnosis

A decisive advance of the 2023–2025 period is the transition from documenting grounding failures behaviourally to explaining them mechanistically. Where earlier work could demonstrate that a model answered a spatial reasoning question incorrectly, the representational probing literature now reveals why: text encoder compression eliminates compositional distinctions before cross-modal fusion even begins [15]; modality imbalance means that visual evidence is systematically underweighted relative to linguistic priors [17, 11]; perceptual dimensions such as colour and shape are entangled rather than factored in the visual encoding [12]; and insufficient long-range attention prevents configural and relational processing that spatial reasoning requires [71, 70]. This convergence between mechanistic and behavioural evidence is among the most analytically productive developments the review documents. It transforms interpretability from a post-hoc curiosity into a diagnostic instrument that can identify where to intervene, distinguishing between a model that lacks a representation entirely and one that has the representation but fails to deploy it during generation [28].

The distinction between these two failure modes carries direct practical implications. If a model’s text encoder cannot represent the compositional distinction between “a cat on a mat” and “a mat on a cat” — as the findings of [15] suggest is the case for single-vector CLIP-style encoders — then no downstream architectural improvement can recover the lost information. The intervention must occur at the encoder level: structured or multi-vector text representations that preserve relational binding [24, 53], or neuro-symbolic approaches that maintain explicit compositional structure throughout encoding [90]. Conversely, if modality contribution analysis reveals that a model possesses relevant visual representations but systematically ignores them in favour of linguistic priors [17, 10], the intervention is different — modality-balanced training objectives or inference-time reweighting [11]. Standard benchmark accuracy conflates these categorically different failure modes, which is why the quantification challenge identified by [62] and operationalised by tools such as MM-SHAP [17] represents a genuine methodological advance rather than merely a refinement.
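
The encoder-level loss can be made concrete with a toy sketch. This is an illustration of the general point, not CLIP itself: any order-insensitive pooling maps "a cat on a mat" and "a mat on a cat" to identical vectors, whereas a representation that retains per-token position information trivially separates them. The vocabulary, vectors, and position scheme below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "cat", "on", "mat"]
emb = {w: rng.normal(size=8) for w in vocab}  # toy word vectors

def pooled(sentence):
    """Order-insensitive single-vector encoding (mean pooling)."""
    return np.mean([emb[w] for w in sentence.split()], axis=0)

def positional(sentence):
    """Per-token encoding with an additive position signal preserved."""
    toks = sentence.split()
    pos = [np.sin(np.arange(8) * (i + 1)) for i in range(len(toks))]
    return np.stack([e + p for e, p in zip((emb[w] for w in toks), pos)])

s1, s2 = "a cat on a mat", "a mat on a cat"
print(np.allclose(pooled(s1), pooled(s2)))      # True: relation discarded
print(np.allclose(positional(s1), positional(s2)))  # False: relation kept
```

Because the two sentences contain the same multiset of words, no downstream module can recover the relational distinction from the pooled vector; only the structured representation leaves it available.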

Compositionality as the Central Diagnostic

The compositional generalization literature provides formal precision for this diagnosis. Controlled benchmarks such as COGS [13] and the decomposed evaluation framework of [14] demonstrate that neural models, including Transformers, routinely achieve near-perfect in-distribution accuracy while failing catastrophically on compositionally novel structures — a pattern that exactly parallels the grounding failures documented in vision-language systems. This in-distribution/out-of-distribution accuracy gap is not an artifact of insufficient training: work on making Transformers solve compositional tasks [22] confirms that standard architectural choices systematically reproduce the failure even under increased training, and that closing the gap requires deliberate structural interventions such as architecture modifications and compositional attention mechanisms. The five-dimensional decomposition of compositionality into systematicity, productivity, substitutivity, localism, and overgeneralization [14] provides the field with a vocabulary for diagnosing which specific compositional capacities a model lacks, rather than treating grounding failure as a monolithic phenomenon. Applied to multimodal systems, this framework reveals that failures of novel attribute-object binding (a productivity failure), failures of spatial relational reasoning (a systematicity failure [70]), and failures to generalise across synonymous expressions (a substitutivity failure) are distinct competence deficits requiring distinct interventions — a diagnostic granularity that aggregate VQA accuracy scores [9] entirely obscure. Linguistic benchmarks centred on specific phenomena, such as VALSE [10], begin to recover some of this granularity, but remain limited in scope compared to a full compositional decomposition.
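
The holdout logic behind such benchmarks can be sketched in a few lines. This is a schematic illustration of a compositional split, not the COGS generation pipeline: every primitive appears during training, but specific attribute-object combinations are reserved for evaluation, which is what allows in-distribution accuracy and compositional generalization to diverge. The attributes and objects are hypothetical.

```python
from itertools import product

attributes = ["red", "blue", "small", "large"]
objects = ["cube", "sphere", "cone"]
pairs = list(product(attributes, objects))

# Systematic holdout: every attribute and every object is seen in training,
# but these specific combinations appear only at test time.
held_out = {("red", "cone"), ("large", "sphere")}
train = [p for p in pairs if p not in held_out]
test = sorted(held_out)

assert {a for a, _ in train} == set(attributes)  # full attribute coverage
assert {o for _, o in train} == set(objects)     # full object coverage
print(test)  # novel recombinations a systematic learner should handle
```

A model that fails on `test` while succeeding on `train` exhibits exactly the productivity deficit the decomposed framework isolates: the failure cannot be blamed on unseen primitives, only on unseen combinations.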

The mechanistic probing evidence now allows these compositional deficits to be linked to specific representational pathologies. The text encoder bottleneck documented by [15] provides a concrete mechanistic account of why productivity failures occur: if the compositional structure of a novel phrase is lost during encoding, the model cannot attempt productive generalisation regardless of its downstream capacity. The perceptual dimension entanglement found by [12] explains why substitutivity failures manifest on the visual side: if colour and shape are not independently represented, substituting one attribute value for another in a novel combination is representationally impossible. The spatial reasoning deficits documented by [70] can be traced to the absence of explicit positional encodings in certain architectures — a systematicity failure rooted in architectural design rather than training data, and one that neuro-symbolic approaches have begun to address through explicit spatial relation representations [90]. This three-way correspondence between compositional dimension, behavioural failure, and representational mechanism constitutes a genuinely new analytical contribution that neither the compositional generalization literature nor the probing literature could have produced independently.

The classical challenge posed by Fodor and Pylyshyn — that genuine cognitive systems must exhibit systematicity as a constitutive property rather than as an accidental by-product of training — thus reasserts itself with renewed empirical force [23]: multimodal models that exhibit compositional behaviour on familiar patterns but fail on novel recombinations have not satisfied the systematicity criterion in any principled sense [22, 23]. Lake and Baroni’s [23] meta-learning approach demonstrates that human-like systematic generalization remains achievable only through deliberate inductive bias, not scale alone. Unobserved local structures in training data further compound this failure, as models cannot generalise compositionally over substructures they have not encountered in isolation [76] — a finding that tightens the connection between data coverage limitations and the deeper architectural inadequacies the Fodor–Pylyshyn challenge foregrounds. Algebraic recombination approaches [25] offer one structural path forward, but the fundamental tension between statistical learning and principled compositionality remains unresolved.

The Concrete-Abstract Asymmetry and Its Theoretical Implications

This insight connects productively to findings from multimodal distributional semantics. Visually grounded representations demonstrably improve model performance on concrete, imageable concepts — the gains are well-documented and theoretically expected under embodied and simulation-based accounts of conceptual representation. What remains conspicuously underaddressed is the grounding of abstract language in computational systems. Justice, freedom, causality, obligation: these concepts dominate morally and practically consequential discourse, yet computational implementations of how they become grounded remain nascent. Crucially, however, the gap here is not theoretical in the way the field sometimes assumes. Cognitive science has produced substantive accounts of abstract concept grounding — Conceptual Metaphor Theory's systematic cross-domain mappings from concrete to abstract [45, 44], the experientialist framework's account of image schemas as bridging structures between bodily experience and conceptual thought [46], the Words as Social Tools hypothesis emphasising linguistic and social acquisition routes [42, 50], and pluralistic frameworks proposing complementary embodied and linguistic representational systems [39] — each generating specific, testable predictions about when and how visual grounding should benefit abstract concepts. Safron's active inference framework [35] adds a further dimension to this challenge by arguing that genuine grounding requires not merely perceptual input but the sensorimotor loops of embodied action — a condition that passive image-text pretraining cannot satisfy and that may be particularly relevant for abstract concepts whose grounding, on any theoretical account, requires more than visual co-occurrence.

What is missing is not a theory of abstract grounding but the operationalisation of these theories in multimodal architectures. Conceptual Metaphor Theory predicts, for instance, that systems exposed to structured pairings of concrete source-domain imagery with abstract target-domain language — ANGER as HEAT, TIME as SPACE, UNDERSTANDING as SEEING — should acquire more robust representations of those abstract concepts than systems trained on text alone or on unstructured image-text co-occurrence [45, 44]. The CMT prediction is not merely that multimodal training helps with abstract concepts in some diffuse sense, but that it should help selectively and structurally, along the specific mappings that metaphor theory identifies. A pluralistic architecture inspired by [39] would further predict that certain abstract concepts — those with stronger experiential or metaphorical grounding — should respond more to visual signal, while others whose acquisition is more exclusively linguistic and social [42, 50] should remain relatively insensitive. These are falsifiable architectural predictions, and the absence of benchmarks designed to test them constitutes one of the clearest methodological gaps this review has identified.
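
One way to begin operationalising the selective-benefit prediction is a corpus schema that records the metaphorical mapping explicitly, so that gains can later be measured per mapping rather than in aggregate. The record structure, field names, and examples below are hypothetical, intended only to show what "structured pairings" could look like as data.

```python
from dataclasses import dataclass

@dataclass
class MetaphorPair:
    """Hypothetical CMT-structured training record: concrete source-domain
    imagery paired with abstract target-domain language under a named mapping."""
    mapping: str        # e.g. "UNDERSTANDING IS SEEING"
    source_image: str   # concrete source-domain image (path or URL)
    target_text: str    # abstract target-domain sentence

corpus = [
    MetaphorPair("ANGER IS HEAT", "kettle_boiling.jpg",
                 "Her temper finally boiled over."),
    MetaphorPair("TIME IS SPACE", "road_to_horizon.jpg",
                 "The deadline still lies far ahead of us."),
]

# The CMT prediction is selective: gains should concentrate on abstract
# concepts whose mapping is represented in the corpus, not spread diffusely.
mappings_covered = sorted({p.mapping for p in corpus})
print(mappings_covered)  # ['ANGER IS HEAT', 'TIME IS SPACE']
```

A benchmark built on such a schema could then test whether a model trained on ANGER-as-HEAT pairings improves on anger-related abstract language specifically, which is the structural prediction the theory makes.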

Synthesis: Four Research Questions Revisited

The review was organised around four research questions, and the accumulated evidence now permits summary answers to each. First, regarding what computational grounding requires beyond statistical co-occurrence: the evidence converges on the conclusion that co-occurrence, even across modalities, is insufficient. The systematic failures on compositionally novel inputs [13, 14], the persistence of modality imbalance even in architectures with dedicated cross-modal fusion [17], and the encoder-level compression losses documented by [15] collectively demonstrate that structural and relational information is routinely discarded or underweighted in current systems — losses that no amount of additional co-occurrence data can remedy.

Second, regarding whether current architectures achieve grounding in any semantically substantive sense: the answer must be qualified but largely negative for compositionally demanding tasks. Models achieve impressive in-distribution performance, but the mechanistic evidence reveals that this performance is frequently underpinned by unimodal shortcuts, distributional heuristics, and the exploitation of dataset-specific regularities rather than genuine cross-modal integration [10, 11]. The perceptual entanglement documented by [12], the spatial reasoning failures traced to architectural design by [70], and the sensitivity to surface-level distributional shifts identified in applied domains — including remote sensing [85] and medical imaging [81] — together paint a picture of systems that have learned to behave as if grounded under familiar conditions rather than systems that have acquired the representational structures grounding requires.

Third, regarding the role of embodiment and sensorimotor experience: the evidence supports the position that passive image-text pretraining falls short of the conditions that theorists from multiple traditions identify as necessary for full grounding [35, 46, 22, 23]. Whether this shortfall is principled — reflecting an irreducible dependence on sensorimotor loops — or practical — reflecting contingent limitations of current data and architectures that richer multimodal pipelines could address — remains genuinely open. The review does not resolve this debate, but it does sharpen its empirical terms.

Fourth, regarding evaluation: the field’s current reliance on aggregate accuracy metrics is demonstrably inadequate [10]. The five-dimensional compositional framework [14], the modality contribution diagnostics of [17], and the quantification agenda articulated by [62] collectively define the contours of a more adequate evaluative approach — one that combines mechanistic interpretability tools [28] with behaviourally decomposed diagnostics rather than treating benchmark accuracy as a proxy for understanding.

Toward an Adequate Evaluative Framework

The evaluative framework implied by these findings has both methodological and institutional dimensions. Methodologically, it requires the integration of three layers of evidence that current practice tends to treat in isolation. Behavioural diagnostics — compositionally controlled benchmarks, counterfactual probes, distribution-shift sensitivity analyses — identify what a system fails to do. Mechanistic analyses — representational probing, attention attribution, modality contribution analysis via tools such as MM-SHAP [17] — identify where in the architecture the failure originates. And theoretical grounding — asking whether the failure corresponds to a violation of systematicity [22, 23], a compression loss [15], or a metaphorical mapping deficit [45, 44] — identifies what kind of deficit is present and what class of intervention is therefore indicated. No single layer is sufficient; the diagnostic power arises from their combination.
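
The behavioural layer of this framework can be illustrated with a deliberately crude ablation diagnostic, written here in the spirit of modality-contribution analysis; MM-SHAP itself uses Shapley values over input tokens and patches, and the model, dataset, and function names below are hypothetical stand-ins.

```python
import random

def modality_reliance(model, dataset, seed=0):
    """Crude ablation diagnostic: how much does accuracy drop when the
    visual input is replaced by a randomly chosen (likely mismatched)
    image? A near-zero drop suggests the model answers from linguistic
    priors rather than the image."""
    rng = random.Random(seed)
    images = [img for img, _, _ in dataset]
    n = len(dataset)
    full = sum(model(img, q) == a for img, q, a in dataset)
    ablated = sum(model(rng.choice(images), q) == a for _, q, a in dataset)
    return full / n, ablated / n, (full - ablated) / n

# A toy "model" that ignores its image entirely shows zero visual reliance
# while still scoring perfectly on this (rigged) dataset.
data = [("img1", "what colour?", "red"), ("img2", "what colour?", "red")]
blind = lambda img, q: "red"
print(modality_reliance(blind, data))  # (1.0, 1.0, 0.0)
```

The toy case makes the conflation concrete: aggregate accuracy alone reports 1.0 for a model that never consults the image, whereas the ablated score exposes the unimodal shortcut.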

Institutionally, this framework implies the need for evaluation suites that are co-designed with domain experts in high-stakes application areas, regularly updated to resist the dataset contamination and shortcut learning that static benchmarks inevitably accumulate [11, 12], and explicitly calibrated to distinguish in-distribution pattern matching from out-of-distribution generalisation [13, 14]. The domain gap findings from medical imaging [81] and remote sensing applications [85] illustrate precisely what happens when these conditions are not met: general-purpose leaderboard performance provides no reliable signal about specialist deployment behaviour. In remote sensing, for instance, models pre-trained on web-scale data require substantial domain-specific re-alignment — via targeted datasets, novel pre-training objectives, and architecture modifications — before achieving reliable performance on even standard downstream tasks such as scene classification and cross-modal retrieval [85].

Forward-Looking Priorities

Looking ahead, several theoretical and engineering priorities emerge with particular urgency from this synthesis. On the engineering side, the most pressing need is for encoder architectures that preserve compositional and relational structure — structured or multi-vector representations that do not collapse relational bindings into a single vector, and fusion mechanisms that actively enforce modality balance rather than allowing linguistic priors to dominate by default [17, 15]. Compositional generalization benchmarks such as COGS [13] and related diagnostics [14] have demonstrated that this failure mode is systematic rather than incidental, underscoring the urgency of architectural remedies. Equally urgent is the development of training objectives and data curation strategies informed by CMT and related cognitive theories [45, 44, 46]: if structured metaphorical mappings are indeed the mechanism through which abstract concepts acquire grounded meaning, then training pipelines should be designed to instantiate those mappings rather than relying on their incidental emergence from unstructured web-scale corpora — a reliance that, as Bender and Koller [30] influentially argued, risks conflating statistical co-occurrence with genuine semantic understanding.

On the theoretical side, the most consequential open question is whether the systematicity criterion [22, 23] can be satisfied by any architecture that learns from distributional evidence alone, or whether its satisfaction requires structural inductive biases — explicit relational representations, compositional modules [25], or sensorimotor grounding [5, 33] — that current deep learning paradigms do not provide. The pluralistic framework [39] suggests that this may not be a binary choice: different concepts may require different grounding routes, and architectures capable of routing flexibly between embodied, metaphorical, and linguistic grounding mechanisms may prove more adequate than any single-pathway approach [34]. Testing this prediction requires both theoretical elaboration and the controlled experimental infrastructure — benchmarks, probing methods, and modality contribution diagnostics [17, 12, 11] — that the 2023–2025 period has begun to build but not yet completed. Linguistic phenomena-centered benchmarks such as VALSE [10] exemplify the kind of targeted diagnostic infrastructure that this programme requires, probing specific linguistic and perceptual capacities rather than aggregate performance.

The symbol grounding problem, in its classical formulation [1, 19], asked whether symbols could ever be more than tokens in a formal game — whether they could acquire the kind of contact with the world that makes genuine reference possible. This review suggests that the question has not been answered in the affirmative by current multimodal systems, but that the field is, for the first time, asking it with sufficient empirical precision to make the answer discoverable. That is genuine progress, and it sets a clear agenda for the years ahead.


References

[1] Coradeschi, S., Loutfi, A., Wrede, B. (2012). A Short Review of Symbol Grounding in Robotic and Intelligent Systems. KI - Künstliche Intelligenz.
[2] Taniguchi, T., Nagai, T., Nakamura, T., et al. (2015). Symbol Emergence in Robotics: A Survey. Advanced Robotics.
[3] Wang, X., Chen, G., Qian, G., et al. (2023). Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey. Machine Intelligence Research.
[4] Xu, P., Zhu, X., Clifton, D. (2023). Multimodal Learning with Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Andrews, M., Frank, S., Vigliocco, G. (2014). Reconciling Embodied and Distributional Accounts of Meaning in Language. Topics in Cognitive Science.
[6] Barsalou, L. (2010). Grounded Cognition: Past, Present, and Future. Topics in Cognitive Science.
[7] Tan, H., Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers.
[8] Li, G., Duan, N., Fang, Y., et al. (2020). Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training. Proceedings of the AAAI Conference on Artificial Intelligence.
[9] Agrawal, A., Lu, J., Antol, S., et al. (2015). VQA: Visual Question Answering. arXiv (Cornell University).
[10] Parcalabescu, L., Cafagna, M., Muradjan, L., et al. (2021). VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[11] Fan, Z., Wang, Y., Polisetty, S., et al. (2025). Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward.
[12] Gao, Q., Pi, X., Liu, K., et al. (2025). Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation.
[13] Kim, N., Linzen, T. (2020). COGS: A Compositional Generalization Challenge Based on Semantic Interpretation.
[14] Hupkes, D., Dankers, V., Mul, M., et al. (2020). Compositionality Decomposed: How Do Neural Networks Generalise?. Journal of Artificial Intelligence Research.
[15] Kamath, A., Hessel, J., Chang, K. (2023). Text Encoders Bottleneck Compositionality in Contrastive Vision-Language Models.
[16] Kunz, J. (2024). Understanding Large Language Models: Towards Rigorous and Targeted Interpretability Using Probing Classifiers and Self-Rationalisation. Linköping Studies in Science and Technology. Dissertations.
[17] Pârcălăbescu, L., Frank, A. (2023). MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks.
[18] Chrupała, G. (2021). Visually Grounded Models of Spoken Language: A Survey of Datasets, Architectures and Evaluation Techniques. Journal of Artificial Intelligence Research.
[19] Taniguchi, T., Piater, J., Wörgötter, F., et al. (2018). Symbol Emergence in Cognitive Developmental Systems: A Survey. IEEE Transactions on Cognitive and Developmental Systems.
[20] Pezzulo, G., Barsalou, L., Cangelosi, A., et al. (2011). The Mechanics of Embodiment: A Dialog on Embodiment and Computational Modeling. Frontiers in Psychology.
[21] Pezzulo, G., Barsalou, L., Cangelosi, A., et al. (2013). Computational Grounded Cognition: A New Alliance Between Grounded Cognition and Computational Modeling. Frontiers in Psychology.
[22] Ontañón, S., Ainslie, J., Fisher, Z., et al. (2021). Making Transformers Solve Compositional Tasks. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[23] Lake, B., Baroni, M. (2023). Human-like Systematic Generalization Through a Meta-learning Neural Network. Nature.
[24] Kamali, D., Barezi, E., Kordjamshidi, P. (2025). NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization. Proceedings of the AAAI Conference on Artificial Intelligence.
[25] Liu, C., An, S., Lin, Z., et al. (2020). Learning Algebraic Recombination for Compositional Generalization.
[26] An, S., Lin, Z., Fu, Q., et al. (2023). How Do In-Context Examples Affect Compositional Generalization?.
[27] Chrupała, G., Gelderloos, L., Alishahi, A. (2017). Representations of Language in a Model of Visually Grounded Speech Signal.
[28] Bereska, L., Gavves, E. (2024). Mechanistic Interpretability for AI Safety: A Review. arXiv (Cornell University).
[29] Harnad, S. (1990). The Symbol Grounding Problem. Physica D: Nonlinear Phenomena.
[30] Bender, E., Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.
[31] Marcolli, M., Chomsky, N., Berwick, R. (2025). Mathematical Structure of Syntactic Merge: An Algebraic Model for Generative Linguistics. The MIT Press.
[32] Gallese, V., Lakoff, G. (2005). The Brain's Concepts: The Role of the Sensory-Motor System in Conceptual Knowledge. Cognitive Neuropsychology.
[33] Bechtold, L., Cosper, S., Malyshevskaya, A., et al. (2023). Brain Signatures of Embodied Semantics and Language: A Consensus Paper. Journal of Cognition.
[34] Körner, A., Castillo, M., Drijvers, L., et al. (2023). Embodied Processing at Six Linguistic Granularity Levels: A Consensus Paper. Journal of Cognition.
[35] Safron, A. (2021). The Radically Embodied Conscious Cybernetic Bayesian Brain: From Free Energy to Free Will and Back Again. Entropy.
[36] Clark, A. (2013). Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science. Behavioral and Brain Sciences.
[37] Pezzulo, G., Parr, T., Friston, K. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. Biological Psychology.
[38] Binder, J. (2016). In Defense of Abstract Conceptual Representations. Psychonomic Bulletin & Review.
[39] Dove, G. (2011). On the Need for Embodied and Dis-embodied Cognition. Frontiers in Psychology.
[40] Martin, A. (2020). A Compositional Neural Architecture for Language. Journal of Cognitive Neuroscience.
[41] Stern, M. (2025). Dynamic Field Theory Unifies Discrete and Continuous Aspects of Linguistic Representations. Proceedings of the Linguistic Society of America.
[42] Borghi, A., Barca, L., Binkofski, F., et al. (2019). Words as Social Tools: Language, Sociality and Inner Grounding in Abstract Concepts. Physics of Life Reviews.
[43] Zuanazzi, A., Ripollés, P., Lin, W., et al. (2024). Negation Mitigates Rather than Inverts the Neural Representations of Adjectives. PLoS Biology.
[44] Lakoff, G. (2014). Mapping the Brain's Metaphor Circuitry: Metaphorical Thought in Everyday Reason. Frontiers in Human Neuroscience.
[45] Lakoff, G., Johnson, M. (1980). The Metaphorical Structure of the Human Conceptual System. Cognitive Science.
[46] Clark, K. (2024). Embodied Imagination: Lakoff and Johnson's Experientialist View of Conceptual Understanding. Review of General Psychology.
[47] Wilson, N., Gibbs, R. (2007). Real and Imagined Body Movement Primes Metaphor Comprehension. Cognitive Science.
[48] Boulenger, V., Hauk, O., Pulvermüller, F. (2008). Grasping Ideas with the Motor System: Semantic Somatotopy in Idiom Comprehension. Cerebral Cortex.
[49] Friedrich, J., Raab, M., Voigt, L. (2025). Grounded Cognition and the Representation of Momentum: Abstract Concepts Modulate Mislocalization. Psychological Research.
[50] Borghi, A., Mazzuca, C., Tummolini, L. (2025). The Role of Social Interaction in the Formation and Use of Abstract Concepts.
[51] Tellex, S., Kollar, T., Dickerson, S., et al. (2011). Approaching the Symbol Grounding Problem with Probabilistic Graphical Models. AI Magazine.
[52] Mao, J., Gan, C., Kohli, P., et al. (2019). The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision. DSpace@MIT (Massachusetts Institute of Technology).
[53] Hsu, J., Mao, J., Tenenbaum, J., et al. (2023). What's Left? Concept Grounding with Logic-Enhanced Foundation Models. arXiv (Cornell University).
[54] Barnaby, C., Chen, Q., Ramalingam, R., et al. (2025). Active Learning for Neurosymbolic Program Synthesis. Proceedings of the ACM on Programming Languages.
[55] Saxena, P., Raghuvanshi, N., Goveas, N. (2025). UAV-VLN: End-to-End Vision Language Guided Navigation for UAVs.
[56] Schumann, R., Zhu, W., Feng, W., et al. (2024). VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View. Proceedings of the AAAI Conference on Artificial Intelligence.
[57] Bruni, E., Tran, N., Baroni, M. (2014). Multimodal Distributional Semantics. Journal of Artificial Intelligence Research.
[58] Morency, L., Baltrušaitis, T. (2016). Multimodal Machine Learning: Integrating Language, Vision and Speech.
[59] Harwath, D., Glass, J. (2016). Learning Word-like Units from Joint Audio-Visual Analysis.
[60] Tenney, I., Das, D., Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline.
[61] Jawahar, G., Sagot, B., Seddah, D. (2019). What Does BERT Learn About the Structure of Language?.
[62] Liang, P., Zadeh, A., Morency, L. (2024). Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys.
[63] Krishna, R., Zhu, Y., Groth, O., et al. (2017). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision.
[64] Xu, H., Ghosh, G., Huang, P., et al. (2021). VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
[65] Li, L., Chen, Y., Cheng, Y., et al. (2020). HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training.
[66] Orynbay, L., Razakhova, B., Peer, P., et al. (2024). Recent Advances in Synthesis and Interaction of Speech, Text, and Vision. Electronics.
[67] Hessel, J., Holtzman, A., Forbes, M., et al. (2021). CLIPScore: A Reference-free Evaluation Metric for Image Captioning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
[68] Wada, Y., Kaneda, K., Saito, D., et al. (2024). Polos: Multimodal Metric Learning from Human Feedback for Image Captioning.
[69] Zhang, Y., Li, Y., Cui, L., et al. (2025). Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Computational Linguistics.
[70] Liu, F., Emerson, G., Collier, N. (2023). Visual Spatial Reasoning. Transactions of the Association for Computational Linguistics.
[71] Doshi, F., Fel, T., Konkle, T., et al. (2025). Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models. arXiv (Cornell University).
[72] Liu, Y., Li, G., Lin, L. (2023). Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[73] Xiao, J., Yao, A., Li, Y., et al. (2024). Can I Trust Your Answer? Visually Grounded Video Question Answering.
[74] Lundberg, S., Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. arXiv (Cornell University).
[75] Shaw, P., Chang, M., Pasupat, P., et al. (2020). Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?.
[76] Bogin, B., Gupta, S., Berant, J. (2022). Unobserved Local Structures Make Compositional Generalization Hard.
[77] Schölkopf, B., Locatello, F., Bauer, S., et al. (2021). Toward Causal Representation Learning. Proceedings of the IEEE.
[78] Chen, W., Liu, Y., Chen, B., et al. (2025). Cross-modal Causal Relation Alignment for Video Question Grounding.
[79] Li, X., Li, L., Jiang, Y., et al. (2025). Vision-Language Models in Medical Image Analysis: From Simple Fusion to General Large Models. Information Fusion.
[80] Abdulaal, A., Fry, H., Montaña-Brown, N., et al. (2024). An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation. arXiv (Cornell University).
[81] Lee, S., Youn, J., Kim, H., et al. (2025). CXR-LLaVA: A Multimodal Large Language Model for Interpreting Chest X-ray Images. European Radiology.
[82] Ullah, E., Waqas, A., Khan, A., et al. (2025). Reasoning Beyond Accuracy: Expert Evaluation of Large Language Models in Diagnostic Pathology.
[83] Ruzi, R., Pan, Y., Ng, M., et al. (2025). A Speech-Based Mobile Screening Tool for Mild Cognitive Impairment: Technical Performance and User Engagement Evaluation. Bioengineering.
[84] Lee, B., Bang, J., Song, H., et al. (2025). Alzheimer's Disease Recognition Using Graph Neural Network by Leveraging Image-Text Similarity from Vision Language Model. Scientific Reports.
[85] Weng, X., Pang, C., Xia, G. (2024). Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives. IEEE Geoscience and Remote Sensing Magazine.
[86] Muksimova, S., Umirzakova, S., Sultanov, M., et al. (2025). Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization. Sensors.
[87] Yang, W., Wei, Y., Wei, H., et al. (2023). Survey on Explainable AI: From Approaches, Limitations and Applications Aspects. Human-Centric Intelligent Systems.
[88] Calderon, N., Reichart, R. (2024). On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs.
[89] Kazlaris, I., Antoniou, E., Diamantaras, K., et al. (2025). From Illusion to Insight: A Taxonomic Survey of Hallucination Mitigation Techniques in LLMs. Preprints.org.
[90] Premsri, T., Kordjamshidi, P. (2024). Neuro-symbolic Training for Spatial Reasoning over Natural Language.