Compositionality in learned representations is the degree to which a model encodes complex expressions by combining representations of simpler parts. Measuring it requires both empirical benchmarks and analytic probes that reveal whether internal vectors compose in systematic, rule-like ways that support new combinations not seen during training. Scholars from linguistics and machine learning emphasize this as central to generalization: Noam Chomsky at MIT framed compositionality as a linguistic principle, and Brenden Lake at New York University operationalized it for neural models with benchmarks that test systematic generalization.
Analytic and empirical measures
Researchers commonly use probing classifiers to test whether specific symbolic properties are linearly decodable from hidden states; success suggests the representations contain decomposable content. Controlled generalization splits, such as those in the SCAN benchmark introduced by Brenden Lake and Marco Baroni, test whether models trained on a subset of command–action pairs can produce correct outputs for novel compositions. Representational Similarity Analysis compares distances between representations of composed inputs and the expected combinations of part representations, while information-theoretic measures such as mutual information quantify statistical dependence between parts and wholes. Causal interventions and counterfactual inputs probe whether changing a subcomponent produces the predicted change in the composed output, revealing functional compositionality rather than mere correlation.
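Two of these measures can be sketched on synthetic data. The following is a minimal illustration, not an implementation of any published protocol: it assumes a toy "model" whose composed representation is the sum of hypothetical part vectors plus noise, then (a) runs an RSA-style check comparing the composed vector to the additive combination of its parts, and (b) fits a closed-form linear probe to decode one part attribute from the composed vectors. All names (`parts`, `encode`, the modifier/noun vocabulary) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # hidden dimension (illustrative)

# Hypothetical part representations for two modifiers and two nouns.
parts = {name: rng.normal(size=d) for name in ["red", "blue", "circle", "square"]}

def encode(modifier, noun, noise=0.05):
    """Toy stand-in for a model encoder: additive composition plus noise."""
    return parts[modifier] + parts[noun] + noise * rng.normal(size=d)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# (a) RSA-style check: does the composed vector resemble the sum of its parts?
composed = encode("red", "circle")
additive = parts["red"] + parts["circle"]
print(f"cosine(composed, parts-sum) = {cosine(composed, additive):.3f}")

# (b) Linear probe: decode the modifier (red vs. blue) from composed vectors
# using a closed-form least-squares classifier.
pairs = [(m, n) for m in ["red", "blue"] for n in ["circle", "square"]]
X = np.stack([encode(m, n) for (m, n) in pairs for _ in range(50)])
y = np.array([1.0 if m == "red" else -1.0 for (m, n) in pairs for _ in range(50)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = np.mean(np.sign(X @ w) == y)
print(f"probe accuracy = {acc:.2f}")
```

Because the toy encoder is additive by construction, both scores come out high; the point of running the same diagnostics on a real model is precisely that neither outcome is guaranteed, and high linear decodability alone does not establish functional compositionality.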
Causes, relevance, and consequences
Compositional structure in learned representations arises from model inductive biases, training objectives, and data distribution. Yoshua Bengio at Université de Montréal has argued that appropriate inductive biases—architectural constraints or learning priors—encourage compositional solutions. When models lack such biases or are trained on culturally skewed corpora, they may memorize combinations instead of forming recombinable parts, reducing robustness across domains and disadvantaging low-resource languages and marginalized communities whose constructions are underrepresented. Environmentally, reliance on massive, diverse datasets has energy and geographic implications: training to capture compositionality at scale can increase compute and carbon costs, which concentrate in particular regions and institutions.
Measuring compositionality therefore matters for trustworthiness and practical deployment. Clear, reproducible metrics combining probes, controlled benchmarks, and causal tests help determine whether a system will generalize to unseen combinations or fail in predictable ways. Nuanced interpretation is essential: linear decodability does not guarantee human-like symbolic reasoning, and benchmark success may reflect dataset artifacts. Combining multiple, transparently reported measures provides stronger evidence that a model truly composes, supporting safer, more equitable applications.