
March 25, 2026 (Open Access)

Emergent Compositional Communication for Latent World Properties

Key Points

The research aims to extract and represent physical knowledge encoded in video foundation models in a structured manner.
Utilized multi-agent systems to apply communication pressure during training.
Employed Gumbel-Softmax as a bottleneck for compositional representation.
Conducted controlled comparisons in physics simulations using DINOv2 and V-JEPA 2 models.
Implemented causal interventions to assess the impact of targeted property manipulation.
DINOv2 outperformed V-JEPA 2 on spatially visible physics (ramp: 98.3% vs. 95.1%).
V-JEPA 2 performed better in dynamics-only collision scenarios (87.4% vs. 77.7%).
With 4 agents, 100% of seeds converged to near-perfect positional disentanglement (PosDis = 0.999).
Causal interventions disrupted the targeted property (about a 15% drop) while leaving the others largely intact (under 3%).
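
The Gumbel-Softmax bottleneck named in the key points can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `gumbel_softmax`, the temperature `tau`, and the `hard` flag are assumptions chosen for illustration, and because NumPy has no autograd, only the forward sampling step is shown (the straight-through gradient used in training is omitted).

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, rng=None):
    """Draw one (relaxed) one-hot sample per row of `logits`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via inverse transform sampling.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Perturb the logits, scale by temperature, then apply a stable softmax.
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    if not hard:
        return soft
    # Discretize to a one-hot symbol (the forward pass of a
    # straight-through estimator; gradients are not modeled here).
    one_hot = np.zeros_like(soft)
    idx = np.expand_dims(soft.argmax(axis=-1), axis=-1)
    np.put_along_axis(one_hot, idx, 1.0, axis=-1)
    return one_hot

# Usage: 4 message positions, vocabulary of 5 symbols each.
rng = np.random.default_rng(0)
message = gumbel_softmax(np.zeros((4, 5)), tau=1.0, hard=True, rng=rng)
```

Lower values of `tau` push the soft sample toward a one-hot vector, which is what makes the channel behave as a discrete bottleneck.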

Abstract

What physical knowledge do video foundation models encode, and can it be extracted into discrete, compositional form? We show that multi-agent communication pressure, combined with a discrete Gumbel-Softmax bottleneck and iterated learning, induces compositional representations of world properties that are invisible in any single observation (elasticity, friction, mass ratio) from frozen pretrained features alone. The backbone determines what is communicable. In a controlled 2×2 factorial comparison on physics simulations, DINOv2 (Oquab et al., 2023) dominates on spatially-visible physics (ramp: 98.3% vs. 95.1%), while V-JEPA 2 (Assran et al., 2025) dominates on dynamics-only collision physics where properties are recoverable only from temporal velocity differences (87.4% vs. 77.7%, d=2.74). Scale-matched and frame-matched controls definitively attribute this gap to video-native pretraining: DINOv2 ViT-L at matched parameters performs worse (d=3.37), and DINOv2 with matched frame count degrades further (d=6.53). This extends the findings of Garrido et al. (2025), who showed V-JEPA representations encode intuitive physics: our results demonstrate that this physical knowledge can be compressed into discrete, compositional codes under communication pressure. Multi-agent structure drives compositionality. With 4 agents, 100% of seeds converge to near-perfect positional disentanglement (PosDis = 0.999, holdout 98.3%; n=80), while 2 agents produce compositional protocols stochastically (54%). Targeted controls confirm the driver is multi-agent structure, not bandwidth or temporal coverage. Causal intervention via position-zeroing shows surgical property disruption (∼15% drop on targeted property, <3% on others), and the frozen protocol transfers to cross-property reasoning (93.8%), outcome prediction (88.7% at 25× compression), and action-conditioned planning with counterfactual velocity reasoning (91.5%, r=0.780).
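
The abstract reports positional disentanglement (PosDis) without defining it. As a rough sketch, assuming the paper follows the definition common in the emergent-communication literature (for each message position, the gap between the two highest mutual-information values with the attributes, normalized by the entropy of that position, averaged over positions), it could be computed as follows. The function names and the array layout are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a sequence of hashable symbols."""
    counts = np.array(list(Counter(xs).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def posdis(messages, attributes):
    """messages: (N, L) int symbols; attributes: (N, K) int labels, K >= 2."""
    messages, attributes = np.asarray(messages), np.asarray(attributes)
    scores = []
    for j in range(messages.shape[1]):
        sym = messages[:, j].tolist()
        h = entropy(sym)
        if h == 0.0:  # a constant position carries no information; skip it
            continue
        mi = sorted(
            (mutual_information(sym, attributes[:, k].tolist())
             for k in range(attributes.shape[1])),
            reverse=True,
        )
        # Gap between the best and second-best attribute for this position.
        scores.append((mi[0] - mi[1]) / h)
    return float(np.mean(scores)) if scores else 0.0

# Usage: two attributes with 3 values each, every combination seen once.
attrs = np.array([[a, b] for a in range(3) for b in range(3)])
score = posdis(attrs, attrs)  # each position copies exactly one attribute
```

Under this definition, a protocol in which each position encodes exactly one attribute scores near 1, while a protocol in which every position mixes all attributes scores near 0.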
Validation on Physics 101 real camera footage (Wu et al., 2016) confirms mass-comparison accuracy of 85.6% on unseen objects, with temporal dynamics contributing +11.2% beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and the causal intervention extending to real video (zeroing the mass-relevant agent reduces accuracy by 7.8pp while the other causes only 2.1pp disruption; p=0.022, d=1.87). The discrete communication channel functions as an empirical instantiation of the latent variable z in LeCun's (2022) cognitive architecture: our results provide evidence that discrete latent structure within JEPA-style world models supports compositional, causally addressable physical reasoning.
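
The position-zeroing intervention described in the abstract can be sketched as a simple ablation loop. The abstract does not specify what value replaces the zeroed position or how the receiver is built, so the code below makes assumptions: symbol 0 stands in for the zeroed value, and `receiver` is a hypothetical callable that maps messages to per-property predictions.

```python
import numpy as np

def accuracy_drop_by_position(messages, targets, receiver):
    """Zero each message position in turn and report per-property accuracy drops.

    messages: (N, L) int array of symbols.
    targets:  (N, K) int array of property labels.
    receiver: hypothetical callable, (N, L) messages -> (N, K) predictions.
    Returns an (L, K) array; entry [j, k] is the accuracy lost on
    property k when position j is zeroed.
    """
    messages = np.asarray(messages)
    targets = np.asarray(targets)
    base = (receiver(messages) == targets).mean(axis=0)
    drops = np.zeros((messages.shape[1], targets.shape[1]))
    for j in range(messages.shape[1]):
        ablated = messages.copy()
        ablated[:, j] = 0  # the intervention: overwrite position j with symbol 0
        drops[j] = base - (receiver(ablated) == targets).mean(axis=0)
    return drops

# Usage with a toy, perfectly disentangled protocol: position k encodes
# property k directly, and the receiver simply reads the symbols back.
targets = np.array([[a, b] for a in range(3) for b in range(3)])
drops = accuracy_drop_by_position(targets, targets, receiver=lambda m: m)
```

In a disentangled protocol the resulting matrix is close to diagonal: zeroing a position hurts its own property and leaves the others nearly untouched, which is the pattern the abstract calls surgical disruption.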
