Existing multimodal models (CLIP, ImageBind) align physical and linguistic representations in a single shared space, losing the structure of each modality in the process. We propose an alternative: keep the physical and linguistic planes separate, combining them through a reversible addition operation. The central claim is that if physics + language = combined, then the physical plane can be recovered from the combined embedding without any linguistic context, purely through subtraction. Experiments on synthetic data confirm the viability of this architecture: the physics recovery error was 0.0109, demonstrating zero-shot generalization through meaning rather than through language tokens.
Artem Gorbunov (Wed,) studied this question.
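
The reversible addition described in the abstract can be illustrated with a minimal sketch. It assumes that both planes are plain vectors of equal dimension and that the term subtracted from the combined embedding is the linguistic-plane vector; the abstract does not specify how that term is obtained. The helper names (`combine`, `recover_physics`, `max_abs_error`) and the random stand-in vectors are hypothetical and are not taken from the work itself.

```python
import random


def combine(physics, language):
    """Element-wise addition: combined = physics + language."""
    return [p + l for p, l in zip(physics, language)]


def recover_physics(combined, language):
    """Element-wise subtraction: physics = combined - language."""
    return [c - l for c, l in zip(combined, language)]


def max_abs_error(a, b):
    """Largest absolute element-wise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Assumption: random Gaussian vectors stand in for the two planes.
rng = random.Random(0)
dim = 8
physics = [rng.gauss(0.0, 1.0) for _ in range(dim)]
language = [rng.gauss(0.0, 1.0) for _ in range(dim)]

combined = combine(physics, language)
recovered = recover_physics(combined, language)

# With exact vectors, only floating-point rounding separates the two.
error = max_abs_error(recovered, physics)
print(f"max abs recovery error: {error:.2e}")
```

With exact vectors the recovery error in this sketch is limited to floating-point rounding. It does not reproduce the reported error of 0.0109, whose measurement setup is not described in the abstract.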