March 31, 2026 · Open Access

When Does BiMamba Beat Transformers in JEPA-style Masked Latent Prediction? Evidence from Image and Video Benchmarks

Key Points

This research compares the performance of Transformer architectures with Mamba variants in masked latent prediction tasks.
Compared 7 architectures: Transformer, Vanilla Mamba, Bidirectional Mamba (BiMamba), and 4 Sequential Attention variants.
Evaluated across 5 datasets including Moving MNIST and HMDB-51, ranging from simple images to complex videos.
Used mean squared error (MSE) as the metric for comparing architectures.
On the fine-grained temporal task (HMDB-51), BiMamba achieves roughly half the MSE of Transformer (0.55× ± 0.02 across 3 seeds).
On tasks with coarse temporal structure, Transformer remains competitive or better: the BiMamba-to-Transformer MSE ratio is 1.10× on ImageNet and 1.31× ± 0.10 on UCF-101, where a ratio above 1 favors Transformer.
Sequential Attention approaches fail structurally when applied to Mamba.

Abstract

We present, to our knowledge, the first empirical comparison of Transformer attention and Mamba (Structured State Space Model) in Joint-Embedding Predictive Architecture (JEPA). While Mamba has shown competitive results in classification and generation tasks, its applicability to JEPA’s masked latent prediction objective remains unexplored. We compare 7 architectures (Transformer, Vanilla Mamba, Bidirectional Mamba (BiMamba), and 4 Sequential Attention variants) across 5 datasets ranging from simple images (Moving MNIST) to complex videos (HMDB-51). Our key finding is that fine-grained temporal ambiguity in the task correlates with architecture suitability: on tasks with coarse temporal structure, Transformer remains competitive or better (ImageNet: BiMamba/TF = 1.10×; UCF-101: 1.31× ± 0.10), while on tasks requiring fine-grained temporal discrimination (HMDB-51), BiMamba consistently achieves roughly half the MSE of Transformer (0.55× ± 0.02, reproducible across 3 seeds). We also demonstrate why Sequential Attention approaches structurally fail for Mamba and confirm that modality-specific FFN separation remains beneficial even when all modalities share the same loss function. This is a toy-scale empirical study. We study architectural trends rather than claim state-of-the-art capability. Reported ratios should be interpreted as directional evidence, not production-ready benchmarks. All code, checkpoints, and results are publicly available.
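
The ratios quoted above (for example BiMamba/TF = 0.55× ± 0.02) compare the MSE of two architectures on the same task. The following is a minimal sketch of how such a ratio could be summarized across seeds. It assumes the ratio is computed per seed and then reported as mean ± sample standard deviation; the abstract does not state the exact aggregation, so this is an illustration rather than the study's procedure. The function names `mse` and `mse_ratio_summary` and all numeric values are hypothetical.

```python
from statistics import mean, stdev


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mse_ratio_summary(candidate_mse, baseline_mse):
    """Summarize per-seed MSE ratios (candidate / baseline).

    Returns (mean ratio, sample standard deviation). A mean below 1 means
    the candidate has lower error than the baseline. Per-seed aggregation
    is an assumption, not something the abstract specifies.
    """
    if len(candidate_mse) != len(baseline_mse):
        raise ValueError("need one MSE per seed for both models")
    ratios = [c / b for c, b in zip(candidate_mse, baseline_mse)]
    spread = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), spread


# Hypothetical per-seed MSE values for illustration only; NOT results from the study.
bimamba_mse = [0.11, 0.12, 0.10]
transformer_mse = [0.20, 0.21, 0.19]

ratio_mean, ratio_std = mse_ratio_summary(bimamba_mse, transformer_mse)
print(f"BiMamba/TF = {ratio_mean:.2f}x ± {ratio_std:.2f}")
```

Read this way, a ratio such as 1.10× means BiMamba's error is about 10% higher than Transformer's, while 0.55× means it is roughly half.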

Authors

Brian Kim
