We present, to our knowledge, the first empirical comparison of Transformer attention and Mamba (a structured state space model) within the Joint-Embedding Predictive Architecture (JEPA). While Mamba has shown competitive results in classification and generation tasks, its applicability to JEPA's masked latent prediction objective remains unexplored. We compare 7 architectures (Transformer, Vanilla Mamba, Bidirectional Mamba (BiMamba), and 4 Sequential Attention variants) across 5 datasets ranging from simple images (Moving MNIST) to complex videos (HMDB-51). Our key finding is that the degree of fine-grained temporal ambiguity in the task correlates with architecture suitability. On tasks with coarse temporal structure, Transformer remains competitive or better (BiMamba/TF MSE ratio of 1.10× on ImageNet and 1.31× ± 0.10 on UCF-101), while on a task requiring fine-grained temporal discrimination (HMDB-51), BiMamba consistently achieves roughly half the MSE of Transformer (0.55× ± 0.02, reproducible across 3 seeds). We also demonstrate why Sequential Attention approaches structurally fail for Mamba and confirm that modality-specific FFN separation remains beneficial even when all modalities share the same loss function. This is a toy-scale empirical study: we examine architectural trends rather than claim state-of-the-art capability. Reported ratios should be interpreted as directional evidence, not production-ready benchmarks. All code, checkpoints, and results are publicly available.
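The ratios quoted above (for example, 0.55× ± 0.02 across 3 seeds) can be read as a BiMamba-to-Transformer MSE ratio summarized over seeds. The sketch below shows one plausible way to compute such a summary. It assumes per-seed ratios and a sample standard deviation, neither of which is confirmed by the abstract; the function name `mse_ratio_summary` and the input values are hypothetical placeholders, not results from the study.

```python
from statistics import mean, stdev


def mse_ratio_summary(candidate_mse, baseline_mse):
    """Return (mean, sample std) of per-seed candidate/baseline MSE ratios.

    Assumption: each position in the two lists corresponds to the same seed.
    """
    if len(candidate_mse) != len(baseline_mse):
        raise ValueError("need one MSE value per seed for both models")
    if not candidate_mse:
        raise ValueError("need at least one seed")
    ratios = [c / b for c, b in zip(candidate_mse, baseline_mse)]
    # Sample standard deviation is undefined for a single seed; report 0.0.
    spread = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), spread


# Placeholder values for illustration only; NOT numbers from the paper.
bimamba_mse = [0.2, 0.3, 0.4]
transformer_mse = [0.4, 0.5, 0.8]

ratio_mean, ratio_std = mse_ratio_summary(bimamba_mse, transformer_mse)
print(f"BiMamba/TF = {ratio_mean:.2f}x ± {ratio_std:.2f}")
```

A ratio below 1 means the candidate model has lower prediction error than the baseline, and a ratio above 1 means the baseline is better. Taking the ratio of seed-averaged MSEs instead would give a slightly different number, so the convention should be stated alongside any reported value.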
Brian Kim
Brian Kim (Sun,) studied this question.
www.synapsesocial.com/papers/69cb6589e6a8c024954b98d6 — DOI: https://doi.org/10.5281/zenodo.19323214