The design of residual connections in large-scale Transformer architectures has emerged as a critical axis for improving training stability and model performance. Xie et al. (2026) propose Manifold-Constrained Hyper-Connections (mHC), which constrain the residual mixing matrix H_res to the Birkhoff polytope via the Sinkhorn–Knopp algorithm, claiming improved training stability over unconstrained Hyper-Connections (HC) and a training overhead of only 6.7% relative to a standard baseline. This paper presents a critical analysis of the theoretical and empirical claims of Xie et al. (2026) and a small-scale reproducibility study comparing a 124M-parameter mHC Transformer against HC and baseline counterparts trained on WikiText-2 using pure PyTorch on a single NVIDIA RTX 4060 GPU. Three findings are reported: the doubly stochastic constraint functions as claimed, maintaining an H_res gain of exactly 1.000 throughout training; the reported efficiency does not reproduce without proprietary infrastructure, with mHC introducing 208.2% per-step overhead under standard execution compared to the claimed 6.7%; and performance gains are scale-dependent, with the baseline outperforming both HC and mHC on validation perplexity at this scale, raising questions about the generalisability of results obtained exclusively on MoE architectures with a single training run.
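
The constraint described in the abstract can be illustrated with a minimal sketch. It is not the implementation of Xie et al. or of the reproducibility study (which uses PyTorch). It assumes that the mixing matrix is parameterised by unconstrained logits, made positive by exponentiation, and then alternately row- and column-normalised, and that "gain" means the largest absolute row or column sum. The names `sinkhorn_knopp`, `max_gain`, and `n_iters`, the 4×4 size, and the logit values are all hypothetical.

```python
import math


def sinkhorn_knopp(logits, n_iters=50):
    """Approximately map a square matrix of logits to a doubly stochastic matrix."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive; subtracting the
    # maximum logit keeps the exponentials numerically well behaved.
    peak = max(max(row) for row in logits)
    m = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def max_gain(m):
    """Largest absolute row or column sum (assumed meaning of 'gain')."""
    n = len(m)
    row_gain = max(sum(abs(x) for x in row) for row in m)
    col_gain = max(sum(abs(m[i][j]) for i in range(n)) for j in range(n))
    return max(row_gain, col_gain)


# Illustrative logits for four residual streams (made-up values).
logits = [
    [0.10, 0.50, -0.30, 0.20],
    [-0.40, 0.00, 0.30, 0.10],
    [0.25, -0.15, 0.45, -0.50],
    [0.05, 0.35, -0.20, 0.40],
]
h_res = sinkhorn_knopp(logits)
gain = max_gain(h_res)
print(f"gain after projection: {gain:.6f}")
```

Under these assumptions every row and column of `h_res` sums to 1 up to floating-point error, so the measured gain stays at 1 regardless of the logit values. This matches the abstract's first finding that the doubly stochastic constraint holds the H_res gain at 1.000. The repeated normalisation passes are also extra per-step work when run as plain, unfused operations.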
Thomas Jego
Pôle Universitaire Léonard de Vinci
Thomas Jego (Tue,) studied this question.
www.synapsesocial.com/papers/69a91e4cd6127c7a504c21b0 — DOI: https://doi.org/10.5281/zenodo.18852696