Modern Large Language Models (LLMs) suffer from static computation depth, where trivial and highly complex prompts consume identical vertical computational resources (FLOPs). While Mixture-of-Experts (MoE) architectures provide horizontal sparsity, they fail to address the layer-wise redundancy and GPU thread divergence caused by token-level routing. In this paper, we propose In-Depth-MoE, a novel architectural paradigm that introduces predictive sequence-level layer gating. By utilizing an ultra-lightweight Non-autoregressive Agent-Router operating on the global semantic representation of the input prompt, the system dynamically generates a binary execution mask via a Straight-Through Estimator (STE). This mask physically truncates the computation graph prior to generation, allocating full-depth processing exclusively to semantically complex queries while preemptively bypassing redundant layers for simpler tasks. Our experimental validation on a 10-layer prototype demonstrates successful bifurcation of execution states: structurally simple prompts converge on an active layer ratio of 20.00% (saving 78.50% FLOPs), whereas complex mathematical sequences scale to an active layer ratio of 80.00% (saving 18.50% FLOPs), proving robust hardware optimization without introducing thread divergence.
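The gating mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the router logits are hard-coded, the layers are toy functions, and every function name is hypothetical. The `overhead=0.015` default is also an assumption, chosen only because it reproduces the reported figures (20.00% active layers with 78.50% savings, 80.00% active layers with 18.50% savings); the abstract does not say where the 1.5-point gap comes from.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ste_binary_mask(logits, threshold=0.5):
    """Straight-through estimator, written out by hand.

    Forward: a hard 0/1 mask, one entry per layer.
    Backward: the gradient a trainer would use in place of the
    (zero almost everywhere) gradient of the threshold, here the
    sigmoid derivative p * (1 - p).
    """
    probs = [sigmoid(z) for z in logits]
    mask = [1 if p >= threshold else 0 for p in probs]
    surrogate_grad = [p * (1.0 - p) for p in probs]
    return mask, surrogate_grad


def run_gated_stack(x, layers, mask):
    """Apply only the layers whose mask bit is 1; skipped layers never run."""
    for layer, keep in zip(layers, mask):
        if keep:
            x = layer(x)
    return x


def active_layer_ratio(mask):
    return sum(mask) / len(mask)


def flops_saved(mask, overhead=0.015):
    # Assumed model: savings = skipped fraction minus a fixed overhead.
    return 1.0 - active_layer_ratio(mask) - overhead


# Hypothetical router output for a "simple" prompt on a 10-layer stack:
# two layers kept, eight skipped. One mask applies to the whole sequence.
simple_logits = [2.0, 1.5] + [-3.0] * 8
simple_mask, _ = ste_binary_mask(simple_logits)

toy_layers = [lambda v: v + 1] * 10  # each toy layer just increments
output = run_gated_stack(0, toy_layers, simple_mask)
```

Because the mask is computed once per prompt rather than per token, every token in the sequence follows the same layer path, which is the property the abstract credits with avoiding thread divergence.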
Stamboltsyan David (Sat,) studied this question.