Scaling the operational sequence length of large generative models frequently introduces a structural trade-off: modifications that enable massive context ingestion consistently degrade model acuity on short-range, position-sensitive tasks. This points to a fundamental limitation of applying a homogeneous spatial representation across the entire network. To resolve this conflict without altering the training data distribution, we introduce a decoupled representational architecture. By analyzing the intrinsic functional sparsity within the model's intermediate layers, we identify a small subset of routing pathways that is inherently responsible for distant information retrieval. We propose a differential parameterization strategy: this sparse sub-network is specialized for global receptive fields via relaxed spatial constraints, while strict, high-resolution spatial constraints are maintained across the vast majority of the network's reasoning components. Experiments demonstrate that this sub-network specialization preserves high fidelity on local reasoning benchmarks while unlocking substantial performance gains on massive sequence tasks, offering a scalable solution for multi-regime sequence modeling.
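As a rough illustration of what a differential parameterization could look like, the sketch below reads "routing pathways" as attention heads and "relaxed spatial constraints" as a compressed position scale in a rotary position encoding. Both readings are assumptions made for illustration, since the abstract does not name the mechanism. The function names, the head indices, and the 0.25 scale are all hypothetical, and this is not presented as the author's implementation.

```python
import math

def rotary_angles(position, head_dim, base=10000.0, position_scale=1.0):
    """Rotation angles for one token position under a rotary-style encoding.

    position_scale < 1.0 compresses positions, which is the assumed
    stand-in here for a "relaxed" spatial constraint.
    """
    half = head_dim // 2
    scaled = position * position_scale
    return [scaled * base ** (-2.0 * i / head_dim) for i in range(half)]

def rotate(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def head_position_scales(num_heads, long_range_heads, relaxed_scale):
    """Per-head scales: relaxed for the sparse long-range subset, strict (1.0) elsewhere."""
    return [relaxed_scale if h in long_range_heads else 1.0
            for h in range(num_heads)]

def encode_heads(vectors_per_head, position, scales):
    """Apply each head's own positional rotation to that head's vector."""
    return [rotate(v, rotary_angles(position, len(v), position_scale=s))
            for v, s in zip(vectors_per_head, scales)]

# Hypothetical setup: 8 heads, of which heads 2 and 5 are assumed
# to be the long-range retrieval subset.
scales = head_position_scales(8, {2, 5}, relaxed_scale=0.25)
queries = [[1.0, 0.0, 0.0, 1.0] for _ in range(8)]
encoded = encode_heads(queries, position=4096, scales=scales)
```

With a scale of 0.25, a relaxed head sees position 4096 as if it were position 1024, while the remaining heads keep the original positional resolution. How the long-range subset is identified is left outside this sketch.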
Chandler Marshall
Chandler Marshall (Fri,) studied this question.
synapsesocial.com/papers/69e1cf625cdc762e9d8583d1 — DOI: https://doi.org/10.5281/zenodo.19597990