What question did this study set out to answer?

The aim is to quantify the information loss incurred by compressing a depth-2 feed-forward layer of a Transformer into a single fully connected layer.

June 11, 2026 · Open Access

Lower bound on the information-loss incurred by compressing a depth-2 feed-forward layer of transformer into a single fully connected layer

Key Points

The aim is to quantify the information loss incurred by compressing a depth-2 feed-forward layer of a Transformer into a single fully connected layer.
The input is modeled as a binary vector with independent entries.
The event A is that, for two disjoint subsets of size k of the entries, all entries of at least one subset are equal to 1.
Approximations of A are obtained by applying a linear functional to the vector, followed by a Heaviside (threshold) function.
An explicit lower bound on the relative error of any such approximation is established, valid for all linear functionals.
This lower bound approaches 1/16 as k increases, so the error remains significant for large k.
The result supports the heuristic that sparse Transformers perform better, even though they require more parameters.
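
The setup in the key points above can be written out as follows. This is only a sketch in notation chosen here: the dimension n, the weights w, the threshold b, and the symbol \rho are assumptions. The normalization of the relative error in particular is one natural choice and is not taken from the source.

```latex
% Input: a binary vector with independent entries (n is an assumed symbol).
x = (x_1, \dots, x_n) \in \{0,1\}^n

% Two disjoint index sets of size k.
S_1, S_2 \subset \{1, \dots, n\}, \qquad S_1 \cap S_2 = \emptyset, \qquad |S_1| = |S_2| = k

% Event A: all entries of at least one of the two subsets equal 1.
A = \{ x_i = 1 \ \forall i \in S_1 \} \cup \{ x_i = 1 \ \forall i \in S_2 \}

% Approximating event: a linear functional followed by the Heaviside function H.
B_{w,b} = \{ H(\langle w, x \rangle - b) = 1 \}, \qquad w \in \mathbb{R}^n, \ b \in \mathbb{R}

% Assumed definition of the relative error (not confirmed by the source).
\rho(w, b) = \frac{\mathbb{P}(A \,\triangle\, B_{w,b})}{\mathbb{P}(A)}
```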

Abstract

We consider the compression of a depth-2 feed-forward layer of a Transformer into a single fully connected layer. To model this, we take a binary vector with independent entries as input. Given two disjoint subsets of size k of the {0, 1}-valued entries of the vector, we define A to be the event that all entries of at least one of the two subsets are equal to 1. This event represents the information carried by the two layers of the feed-forward block. We study the approximation of A obtained by applying a linear functional to the binary vector, followed by a Heaviside (threshold) function. We establish an explicit lower bound on the relative error of any such approximation, valid for all choices of linear functional. Notably, this lower bound approaches 1/16 as k becomes large. This result provides a theoretical explanation for the well-known heuristic that sparse Transformers, although requiring more parameters, achieve better performance. If it were possible to approximate A accurately with a dense representation, one could convert sparse architectures to dense ones without any loss in performance, but our result shows that such a compression necessarily incurs a significant error.
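
To make the setup concrete, the following minimal sketch enumerates all inputs for a small k and measures how well one particular threshold function approximates A. Several choices here are assumptions rather than details from the study: the bits are taken to be fair and independent, the vector has exactly 2k entries (the two subsets are its first and second halves), the Heaviside function is taken to be 1 at zero, and the relative error is taken to be P(A xor B) / P(A). The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def event_a(x, k):
    # S1 = first k coordinates, S2 = next k coordinates (disjoint by construction).
    # A holds when all entries of at least one subset equal 1.
    return all(x[:k]) or all(x[k:2 * k])


def threshold(x, w, b):
    # Heaviside of a linear functional, with the assumed convention H(0) = 1:
    # returns True exactly when <w, x> - b >= 0.
    return sum(wi * xi for wi, xi in zip(w, x)) >= b


def relative_error(k, w, b):
    # Assumption: fair independent bits, so all 2^(2k) inputs are equally likely
    # and probabilities reduce to counts.
    # Assumption: relative error = P(A xor B) / P(A).
    n_a = 0
    n_diff = 0
    for x in product((0, 1), repeat=2 * k):
        a = event_a(x, k)
        n_a += a
        n_diff += a != threshold(x, w, b)
    return Fraction(n_diff, n_a)


def best_uniform_threshold(k):
    # Search only over thresholds b for the all-ones functional w = (1, ..., 1).
    w = [1] * (2 * k)
    return min((relative_error(k, w, b), b) for b in range(2 * k + 2))


if __name__ == "__main__":
    print(best_uniform_threshold(2))  # (Fraction(2, 7), 3)
```

For k = 2 the event A contains 7 of the 16 inputs, and the best all-ones threshold (b = 3) misses two of them, giving 2/7 under the assumed error measure. This explores a single family of functionals for a tiny k, so it only illustrates the quantity being bounded; it does not reproduce the bound itself, which covers all linear functionals.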
