We consider the compression of a depth-2 feed-forward layer of Transformer into a single fully connected layer. To model this, we take a binary vector with independent entries as input. We define the event A to be that for two disjoint subsets of size k of the 0 , 1 entries of the vector, all entries of at least one of the subsets are equal to 1 . This represents the information of two layers of a feed-forward layer. We study the approximation of the event A by applying a linear functional to the binary vector, followed by a Heaviside (threshold) function. We establish an explicit lower bound on the relative error of any such approximation, valid for all choices of linear functionals. Notably, this lower bound approaches 1 / 16 as k becomes large. This result provides a theoretical explanation for the well-known heuristic that sparse Transformers, although requiring more parameters, achieve better performance. If it were possible to approximate A accurately with a dense representation, one could convert sparse architectures to dense ones without any loss in performance—but our result shows that such a compression necessarily incurs a significant error.
Matzinger et al. (Mon,) studied this question.