This paper develops a unified geometric and algebraic analysis of the attention mechanism in Transformer architectures. The central observation is that, given a token embedding matrix $X \in \mathbb{R}^{N \times d}$, every similarity matrix producible by a single attention head lies inside a nested hierarchy $V^{(d_k)}(X) \subseteq V(X) \subseteq \mathbb{R}^{N \times N}$, where $V(X)$ is a vector subspace of dimension at most $r^2$ (with $r = \operatorname{rk}(X)$) and $V^{(d_k)}(X)$ is its intersection with the rank-$d_k$ determinantal variety. From this structure we derive results of three kinds. Principal contributions: (i) a matching necessary-and-sufficient discriminability theorem showing that the minimum number of attention heads achievable by some choice of query projections is exactly $d/d_k$, providing an a priori structural justification for the standard design $h\,d_k = d$ that complements the capacity-based account and addresses the gap; (ii) a tighter effective-rank version $r/d_k$ when $\operatorname{rk}(X) = r < d$; (iii) an exact characterization $\dim V(X) = r^2$ and a structural capacity bound on the family of attention operators that is independent of $N$, $d$, and $H$. Consistency checks with the established literature: we recover, within the geometric framework, the asymmetric-kernel view, the relative-distance property of Rotary Position Embedding, and the gradient-stability advantage of Pre-LN over Post-LN. We are explicit about which results are new and which are reformulations. All theorems are accompanied by numerical experiments confirming that the predicted bounds are tight.
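The dimension and rank claims above can be spot-checked numerically. The following is a minimal sketch, not the paper's own experiment. It assumes (the abstract does not state this) that $V(X)$ is the set of matrices $X M X^\top$ with $M \in \mathbb{R}^{d \times d}$, and that a single head's similarity matrix is the unscaled pre-softmax product $X W_Q W_K^\top X^\top$. The function names and the sizes `N`, `d`, `r`, `d_k` are hypothetical choices made for illustration.

```python
import numpy as np


def similarity_subspace_dim(X):
    """Dimension of {X M X^T : M in R^{d x d}} (assumed definition of V(X)).

    Since vec(X M X^T) = (X kron X) vec(M), the dimension of the span
    equals the rank of the Kronecker product X kron X.
    """
    return int(np.linalg.matrix_rank(np.kron(X, X)))


def head_similarity(X, W_q, W_k):
    """Unscaled pre-softmax similarity matrix of one head (assumed form)."""
    return X @ W_q @ W_k.T @ X.T


# Hypothetical sizes: N tokens, embedding width d, embedding rank r, head width d_k.
rng = np.random.default_rng(0)
N, d, r, d_k = 12, 8, 5, 2

# Build an embedding matrix of rank r < d as a product of two thin factors.
X = rng.standard_normal((N, r)) @ rng.standard_normal((r, d))
W_q = rng.standard_normal((d, d_k))
W_k = rng.standard_normal((d, d_k))

rank_X = int(np.linalg.matrix_rank(X))
dim_V = similarity_subspace_dim(X)
rank_S = int(np.linalg.matrix_rank(head_similarity(X, W_q, W_k)))

print("rk(X) =", rank_X)
print("dim V(X) =", dim_V)
print("rank of one head's similarity matrix =", rank_S)
```

Under these assumptions, `dim_V` should equal `rank_X ** 2` and `rank_S` should not exceed `d_k`, in line with claim (iii) and the determinantal-variety constraint.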
Guillermo Blas Sentoni (Fri,) studied this question.