The Transformer architecture drives modern Artificial Intelligence (AI), yet the physical principles that may constrain self-attention training remain poorly characterized. We develop a thermodynamic framework for attention training, drawing on the established Boltzmann correspondence between softmax attention and equilibrium statistical mechanics, and we propose a First Law analogue that decomposes the training energy budget into a heat term (the entropic cost of ordering attention) and a work term (the gain in mutual information about the target). From this framework we derive a Landauer-type bound on learning, which states that the loss reduction during training is bounded below by the entropic cost of structuring attention against thermal noise. The bound is satisfied across all configurations tested: 625 grid points spanning three datasets on a compact Vision Transformer trained from scratch (MNIST, CIFAR-10, and OrganAMNIST), and ten temperatures on a pretrained ViT-Small fine-tuned on Food-101. Reusing the same physical principles at inference time, we show that the thermodynamic work performed by each input patch provides a quantitative, energy-based measure of feature importance that outperforms standard attention weights and Integrated Gradients on ImageNet across pretrained ViT-Small, ViT-Base, and ViT-Large (22M to 304M parameters). The result is an integrated diagnostic framework that links phase structure, training-time bounds, and inference-time attribution within a single empirically falsifiable thermodynamic apparatus.
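The Boltzmann correspondence mentioned in the abstract can be illustrated with a minimal sketch. It assumes the textbook mapping in which attention scores act as negative energies, E_i = -s_i, and softmax at temperature T is the corresponding Boltzmann distribution. The function names, example scores, and temperature below are hypothetical and chosen only for illustration. The code checks the standard equilibrium identity F = U - TS; it does not reproduce the paper's First Law analogue or its Landauer-type bound.

```python
import math

def boltzmann_attention(scores, temperature=1.0):
    """Softmax over scores at temperature T: a Boltzmann distribution with E_i = -s_i."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of the attention distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def free_energy(scores, temperature=1.0):
    """Helmholtz free energy F = -T log Z, with Z = sum_i exp(s_i / T)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return -temperature * log_z

def mean_energy(scores, p):
    """Expected energy U = sum_i p_i E_i under the assumed mapping E_i = -s_i."""
    return -sum(pi * s for pi, s in zip(p, scores))

# Hypothetical attention scores for one query over four keys.
scores = [2.0, 1.0, 0.5, -1.0]
T = 0.7

p = boltzmann_attention(scores, T)
F = free_energy(scores, T)
U = mean_energy(scores, p)
S = entropy(p)

# Equilibrium identity: F = U - T * S holds up to floating-point error.
identity_gap = abs(F - (U - T * S))

# Lower temperature sharpens attention and so lowers its entropy.
S_cold = entropy(boltzmann_attention(scores, 0.1))
S_hot = entropy(boltzmann_attention(scores, 10.0))
```

Under these assumptions, `identity_gap` is at floating-point precision, and `S_cold` is smaller than `S_hot`, which matches the intuition that ordering attention (concentrating it on fewer keys) reduces its entropy.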
Roberto C. Sotero
University of Calgary
Jose M. Sanchez-Bornot
University of Ulster
AI
Sotero et al. studied this question.
synapsesocial.com/papers/6a17db9a3fad632b0f9d8600 — DOI: https://doi.org/10.3390/ai7060194