This study demonstrates that a pretrained Vision Transformer paired with a linear classifier can achieve highly competitive performance in endoscopic image classification. We present a systematic, layerwise analysis that identifies which layer yields the most discriminative features, challenging the common heuristic of using only the final layer. We identify a distinct peak-before-the-end phenomenon, in which a late-intermediate layer provides a more generalizable representation for the downstream medical task than the final layer does. On the standard Kvasir and HyperKvasir datasets, our parameter-efficient approach achieves excellent accuracy while drastically reducing computational overhead. This work can serve as a practical guide to using features from general-purpose foundation models efficiently in clinical settings.
Taha et al. (Mon,) studied this question.
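The layerwise analysis described above can be illustrated with a minimal sketch: fit one linear probe per layer on frozen features, score each probe on held-out data, and keep the best-scoring layer. Everything in the block is an assumption made for illustration and is not taken from the study. The features are synthetic stand-ins for per-layer Vision Transformer embeddings, the `STRENGTHS` profile is hand-picked so that a late-intermediate layer is the most separable, and the probe is a nearest-class-mean rule (one simple linear classifier) rather than whatever classifier the authors trained. All function names are hypothetical.

```python
import random


def fit_class_means(features, labels, num_classes):
    """Fit a nearest-class-mean probe: one mean vector per class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for j, value in enumerate(x):
            sums[y][j] += value
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def predict(means, x):
    """Linear decision rule: argmax over c of (mu_c . x - 0.5 * ||mu_c||^2)."""
    scores = [
        sum(m * v for m, v in zip(mu, x)) - 0.5 * sum(m * m for m in mu)
        for mu in means
    ]
    return max(range(len(means)), key=scores.__getitem__)


def accuracy(means, features, labels):
    """Fraction of samples whose predicted class matches the label."""
    hits = sum(predict(means, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


def select_best_layer(train_feats, train_labels, val_feats, val_labels, num_classes):
    """Fit one probe per layer; return (best layer index, per-layer accuracies)."""
    accs = []
    for layer_train, layer_val in zip(train_feats, val_feats):
        means = fit_class_means(layer_train, train_labels, num_classes)
        accs.append(accuracy(means, layer_val, val_labels))
    best = max(range(len(accs)), key=accs.__getitem__)
    return best, accs


def synthetic_layer_features(strengths, labels, dim, rng):
    """ASSUMPTION: toy features, unit Gaussian noise plus a class-specific shift.

    A larger strength makes a layer's classes easier to separate linearly.
    """
    per_layer = []
    for strength in strengths:
        layer = []
        for y in labels:
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[y] += 3.0 * strength  # shift the coordinate tied to class y
            layer.append(x)
        per_layer.append(layer)
    return per_layer


rng = random.Random(0)
NUM_CLASSES, DIM = 4, 16
# ASSUMPTION: hand-picked separability per layer, peaking at index 9 of 0..11.
STRENGTHS = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 1.2, 0.7, 0.5]

train_labels = [rng.randrange(NUM_CLASSES) for _ in range(400)]
val_labels = [rng.randrange(NUM_CLASSES) for _ in range(400)]
train_feats = synthetic_layer_features(STRENGTHS, train_labels, DIM, rng)
val_feats = synthetic_layer_features(STRENGTHS, val_labels, DIM, rng)

best_layer, layer_accs = select_best_layer(
    train_feats, train_labels, val_feats, val_labels, NUM_CLASSES
)
print("best layer:", best_layer)
print("validation accuracy per layer:", [round(a, 3) for a in layer_accs])
```

In practice the synthetic lists would be replaced by features extracted once from each layer of a frozen backbone, so the only trained parameters are those of the probes. That the selected layer here is a late-intermediate one is a property of the toy `STRENGTHS` profile, not evidence for the peak-before-the-end phenomenon itself.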