Self-supervised learning (SSL) has recently emerged as one of the most promising directions in artificial intelligence, allowing models to learn visual understanding directly from large amounts of unlabeled data. Instead of relying on human annotations, SSL methods teach a model to predict consistent representations of an image under different transformations, leading to robust and transferable features. However, existing approaches for Vision Transformers (ViTs) often focus on learning a single global representation for each image, while overlooking the rich spatial structure contained in the individual image patches. This can result in representations that capture overall semantics but fail to maintain spatial consistency across the image.

In this thesis, we introduce a new method called Bag of Projected Patch Embeddings (BoPPE), designed to make better use of the information available in all image patches during training. Rather than summarizing an image through a single token or global average, BoPPE processes every patch separately, projects it into a shared embedding space, and compares it to a set of learned prototypes. The combined result of these patch-level comparisons forms a global image descriptor, ensuring that all parts of the image contribute meaningfully to learning. This simple modification enhances how gradients are distributed across the model, resulting in more coherent and spatially balanced representations that better capture object structure and context.

We evaluate BoPPE on several large-scale benchmarks, including ImageNet-100 and ImageNet-1K, using different Vision Transformer architectures. Across all settings, BoPPE consistently outperforms strong self-supervised baselines such as DINO and MSN. It achieves higher accuracy in both k-nearest neighbor (k-NN) and linear evaluation protocols, and exhibits improved robustness under common image corruptions.
Remarkably, BoPPE also performs strongly in extremely low-shot learning scenarios, where only one labeled image per class is available, improving on the baselines by up to six percentage points. These results show that the method produces representations that generalize better, even with minimal supervision or under challenging conditions.

Overall, this work demonstrates that BoPPE is an effective and scalable enhancement to current self-supervised learning frameworks. By combining global consistency with patch-level awareness, it bridges the gap between holistic and local visual understanding in Vision Transformers. The findings suggest that encouraging cooperation between image patches during training leads to more stable, discriminative, and transferable representations.
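
The patch-level aggregation described above (project each patch, compare it to prototypes, combine the comparisons into one descriptor) can be illustrated with a minimal sketch. The abstract does not specify the projection, the similarity measure, or the pooling rule, so the linear projection, cosine similarity, softmax with a `temperature`, and mean pooling used here are all assumptions, as are the function and variable names. It is written in plain Python for self-containment; a real implementation would presumably use a tensor library.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(token, weight):
    """Assumed linear projection: weight has E rows of length D."""
    return [sum(w * t for w, t in zip(row, token)) for row in weight]


def bag_of_patch_descriptor(patch_tokens, weight, prototypes, temperature=0.1):
    """Hypothetical BoPPE-style pooling (an illustration, not the thesis code).

    Each patch is projected, softly assigned to the prototypes by cosine
    similarity, and the per-patch assignments are averaged so that every
    patch contributes to the global descriptor.
    """
    protos = [l2_normalize(p) for p in prototypes]
    descriptor = [0.0] * len(protos)
    for token in patch_tokens:
        z = l2_normalize(project(token, weight))
        sims = [sum(a * b for a, b in zip(z, p)) / temperature for p in protos]
        for k, prob in enumerate(softmax(sims)):
            descriptor[k] += prob / len(patch_tokens)
    return descriptor


# Toy sizes (assumed): 16 patches, 8-dim tokens, 4-dim embeddings, 5 prototypes.
random.seed(0)
N, D, E, K = 16, 8, 4, 5
patch_tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
weight = [[random.gauss(0, 1) for _ in range(D)] for _ in range(E)]
prototypes = [[random.gauss(0, 1) for _ in range(E)] for _ in range(K)]

descriptor = bag_of_patch_descriptor(patch_tokens, weight, prototypes)
```

Under these assumptions the descriptor is a probability distribution over the prototypes, and because it is an average over all patches, a loss applied to it sends gradient to every patch rather than to a single summary token.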
Δημήτριος Μ. Κατσίκας (Wed,) studied this question.