What question did this study set out to answer?

This research aims to improve self-supervised representations in Vision Transformers by making better use of the spatial structure carried by individual image patches.

May 7, 2026 | Open Access

Self-Supervised Learning using Bag of Image Patch Embeddings

Key Points

Introduced Bag of Projected Patch Embeddings (BoPPE), a self-supervised method for Vision Transformers that uses every image patch during training.
Evaluated performance on benchmarks including ImageNet-100 and ImageNet-1K.
Compared BoPPE with strong self-supervised baselines like DINO and MSN.
BoPPE outperformed baseline models consistently across benchmarks.
Achieved higher accuracy in k-nearest neighbor and linear evaluation protocols.
Demonstrated improved robustness under image corruptions and in low-shot learning scenarios.
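
The k-nearest neighbor protocol named in the key points can be illustrated with a minimal, dependency-free sketch. It assumes features have already been extracted by a frozen encoder; the function names, the toy features, and the unweighted majority vote are all illustrative assumptions rather than the study's actual evaluation code (published protocols often weight votes by similarity).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_features, train_labels, k=3):
    """Label a query by majority vote among its k most similar training features."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: cosine(query, train_features[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_features, test_labels, train_features, train_labels, k=3):
    """Fraction of test features whose k-NN prediction matches the true label."""
    correct = sum(
        knn_predict(f, train_features, train_labels, k) == y
        for f, y in zip(test_features, test_labels)
    )
    return correct / len(test_labels)


# Toy stand-ins for frozen-encoder features (hypothetical values).
train_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]
test_features = [[0.8, 0.2], [0.2, 0.8]]
test_labels = ["cat", "dog"]

accuracy = knn_accuracy(test_features, test_labels, train_features, train_labels)
print(accuracy)  # 1.0
```

Because the encoder stays frozen and no classifier is trained, this kind of evaluation measures how well the learned feature space already separates classes.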

Abstract

Self-supervised learning (SSL) has recently emerged as one of the most promising directions in artificial intelligence, allowing models to learn visual understanding directly from large amounts of unlabeled data. Instead of relying on human annotations, SSL methods teach a model to predict consistent representations of an image under different transformations, leading to robust and transferable features. However, existing approaches for Vision Transformers (ViTs) often focus on learning a single global representation for each image, while overlooking the rich spatial structure contained in the individual image patches. This can result in representations that capture overall semantics but fail to maintain spatial consistency across the image. In this thesis, we introduce a new method called Bag of Projected Patch Embeddings (BoPPE), designed to make better use of the information available in all image patches during training. Rather than summarizing an image through a single token or global average, BoPPE processes every patch separately, projects it into a shared embedding space, and compares it to a set of learned prototypes. The combined result of these patch-level comparisons forms a global image descriptor, ensuring that all parts of the image contribute meaningfully to learning. This simple modification enhances how gradients are distributed across the model, resulting in more coherent and spatially balanced representations that better capture object structure and context. We evaluate BoPPE on several large-scale benchmarks, including ImageNet-100 and ImageNet-1K, using different Vision Transformer architectures. Across all settings, BoPPE consistently outperforms strong self-supervised baselines such as DINO and MSN. It achieves higher accuracy in both k-nearest neighbor (k-NN) and linear evaluation protocols, and exhibits improved robustness under common image corruptions. 
Remarkably, BoPPE also performs strongly in extremely low-shot learning scenarios, where only one labeled image per class is available, improving on the baselines by up to six percentage points. These results show that the method produces representations that generalize better, even with minimal supervision or under challenging conditions. Overall, this work demonstrates that BoPPE is an effective and scalable enhancement to current self-supervised learning frameworks. By combining global consistency with patch-level awareness, it bridges the gap between holistic and local visual understanding in Vision Transformers. The findings suggest that encouraging cooperation between image patches during training leads to more stable, discriminative, and transferable representations.
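
The patch-level pipeline described in the abstract (project each patch, compare it to learned prototypes, combine the results into one image descriptor) can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure, the normalization, or the aggregation rule, so the cosine similarity, the softmax with a temperature, the averaging over patches, and every name below (`bag_of_patch_descriptor`, `projection`, `prototypes`, `temperature`) are hypothetical choices. A real implementation would operate on ViT patch tokens with a tensor library, and the training objective is not shown.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def bag_of_patch_descriptor(patches, projection, prototypes, temperature=0.1):
    """Combine per-patch prototype assignments into one image descriptor.

    Assumed design: cosine similarity, softmax assignment, mean over patches.
    """
    assignments = []
    for patch in patches:
        # Project the patch into the shared embedding space.
        z = l2_normalize(matvec(projection, patch))
        # Compare the projected patch with every (unit-length) prototype.
        sims = [sum(a * b for a, b in zip(z, p)) for p in prototypes]
        assignments.append(softmax(sims, temperature))
    # Average over patches so that every patch contributes to the descriptor.
    return [sum(col) / len(assignments) for col in zip(*assignments)]


# Toy data standing in for patch tokens and learned parameters.
rng = random.Random(0)
num_patches, patch_dim, embed_dim, num_prototypes = 16, 8, 4, 5
patches = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
projection = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(embed_dim)]
prototypes = [
    l2_normalize([rng.gauss(0, 1) for _ in range(embed_dim)])
    for _ in range(num_prototypes)
]

descriptor = bag_of_patch_descriptor(patches, projection, prototypes)
print(len(descriptor))  # 5, one entry per prototype
```

Under these assumptions the descriptor is a probability distribution over prototypes, and because it is an average over patches, a loss applied to it sends gradient to every patch rather than to a single summary token.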

Δημήτριος Μ. Κατσίκας studied this question.

synapsesocial.com/papers/69fbefc0164b5133a91a3bde https://doi.org/10.26262/heal.auth.ir.372587
