Self-supervised learning (SSL) has shown great promise in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and image modeling. While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for downstream tasks that require fine-grained or spatially localized representations. In this focused survey, we study SSL for object detection under challenging practical conditions, with particular emphasis on small object detection, domain shift, and few-shot learning. Building upon previous surveys, we not only provide a detailed comparison of SSL strategies, but also assess their effectiveness for object detection using both CNN- and ViT-based architectures. To keep the benchmark fair, we ourselves fine-tune a Faster R-CNN initialized with several representative SSL methods, including object-level instance discrimination and masked image modeling methods, on the widely used COCO dataset as well as on a domain-specific dataset focused on vehicle detection in infrared remote sensing imagery. We also evaluate the impact of pre-training on custom domain-specific datasets, highlighting how some SSL strategies are better suited to handling uncurated data. Furthermore, we assess the methods in few-shot settings and at inference on noisy input, revealing important behavioral differences depending on the type of encoder used. Our findings highlight that combining approaches with complementary local and global biases improves performance across the evaluated object detection settings. Overall, this survey provides a practical guide for selecting optimal SSL strategies in different scenarios.

• We propose a survey on self-supervised learning for real-world object detection.
• In our benchmarks, we pay attention to small object detection performance.
• Challenging conditions such as frugal settings or remote sensing data are considered.
• The benefits of pre-training on custom domain-specific datasets are assessed.
• A road map for selecting appropriate self-supervised learning strategies is provided.
Alina Ciocarlan
Université Paris-Saclay
Sidonie Lefèbvre
Université Paris-Saclay
Sylvie Le Hégarat‐Mascle
Centre National de la Recherche Scientifique
Computer Vision and Image Understanding
Université Paris-Saclay
Office National d'Études et de Recherches Aérospatiales
Laboratoire des systèmes et applications des technologies de l'information et de l'énergie
Ciocarlan et al. studied this question.
synapsesocial.com/papers/69f04e08727298f751e7201e — DOI: https://doi.org/10.1016/j.cviu.2026.104783