What question did this study set out to answer?

The research aims to improve object detection in dense scenes without human annotations using pseudo-labels.

March 14, 2026 | Open Access

Multi-Scale Tiling for Pseudo-Labeling in Dense Scene Object Detection

Key Points

The research aims to improve object detection in dense scenes without human annotations using pseudo-labels.
Utilized GroundingDINO, a Vision-Language model, for generating pseudo-labels from images.
Introduced Multi-Scale Tiling Inference (MSTI) to enhance coverage by processing overlapping image tiles at different scales.
Implemented a pre-filtering mechanism for adaptive application of MSTI based on image content.
Achieved significantly more accurate pseudo-labels compared to naive baselines with GroundingDINO.
Demonstrated notable improvements in detectors trained on the new labels compared with detectors trained with Cut-and-LEaRn (CutLER).
Showed that the AdaMSTI pipeline generates high-quality pseudo-labels while optimizing computational efficiency.

Abstract

Collecting annotations for dense scene object detection (OD) is notoriously labor-intensive, as real-world images often contain hundreds of small, overlapping objects. In many practical scenarios, even annotating a few images is tedious and frustrating, motivating the need for unsupervised solutions. Despite its importance, this problem remains largely unexplored, with no existing methods specifically designed to address it. Therefore, in this work, we tackle the problem of training object detectors for dense scenes without any human annotations. Leveraging GroundingDINO [1], a recent Vision-Language model (VLM) that supports open-vocabulary detection conditioned on textual prompts, we explore its potential for generating pseudo-labels (PLs) directly from images. However, standard inference with this model often fails to capture all relevant objects in crowded scenes. To overcome this problem, we introduce Multi-Scale Tiling Inference (MSTI), a strategy that applies the VLM over overlapping image tiles at multiple tile scales to enhance performance through improved coverage. We further propose a pre-filtering mechanism that adaptively determines when and where in an image MSTI should be applied. This results in a final pipeline, AdaMSTI, that balances accuracy with computational efficiency. The results from experiments conducted on two datasets, CrowdHuman [2] and SKU110K [3], highlight the efficiency of our method: the PLs generated by the proposed method are significantly more accurate than the naive baselines with GroundingDINO [1], leading to notable improvements in detectors trained on these labels compared to those trained with Cut-and-LEaRn (CutLER) [4], the state-of-the-art unsupervised OD method for curated data. Furthermore, on CrowdHuman [2], where the object counts per image vary, the full AdaMSTI pipeline produces high-quality PLs while reducing unnecessary computation.
Detectors trained on AdaMSTI PLs achieve higher validation performance than those trained on naive or full MSTI PLs.
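
The tiling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tile scales, the overlap ratio, or how detections from different tiles are merged, so the values and the greedy non-maximum suppression (NMS) used here are assumptions. The function names (`tile_origins`, `make_tiles`, `multi_scale_tiling_inference`) and the `detect` callable are hypothetical; `detect` stands in for running a detector such as GroundingDINO on one tile and returning boxes in tile-local coordinates.

```python
from typing import Callable, List, Sequence, Tuple

Tile = Tuple[int, int, int, int]                # x1, y1, x2, y2 in image coordinates
Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score


def tile_origins(length: int, tile: int, overlap: float) -> List[int]:
    """Start offsets of tiles of size `tile` covering [0, length)."""
    if tile >= length:
        return [0]
    stride = max(1, int(tile * (1.0 - overlap)))
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def make_tiles(width: int, height: int, scales: Sequence[float],
               overlap: float = 0.25) -> List[Tile]:
    """Overlapping tiles at several scales (scale 1.0 is the full image)."""
    tiles = []
    for s in scales:
        tw, th = max(1, int(width * s)), max(1, int(height * s))
        for y in tile_origins(height, th, overlap):
            for x in tile_origins(width, tw, overlap):
                tiles.append((x, y, x + tw, y + th))
    return tiles


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_thr: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the highest-scoring box of each overlapping group."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


def multi_scale_tiling_inference(width: int, height: int,
                                 detect: Callable[[Tile], List[Box]],
                                 scales: Sequence[float] = (1.0, 0.5),
                                 overlap: float = 0.25,
                                 iou_thr: float = 0.5) -> List[Box]:
    """Run `detect` on every tile, shift boxes to image coordinates, merge."""
    merged: List[Box] = []
    for tile in make_tiles(width, height, scales, overlap):
        ox, oy = tile[0], tile[1]
        for bx1, by1, bx2, by2, score in detect(tile):
            merged.append((bx1 + ox, by1 + oy, bx2 + ox, by2 + oy, score))
    return nms(merged, iou_thr)
```

Because neighboring tiles overlap, the same object is typically detected several times, which is why some merging step is needed after the boxes are mapped back to image coordinates. An adaptive variant in the spirit of AdaMSTI would additionally decide, per image or per region, whether the tiled pass is run at all; that decision rule is not described in the abstract and is left out of this sketch.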
