Collecting annotations for object detection (OD) in dense scenes is notoriously labor-intensive, as real-world images often contain hundreds of small, overlapping objects. In many practical scenarios, annotating even a few images is tedious and frustrating, which motivates unsupervised solutions. Despite its importance, this problem remains largely unexplored, with no existing methods specifically designed to address it. In this work, we therefore tackle the problem of training object detectors for dense scenes without any human annotations. We leverage GroundingDINO [1], a recent vision-language model (VLM) that supports open-vocabulary detection conditioned on textual prompts, and explore its potential for generating pseudo-labels (PLs) directly from images. However, standard inference with this model often fails to capture all relevant objects in crowded scenes. To overcome this problem, we introduce Multi-Scale Tiling Inference (MSTI), a strategy that applies the VLM to overlapping image tiles at multiple tile scales to improve coverage. We further propose a pre-filtering mechanism that adaptively determines when and where in an image MSTI should be applied. The resulting pipeline, AdaMSTI, balances accuracy with computational efficiency. Experiments on two datasets, CrowdHuman [2] and SKU110K [3], highlight the effectiveness of our method: the PLs it generates are significantly more accurate than those of the naive GroundingDINO [1] baselines, leading to notable improvements in detectors trained on these labels compared with detectors trained with Cut-and-LEaRn (CutLER) [4], the state-of-the-art unsupervised OD method for curated data. Furthermore, on CrowdHuman [2], where object counts per image vary, the full AdaMSTI pipeline produces high-quality PLs while reducing unnecessary computation.
Detectors trained on AdaMSTI PLs achieve higher validation performance than those trained on naive or full MSTI PLs.
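The tiling idea behind MSTI can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how tiles are sized or how per-tile detections are merged, so the tile scales (expressed here as fractions of the image size), the overlap ratio, and the use of greedy non-maximum suppression for merging are all assumptions. The `detect` callable is a hypothetical stand-in for running the VLM on one image crop, and the adaptive pre-filtering step of AdaMSTI is not covered.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)
Tile = Tuple[int, int, int, int]         # tile rectangle in image coordinates


def tile_origins(length: int, tile: int, overlap: float) -> List[int]:
    """Start offsets of overlapping tiles that cover [0, length)."""
    if tile >= length:
        return [0]
    stride = max(1, int(tile * (1.0 - overlap)))
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def make_tiles(width: int, height: int, scales: Sequence[float],
               overlap: float = 0.2) -> List[Tile]:
    """Overlapping tiles at several scales (assumed: fractions of image size)."""
    tiles = []
    for s in scales:
        tw, th = max(1, int(width * s)), max(1, int(height * s))
        for y in tile_origins(height, th, overlap):
            for x in tile_origins(width, tw, overlap):
                tiles.append((x, y, min(x + tw, width), min(y + th, height)))
    return tiles


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets: List[Detection], iou_thr: float = 0.5) -> List[Detection]:
    """Greedy NMS: keep the highest-scoring box among heavy overlaps."""
    kept: List[Detection] = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept


def multi_scale_tiling_inference(
    width: int, height: int,
    detect: Callable[[Tile], List[Detection]],  # hypothetical detector hook
    scales: Sequence[float] = (1.0, 0.5),
    overlap: float = 0.2, iou_thr: float = 0.5,
) -> List[Detection]:
    """Run `detect` on every tile, shift boxes to image coordinates, merge."""
    merged: List[Detection] = []
    for x1, y1, x2, y2 in make_tiles(width, height, scales, overlap):
        # `detect` is assumed to return boxes in tile-local coordinates.
        for (bx1, by1, bx2, by2), score in detect((x1, y1, x2, y2)):
            merged.append(((bx1 + x1, by1 + y1, bx2 + x1, by2 + y1), score))
    return nms(merged, iou_thr)
```

In practice, `detect` would crop the image to the tile rectangle and run the open-vocabulary detector with a text prompt. Objects that appear in several overlapping tiles produce duplicate boxes, which is why some merging step is needed; NMS is one common choice, used here only for illustration.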
Dinh et al. (Sun,) studied this question.