What question did this study set out to answer?

The study asks how to improve small object detection in DETR-based models, which tend to miss the fine details of small objects, lose global context, and generalize poorly across datasets.

February 9, 2026 | Open Access

Small Object Detection with Efficient Multi-Scale Collaborative Attention and Depth Feature Fusion Based on Detection Transformer

Key Points

The research aims to enhance small object detection by addressing limitations in existing DETR-based models.
Developed a novel detection model named ED-DETR.
Implemented an efficient multi-scale collaborative attention mechanism (EMCA).
Incorporated DepthPro, a zero-shot metric monocular depth estimation model, to generate depth maps.
Created an adaptive feature fusion module that integrates depth maps with RGB feature maps.
Achieved 33.6% mean Average Precision (mAP) on the AI-TOD-V2 dataset.
Outperformed previous CNN-based and DETR-based methods for small object detection.
Demonstrated excellent generalization on VisDrone and COCO datasets.

Abstract

Existing DEtection TRansformer (DETR)-based object detection methods have been widely applied to standard object detection tasks but still struggle with small objects. They frequently miss the fine details of small objects and fail to preserve global context, particularly under scale variation or occlusion, so the resulting feature maps lack sufficient spatial and structural information. Moreover, DETR-based models designed specifically for small object detection often generalize poorly and are difficult to adapt to datasets with diverse object scales and complex backgrounds. To address these issues, this paper proposes ED-DETR, a DETR-based model for small object detection with efficient multi-scale collaborative attention and depth feature fusion. It consists of three core modules: (1) an efficient multi-scale collaborative attention mechanism (EMCA); (2) DepthPro, a zero-shot metric monocular depth estimation model; and (3) an adaptive feature fusion module for depth maps and feature maps. Specifically, EMCA extends the single-space attention mechanism in efficient multi-scale attention (EMA) to a composite structure of parallel spatial and channel attention, enhancing ED-DETR’s ability to express features collaboratively in both the spatial and channel dimensions. DepthPro generates depth maps that supply depth information. The adaptive feature fusion module integrates this depth information with RGB visual features, improving ED-DETR’s ability to perceive object position, scale, and occlusion. Experimental results show that ED-DETR achieves a state-of-the-art 33.6% mAP on the AI-TOD-V2 dataset, which predominantly contains tiny objects, outperforming previous CNN-based and DETR-based methods, and that it generalizes well to the VisDrone and COCO datasets.
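The abstract does not give the internals of EMCA or the fusion module, so the following is only a minimal NumPy sketch of the two general ideas it names: attention applied in parallel along the channel and spatial dimensions, and a gate that blends a depth map into RGB features. The function names, the averaging of the two branches, and the per-channel `gate_weight` parameter are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def parallel_attention(feat):
    """Reweight a (C, H, W) feature map with channel and spatial attention in parallel.

    Hypothetical stand-in for the general idea behind EMCA, not its actual design.
    """
    # Channel branch: global average pooling gives one weight per channel.
    channel_w = sigmoid(feat.mean(axis=(1, 2)))[:, None, None]  # (C, 1, 1)
    # Spatial branch: averaging over channels gives one weight per location.
    spatial_w = sigmoid(feat.mean(axis=0))[None, :, :]  # (1, H, W)
    # Combine the two reweighted maps (simple average; an assumption).
    return 0.5 * (feat * channel_w + feat * spatial_w)


def gated_depth_fusion(rgb_feat, depth_map, gate_weight, gate_bias=0.0):
    """Blend a (H, W) depth map into (C, H, W) RGB features with a sigmoid gate.

    gate_weight is a hypothetical per-channel parameter of shape (C,); in a
    trained model it would be learned.
    """
    d = depth_map[None, :, :]  # (1, H, W)
    # Min-max normalize depth to [0, 1] so it is comparable across images.
    d_norm = (d - depth_map.min()) / (depth_map.max() - depth_map.min() + 1e-8)
    # The gate decides, per channel and location, how much RGB signal to keep.
    gate = sigmoid(gate_weight[:, None, None] * d_norm + gate_bias)  # (C, H, W)
    return gate * rgb_feat + (1.0 - gate) * d_norm


rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((8, 16, 16))         # stand-in backbone features
depth_map = rng.uniform(0.5, 50.0, size=(16, 16))   # stand-in metric depth (meters)

attended = parallel_attention(rgb_feat)
fused = gated_depth_fusion(attended, depth_map, gate_weight=np.ones(8))
print(attended.shape, fused.shape)  # (8, 16, 16) (8, 16, 16)
```

In a real detector the depth map would come from a depth estimator such as DepthPro and be resized to each feature level, and the gate would typically be produced by learned convolutions rather than a fixed per-channel weight.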
