To address severe fruit occlusion, large variations in target scale, and frequent missed detections of small targets when recognizing trellised watermelons in complex agricultural scenes, this study proposes an improved RT-DETR-based detection model, termed RT-DETR-Watermelon. A context-guided (CG) module is embedded into the backbone network. A dedicated P2 detection layer is added to enhance the model’s sensitivity to small objects. A scale sequence feature fusion (SSFF) module and a triple feature encoder (TFE) module are introduced to improve the model’s ability to detect targets at multiple scales. The original bounding box regression loss is replaced with the MPDIoU (Minimum Point Distance Intersection over Union) loss, which accelerates model convergence and improves localization precision. Finally, the number of channels is adjusted to reduce parameter count, computational complexity, and storage size. Experimental results show that, compared with the original RT-DETR model, RT-DETR-Watermelon increases precision, recall, and mean average precision at an IoU threshold of 0.5 (mAP@0.5) by 0.4, 1.8, and 1.0 percentage points, respectively, while reducing the number of parameters, computational cost, and model size by 53.5%, 23.5%, and 53.2%, respectively.
Yan et al. studied this question.
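To make the MPDIoU regression loss mentioned above concrete, the following is a minimal sketch of its commonly cited form: the IoU of the two boxes, penalized by the squared distances between their top-left corners and between their bottom-right corners, each normalized by the squared diagonal of the input image. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` box layout, and the plain-Python scalar implementation are assumptions made for illustration; the study's actual training code and any batching or framework details are not given in the text.

```python
def mpdiou_loss(pred, gt, img_w, img_h):
    """Illustrative MPDIoU loss for one pair of axis-aligned boxes.

    pred, gt: (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed layout).
    img_w, img_h: width and height of the input image, used to normalize
    the corner distances.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    union = area_p + area_g - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (px1 - gx1) ** 2 + (py1 - gy1) ** 2  # top-left corners
    d2_sq = (px2 - gx2) ** 2 + (py2 - gy2) ** 2  # bottom-right corners

    # Normalize by the squared image diagonal.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou
```

Unlike plain IoU loss, this form still yields a distance-dependent value when the boxes do not overlap, which is one reason such losses are reported to help convergence.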