What question did this study set out to answer?

To develop a robust segmentation and refinement framework for detecting tomato fruits and estimating their count and size under complex lighting.

January 25, 2026 | Open Access

YOLO-MSRF: A Multimodal Segmentation and Refinement Framework for Tomato Fruit Detection and Segmentation with Count and Size Estimation Under Complex Illumination

Key Points

Aimed to detect and segment tomato fruits, and to estimate their count and size, robustly under complex lighting.
Introduced a dual-branch multimodal backbone for RGB-NIR fusion.
Employed Cross-Modality Difference Complement Fusion (C-MDCF) for better feature integration.
Created a cross-scale attention network to enhance spatial context and detail representation.
Designed Multi-Scale Fusion and Semantic Refinement Network (MSF-SRNet) for improved feature alignment.
Achieved a 2.3-point increase in mAP0.5 and a 2.4-point increase in mAP0.5–0.95 on the mixed four-light validation set.
Improved mIoU by 3.60 points, indicating enhanced segmentation quality.
Maintained real-time inference at 105.07 FPS, making it practical for greenhouse applications.
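As a reminder of what the mIoU figure above measures, the following is a minimal sketch of mean intersection-over-union for label maps. It is not the study's evaluation code: the function name `mean_iou`, the flat-list input format, and the choice to skip classes absent from both maps are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred and target are flat sequences of integer class labels, one per pixel.
    (Illustrative helper; the paper's exact evaluation protocol is not given.)
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map assigns class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x4 image flattened row by row: 0 = background, 1 = tomato.
target = [0, 0, 1, 1, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 0, 0, 1, 1]
miou = mean_iou(pred, target, num_classes=2)
print(round(miou, 3))
```

In the toy example, one tomato pixel is misclassified as background, giving IoU 4/5 for background and 3/4 for tomato. A "point" in the key results is presumably one percentage point of such a score.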

Abstract

Segmentation of tomato fruits under complex lighting conditions remains technically challenging, especially in low illumination or overexposure, where RGB-only methods often suffer from blurred boundaries and missed small or occluded instances, and simple multimodal fusion cannot fully exploit complementary cues. To address these gaps, we propose YOLO-MSRF, a lightweight RGB–NIR multimodal segmentation and refinement framework for robust tomato perception in facility agriculture. First, we propose a dual-branch multimodal backbone, introduce Cross-Modality Difference Complement Fusion (C-MDCF) for difference-based complementary RGB–NIR fusion, and design C2f-DCB to reduce computation while strengthening feature extraction. Second, we develop a cross-scale attention fusion network built on MS-CPAM, which jointly models multi-scale channel and position cues, strengthening fine-grained detail representation and spatial context aggregation for small and occluded tomatoes. Finally, we design the Multi-Scale Fusion and Semantic Refinement Network (MSF-SRNet), which combines the Scale-Concatenate Fusion Module (Scale-Concat) with SDI-based cross-layer detail injection to progressively align and refine multi-scale features, improving representation quality and segmentation accuracy. Extensive experiments show that YOLO-MSRF achieves substantial gains under weak and low-light conditions, where RGB-only models are most prone to boundary degradation and missed instances. It also delivers consistent improvements on the mixed four-light validation set, increasing mAP0.5 by 2.3 points, mAP0.5–0.95 by 2.4 points, and mIoU by 3.60 points while maintaining real-time inference at 105.07 FPS. The proposed system further supports counting, size estimation, and maturity analysis of harvestable tomatoes, and can be integrated with depth sensing and yield estimation to enable real-time yield prediction in practical greenhouse operations.
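The abstract does not describe how counts and sizes are derived from the segmentation output. The sketch below shows one common approach under stated assumptions, not the authors' method: each detected fruit is a binary mask with a confidence score, the count is the number of instances above a threshold, and size is approximated by the diameter of a circle with the same area as the mask. The names `summarize`, `score_thresh`, and `mm_per_px` are hypothetical. In practice the pixel-to-millimetre scale would have to come from calibration or depth sensing.

```python
import math


def mask_area(mask):
    """Number of foreground pixels in a binary mask (list of rows of 0/1)."""
    return sum(sum(row) for row in mask)


def equivalent_diameter_px(mask):
    """Diameter, in pixels, of the circle whose area equals the mask area."""
    return 2.0 * math.sqrt(mask_area(mask) / math.pi)


def summarize(instances, score_thresh=0.5, mm_per_px=1.0):
    """Count confident instances and estimate each one's diameter in mm.

    instances: list of dicts with keys "mask" and "score" (assumed format).
    mm_per_px: assumed constant scale; a real system would derive it per
    fruit from depth and camera intrinsics.
    """
    kept = [inst for inst in instances if inst["score"] >= score_thresh]
    sizes_mm = [equivalent_diameter_px(inst["mask"]) * mm_per_px for inst in kept]
    return len(kept), sizes_mm


# Two toy detections: a confident 2x2-pixel blob and a low-confidence pixel.
instances = [
    {"score": 0.9, "mask": [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]},
    {"score": 0.3, "mask": [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]},
]
count, sizes_mm = summarize(instances, score_thresh=0.5, mm_per_px=0.5)
print(count, [round(s, 2) for s in sizes_mm])
```

The equal-area diameter tolerates slightly ragged mask boundaries better than a bounding-box width, but it underestimates the size of partially occluded fruit. That is one reason segmentation quality matters for this downstream step.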
