Segmenting tomato fruits under complex lighting remains technically challenging, especially under low illumination or overexposure, where RGB-only methods often produce blurred boundaries and miss small or occluded instances, and where simple multimodal fusion cannot fully exploit complementary cues. To address these gaps, we propose YOLO-MSRF, a lightweight RGB–NIR multimodal segmentation and refinement framework for robust tomato perception in facility agriculture. First, we build a dual-branch multimodal backbone, introduce Cross-Modality Difference Complement Fusion (C-MDCF) for difference-based complementary RGB–NIR fusion, and design C2f-DCB to reduce computation while strengthening feature extraction. Second, we develop a cross-scale attention fusion network and introduce MS-CPAM to jointly model multi-scale channel and position cues, strengthening fine-grained detail representation and spatial context aggregation for small and occluded tomatoes. Finally, we design the Multi-Scale Fusion and Semantic Refinement Network (MSF-SRNet), which combines the Scale-Concatenate Fusion Module (Scale-Concat) with SDI-based cross-layer detail injection to progressively align and refine multi-scale features, improving representation quality and segmentation accuracy. Extensive experiments show that YOLO-MSRF achieves substantial gains under weak-light and low-light conditions, where RGB-only models are most prone to boundary degradation and missed instances. It also delivers consistent improvements on the mixed four-light validation set, increasing mAP0.5 by 2.3 points, mAP0.5–0.95 by 2.4 points, and mIoU by 3.60 points while maintaining real-time inference at 105.07 FPS. The proposed system further supports counting, size estimation, and maturity analysis of harvestable tomatoes, and can be integrated with depth sensing and yield estimation to enable real-time yield prediction in practical greenhouse operations.
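For readers less familiar with the reported metrics, the following minimal sketch illustrates how mask IoU is commonly computed and how an IoU threshold of 0.5 (as in mAP0.5) decides whether a predicted instance counts as a match. It is an illustration under stated assumptions, not the evaluation code behind the numbers above: the function names `mask_iou`, `mean_iou`, and `is_match` are hypothetical, masks are assumed to be equally sized nested lists of 0/1, and the mean is taken over mask pairs, whereas the averaging used for the reported mIoU (per class or per image) is not specified here.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)


def is_match(pred, target, threshold=0.5):
    """True if the prediction overlaps the ground truth at or above the threshold."""
    return mask_iou(pred, target) >= threshold


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

print(mask_iou(pred, target))                           # 0.5
print(is_match(pred, target))                           # True at threshold 0.5
print(is_match(pred, target, threshold=0.75))           # False at a stricter threshold
print(mean_iou([(pred, target), (target, target)]))     # 0.75
```

mAP0.5–0.95 repeats the matching step over a range of thresholds, so the same prediction can count as a hit at 0.5 and a miss at stricter thresholds, which is why boundary quality affects that metric more strongly.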
Li et al. (Thu,) studied this question.