High-precision instance segmentation of tree saplings is a fundamental prerequisite for high-throughput phenotypic analysis of individual seedlings in intelligent tree breeding and precision silviculture. However, sapling segmentation in field environments remains challenging because of blurred boundaries, object adhesion, missed detections, and inaccurate mask delineation. To address these challenges, this study proposes a multimodal Mask R-CNN framework in which RGB imagery is paired with one multispectral-derived vegetation index (VI) at a time to construct separate RGB-VI input combinations, taking ginkgo saplings as a representative case. A dataset of 400 saplings was constructed using a high-throughput field phenotyping platform. The backbone network was extended with an independent vegetation index branch, and three fusion strategies (early, multi-step, and late fusion) were designed within a feature pyramid network to enable multi-scale multimodal feature integration. All multimodal models outperformed the unimodal baselines in segmentation accuracy and recall. Multi-step fusion was the best-performing strategy, and the RGB-EVI multi-step fusion model achieved the highest strict-matching precision (AP@75 = 87.7%) and recall (71.3%), with superior performance in dense sapling delineation and background suppression. These findings indicate that multimodal feature fusion can effectively improve sapling instance segmentation and provide methodological support for high-throughput plant phenotyping.
Jiang et al. studied this question.
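To make the RGB-VI pairing described in the abstract concrete, the sketch below computes a single-channel EVI map that could be fed to a separate vegetation index branch alongside the RGB image. It is an illustration only: the abstract does not describe the study's preprocessing, so the function names `evi` and `evi_map`, the use of reflectance values in [0, 1], the handling of a zero denominator, and the choice of the standard EVI coefficients (2.5, 6, 7.5, 1) are all assumptions, not the authors' implementation.

```python
# Hypothetical helpers for illustration; not the pipeline used in the study.

def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, soil=1.0):
    """Standard Enhanced Vegetation Index for one pixel.

    Inputs are assumed to be surface reflectances in [0, 1].
    """
    denom = nir + c1 * red - c2 * blue + soil
    if denom == 0:
        return 0.0  # assumption: map undefined pixels to 0
    return gain * (nir - red) / denom


def evi_map(rgb, nir):
    """Build a single-channel EVI image aligned with an RGB image.

    rgb: rows of (r, g, b) reflectance tuples; nir: rows of NIR reflectances.
    The result has the same height and width as the inputs.
    """
    if len(rgb) != len(nir) or any(len(a) != len(b) for a, b in zip(rgb, nir)):
        raise ValueError("RGB and NIR images must have the same height and width")
    return [
        [evi(n, r, b) for (r, _g, b), n in zip(rgb_row, nir_row)]
        for rgb_row, nir_row in zip(rgb, nir)
    ]


# Toy 1x2 image: a leaf-like pixel (high NIR, low red) and a soil-like pixel.
rgb = [[(0.05, 0.10, 0.03), (0.30, 0.28, 0.25)]]
nir = [[0.50, 0.32]]
vi = evi_map(rgb, nir)
```

In this toy example the leaf-like pixel receives a much higher EVI than the soil-like pixel, which is the kind of vegetation-versus-background contrast a VI branch could exploit. A real pipeline would typically use an array library and co-registered multispectral bands rather than nested lists.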