June 4, 2026 · Open Access

High-Precision Instance Segmentation of Tree Saplings by Multimodal Mask R-CNN Integrating RGB and Multispectral Image-Derived Indices Through a Field Phenotyping Platform

Key Points

This research aims to improve the instance segmentation of tree saplings for precise phenotypic analysis.
The study developed a multimodal Mask R-CNN framework that integrates RGB imagery with multispectral-derived vegetation indices.
A dataset of 400 saplings was constructed using a high-throughput field phenotyping platform.
Three fusion strategies (early, multi-step, and late fusion) were implemented for multimodal feature integration.
All multimodal models outperformed unimodal baselines in segmentation accuracy and recall, with the multi-step fusion strategy performing best.
The RGB-EVI multi-step fusion model achieved the highest strict-matching precision (AP@75 = 87.7%) and recall (71.3%).
The same model showed superior performance in delineating dense saplings and suppressing background clutter.
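
To make the strict-matching criterion behind AP@75 concrete, the following is a minimal sketch of mask matching at an IoU threshold of 0.75. It is not the study's evaluation code: the function names (`rect_mask`, `mask_iou`, `match_at_iou`), the representation of masks as sets of pixel coordinates, and the greedy matching order are all assumptions made for illustration. It also reports precision and recall for a single set of predictions only, whereas a full AP@75 score averages precision over the whole precision-recall curve.

```python
def rect_mask(top, left, height, width):
    """Hypothetical helper: a rectangular mask as a set of (row, col) pixels."""
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}


def mask_iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def match_at_iou(predictions, ground_truth, iou_threshold=0.75):
    """Greedy one-to-one matching of predicted masks to ground-truth masks.

    predictions: list of (confidence, mask) pairs.
    ground_truth: list of masks.
    Returns (precision, recall) at the given IoU threshold.
    """
    matched = set()
    true_positives = 0
    # Visit predictions from most to least confident.
    for _, pred in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth sapling is matched at most once
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Under this sketch, a predicted mask that covers only half of a sapling counts as a miss at the 0.75 threshold even though it would count as a hit at 0.5, which is why AP@75 is the stricter of the two measures.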

Abstract

The high-precision instance segmentation of tree saplings is a fundamental prerequisite for the high-throughput phenotypic analysis of individual seedlings in intelligent tree breeding and precision silviculture. However, sapling segmentation remains challenging because of blurred boundaries, object adhesion, missed detections, and inaccurate mask delineation in field environments. To address these challenges, this study proposes a multimodal Mask R-CNN framework in which RGB imagery is paired with one multispectral-derived vegetation index (VI) at a time to construct separate RGB-VI input combinations, taking ginkgo saplings as a representative case. A dataset of 400 saplings was constructed using a high-throughput field phenotyping platform. The backbone network was extended with an independent vegetation index branch, and three fusion strategies (early, multi-step, and late fusion) were designed within a feature pyramid network to enable multi-scale multimodal feature integration. The results showed that all multimodal models outperformed unimodal baselines in terms of segmentation accuracy and recall. Among them, the multi-step fusion strategy performed best, and the RGB-EVI multi-step fusion model achieved the highest strict-matching precision (AP@75 = 87.7%) and recall (71.3%), with superior performance in dense sapling delineation and background suppression. These findings indicate that multimodal feature fusion can effectively improve sapling instance segmentation and provide methodological support for high-throughput plant phenotyping.
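
As an illustration of how a vegetation index map could be derived before being paired with RGB imagery, the sketch below computes EVI per pixel. The abstract does not state which formulation was used, so this assumes the commonly cited enhanced vegetation index definition, EVI = 2.5 (NIR - Red) / (NIR + 6 Red - 7.5 Blue + 1), applied to reflectance values in the range 0 to 1. The function names `evi` and `evi_map`, the band layout as nested lists, and the zero-denominator fallback are hypothetical choices for this example.

```python
def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, soil=1.0):
    """EVI for one pixel, assuming the commonly cited coefficients.

    Inputs are assumed to be surface reflectance values in [0, 1].
    Returns 0.0 when the denominator is zero (an illustrative fallback).
    """
    denominator = nir + c1 * red - c2 * blue + soil
    if denominator == 0:
        return 0.0
    return gain * (nir - red) / denominator


def evi_map(nir_band, red_band, blue_band):
    """Per-pixel EVI for three equally sized bands given as nested lists."""
    return [
        [evi(n, r, b) for n, r, b in zip(nir_row, red_row, blue_row)]
        for nir_row, red_row, blue_row in zip(nir_band, red_band, blue_band)
    ]


# A 1 x 2 toy image: one vegetated pixel and one non-vegetated pixel.
toy_index = evi_map([[0.5, 0.3]], [[0.1, 0.3]], [[0.05, 0.1]])
```

In an RGB-VI pairing of the kind the abstract describes, a single-channel map like `toy_index` would feed the separate vegetation index branch while the RGB image feeds the main backbone. A real pipeline would more likely use an array library than nested lists.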
