What question did this study set out to answer?

This research aims to enhance individual tree crown segmentation using a novel framework that integrates context clustering and a MaskFormer decoder.

March 13, 2026 | Open Access

CrownViM: Context Clustering Meets Vision Mamba for Precise Tree Crown Segmentation in Aerial RGB Imagery

Key Points

Developed CrownViM, an architecture based on a bidirectional state space model (SSM).
Integrated a Context Clustering Vision Mamba encoder for global context modeling.
Employed a MaskFormer decoder for precise boundary predictions.
Introduced a partial-supervision loss function to minimize reliance on annotated crown masks.
Evaluated performance against existing methods on OAM-TCD and the single-tree segmentation dataset (SSD).
CrownViM achieved significant improvements in segmentation accuracy compared to CNN, ViT, and hybrid baselines.
Maintained a lightweight model profile with 39.6 million parameters.
Effectively addressed challenges in overlapping crown scenarios and complex scenes.
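
To make the "bidirectional state space model" point above more concrete, here is a toy sketch of a linear-time scan run in both directions. It is not CrownViM's encoder: the scalar state, the fixed coefficients `a` and `b`, and the function names are all illustrative assumptions, whereas Mamba-style models use learned, input-dependent parameters and multi-dimensional states.

```python
def linear_scan(xs, a=0.9, b=0.1):
    """Run the recurrence h_t = a * h_{t-1} + b * x_t in one O(n) pass.

    `a` and `b` are fixed toy coefficients (an assumption for illustration).
    """
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs, a=0.9, b=0.1):
    """Sum a forward scan and a backward scan over the same sequence.

    Each position then carries context from both its left and its right,
    while the total cost stays linear in the sequence length.
    """
    fwd = linear_scan(xs, a, b)
    bwd = linear_scan(xs[::-1], a, b)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


# An impulse in the middle of the sequence influences both neighbours.
result = bidirectional_scan([0.0, 1.0, 0.0], a=0.5, b=1.0)
```

A forward-only scan would leave the first position at zero here; the backward pass is what lets earlier positions see later input, which is the property a bidirectional encoder relies on for global context.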

Abstract

The proliferation of high-spatial-resolution remote sensing data is transforming forest attribute estimation, replacing traditional manual approaches with deep learning-based Individual Tree Crown Delineation (ITCD). Nevertheless, accurate ITCD boundary extraction from aerial RGB imagery faces persistent challenges: boundary ambiguity from complex crown occlusion in mixed forests, scarcity of high-quality annotations, and computational limitations of existing methods in dense forests. The latter manifests particularly in overlapping crown scenarios through constrained receptive fields, leading to substantial parameter requirements, computational inefficiency, and compromised accuracy. To overcome these limitations, we propose CrownViM, a novel architecture based on a bidirectional State Space Model (SSM). The framework integrates a linear-complexity Context Clustering Vision Mamba (CCViM) encoder for efficient global context modeling and employs a MaskFormer decoder for precise boundary prediction. We further introduce a partial-supervision loss function to reduce dependence on exhaustively annotated crown masks. Evaluations on OAM-TCD and the single-tree segmentation dataset (SSD) show CrownViM achieves significant segmentation accuracy improvements while maintaining a lightweight profile (39.6 M parameters). It substantially outperforms Convolutional Neural Network (CNN), Vision Transformer (ViT), and hybrid-based baselines when processing overlapping crowns and structurally complex scenes. As the first implementation of state space models in ITCD, CrownViM effectively addresses core limitations in global context capture, computational efficiency, and boundary definition. Our efficient architecture and sparse-annotation loss strategy enable high-accuracy, robust individual tree mapping, advancing tools for large-scale forest monitoring and accurate carbon stock quantification.
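
The abstract does not spell out the form of the partial-supervision loss. One common way to reduce dependence on exhaustive masks is to compute the loss only over annotated pixels, and the sketch below illustrates that idea with a masked binary cross-entropy. The function name `partial_bce`, the use of `None` for unannotated pixels, and the choice of cross-entropy are assumptions for illustration, not the authors' formulation.

```python
import math


def partial_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over annotated pixels only.

    probs:  predicted crown probabilities, one per pixel.
    labels: 1 = crown, 0 = background, None = unannotated (skipped).
    Hypothetical helper; the paper's actual loss may differ.
    """
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:
            continue  # unannotated pixels contribute nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


# The prediction on the unannotated second pixel does not affect the loss.
loss_a = partial_bce([0.5, 0.9], [1, None])
loss_b = partial_bce([0.5, 0.1], [1, None])
```

Under this scheme a tile with only a few delineated crowns still yields a training signal, and unlabeled crowns are not penalized as if they were background.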

Cite This Study

Shi et al. studied this question.

synapsesocial.com/papers/69b3abc502a1e69014ccce6c https://doi.org/10.3390/rs18060860
