What question did this study set out to answer?

This paper aims to improve open-vocabulary semantic segmentation by enhancing the decoding process through hierarchical boundary constraints. The focus is on addressing boundary inaccuracy in dense predictions.

May 20, 2026 | Open Access

CLIP-HBD: Hierarchical Boundary-Constrained Decoding for Open-Vocabulary Semantic Segmentation

Key Points

The authors propose CLIP-HBD, a network that combines a ConvNeXt-based backbone with a hierarchical adaptation mechanism.
A boundary-constrained decoding strategy refines edge details.
Features from multiple layers of the vision-language model (VLM) are fused to build multi-scale representations.
The paper reports that CLIP-HBD achieves superior segmentation precision and boundary quality.
Experiments across multiple benchmarks support its effectiveness at restoring precise object boundaries.

Abstract

Open-vocabulary semantic segmentation (OVSS) aims to achieve pixel-level object segmentation guided by arbitrary natural language descriptions. Although pre-trained vision–language models (VLMs) have significantly advanced the development of OVSS, their reliance on the Vision Transformer (ViT) architecture imposes a fundamental constraint on dense prediction. Specifically, the absence of hierarchical downsampling in ViT-based VLMs results in single-scale representations that trade spatial localization for global semantics. To address this issue, this paper proposes a hierarchical boundary-constrained decoding network for OVSS, called CLIP-HBD. Our approach leverages VLM semantic priors to reconstruct multi-scale features and introduces a boundary-constrained decoding strategy to refine edge details. In particular, CLIP-HBD uses a ConvNeXt-based backbone alongside a hierarchical adaptation mechanism to fuse multi-layer VLM features, generating a comprehensive multi-scale representation. To address boundary inaccuracy, we perform explicit boundary prediction based on the multi-scale representations; the resulting boundary maps are then transformed into structural constraints that steer the decoder’s focus toward boundary regions. By integrating structural constraints with hierarchical features, the decoding process maintains semantic consistency and restores precise object boundaries. Extensive experiments demonstrate that CLIP-HBD achieves superior performance in both segmentation precision and boundary quality across multiple benchmarks.
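The abstract does not say how boundary maps become structural constraints, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a boundary map can be approximated by marking pixels whose label differs from a 4-connected neighbor (the paper instead predicts boundaries from multi-scale features), and that the constraint is a simple per-pixel weight. The function names `boundary_map`, `boundary_weights`, and `weighted_error`, and the parameter `alpha`, are hypothetical.

```python
def boundary_map(labels):
    """Mark pixels whose label differs from any 4-connected neighbor.

    Assumption: a stand-in for the predicted boundary map described in the abstract.
    """
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < h and 0 <= nx < w
                if inside and labels[ny][nx] != labels[y][x]:
                    edges[y][x] = 1
                    break
    return edges


def boundary_weights(edges, alpha=2.0):
    """Per-pixel weights: 1.0 away from boundaries, 1.0 + alpha on them.

    Assumption: one simple way to emphasize boundary regions; alpha is hypothetical.
    """
    return [[1.0 + alpha * e for e in row] for row in edges]


def weighted_error(pred, target, weights):
    """Weighted fraction of mislabeled pixels, so boundary mistakes cost more."""
    total = sum(sum(row) for row in weights)
    wrong = sum(
        w
        for p_row, t_row, w_row in zip(pred, target, weights)
        for p, t, w in zip(p_row, t_row, w_row)
        if p != t
    )
    return wrong / total


# Toy 4x4 label map: class 0 on the left half, class 1 on the right half.
target = [[0, 0, 1, 1] for _ in range(4)]
edges = boundary_map(target)
weights = boundary_weights(edges)

# One wrong pixel next to the boundary versus one wrong pixel in the interior.
near_boundary = [row[:] for row in target]
near_boundary[0][1] = 1
interior = [row[:] for row in target]
interior[0][0] = 1

print(weighted_error(near_boundary, target, weights))
print(weighted_error(interior, target, weights))
```

Under these assumptions, a mislabeled pixel on the boundary contributes three times as much to the error as one in the interior, which is the kind of emphasis a boundary-aware decoder is meant to provide.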


Cite This Study

Wáng et al. (Fri,) studied this question.

synapsesocial.com/papers/6a0d50dcf03e14405aa9cf63 https://doi.org/10.3390/computers15050318
