Open-vocabulary semantic segmentation (OVSS) aims to segment objects at the pixel level from arbitrary natural language descriptions. Although pre-trained vision–language models (VLMs) have significantly advanced OVSS, their reliance on the Vision Transformer (ViT) architecture imposes a fundamental constraint on dense prediction: because ViT-based VLMs lack hierarchical downsampling, they produce single-scale representations that trade spatial localization for global semantics. To address this limitation, this paper proposes CLIP-HBD, a hierarchical boundary-constrained decoding network for OVSS. Our approach uses VLM semantic priors to reconstruct multi-scale features and introduces a boundary-constrained decoding strategy to refine edge details. Specifically, CLIP-HBD combines a ConvNeXt-based backbone with a hierarchical adaptation mechanism that fuses multi-layer VLM features into a comprehensive multi-scale representation. To reduce boundary inaccuracy, we explicitly predict boundaries from the multi-scale representation and transform the resulting boundary maps into structural constraints that steer the decoder’s focus toward boundary regions. By integrating these structural constraints with the hierarchical features, the decoding process maintains semantic consistency and restores precise object boundaries. Extensive experiments demonstrate that CLIP-HBD achieves superior performance in both segmentation precision and boundary quality across multiple benchmarks.
Wáng et al. (Fri,) studied this question.
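
The boundary-constrained decoding step described above can be illustrated with a minimal sketch. The abstract does not specify how boundary maps are converted into structural constraints, so the multiplicative reweighting below, the function names `boundary_weights` and `apply_boundary_constraint`, and the strength parameter `alpha` are all assumptions made for illustration, not the CLIP-HBD formulation. The idea shown is only that a predicted boundary probability map can amplify decoder features near edges while leaving interior regions unchanged.

```python
def boundary_weights(boundary_map, alpha=1.0):
    """Map boundary probabilities in [0, 1] to per-pixel weights >= 1.

    Hypothetical constraint: weight = 1 + alpha * p, so non-boundary
    pixels (p = 0) keep weight 1 and boundary pixels are emphasized.
    """
    return [[1.0 + alpha * p for p in row] for row in boundary_map]


def apply_boundary_constraint(features, boundary_map, alpha=1.0):
    """Scale every channel of a C x H x W feature map by the boundary weights."""
    weights = boundary_weights(boundary_map, alpha)
    return [
        [
            [value * weight for value, weight in zip(feature_row, weight_row)]
            for feature_row, weight_row in zip(channel, weights)
        ]
        for channel in features
    ]


# Toy input: one channel, 2 x 3 pixels, with a boundary in the middle column.
features = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
boundary = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.0]]

refined = apply_boundary_constraint(features, boundary, alpha=1.0)
```

In this toy case only the middle column is rescaled, which mirrors the stated goal of steering the decoder toward boundary regions. A learned constraint, such as attention or gating driven by the boundary map, would be an equally plausible reading of the abstract.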