In fine-grained robotic tasks (e.g., grasping and assembly), precise interaction requires a detailed understanding of object components. While Visual Language Models (VLMs) excel at object-level recognition, they struggle with part-level segmentation (e.g., segmenting a knife handle), which limits their performance in complex scenarios. VLMs face three key challenges: (1) visual granularity mismatch: object-level features lack part-level details; (2) semantic hierarchy gaps: parts and objects differ significantly in semantics; (3) cross-modal bias: CLIP’s text–image alignment favors global over local features. To address these challenges, we propose a one-stage VLM-based part segmentation method. First, the Hierarchy-Aware Feature Selection mechanism analyzes Transformer features at different hierarchy levels to enhance spatial and semantic precision for part segmentation. Second, the Multi-Hierarchy Feature Adapter bridges the object-to-part gap in feature granularity via hierarchical adaptation. Finally, the Hierarchical Multimodal Alignment Module balances classification accuracy and mask integrity via hierarchical vision–language alignment, mitigating the bias of CLIP’s object-level prior knowledge. Experiments show that the proposed method improves zero-shot part segmentation, achieving 25.86% hIoU on Pascal-Part and 13.09% hIoU on ADE20K-Part (gains of +0.81% and +2.96% hIoU over the baseline, respectively). This work advances robotic visual perception, with applications in intelligent manufacturing and intelligent services.
Yang et al. studied this question.
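The hIoU figures quoted in the abstract can be made concrete with a small sketch. It assumes that hIoU denotes the harmonic mean of the mIoU on seen and unseen part categories, as is common in zero-shot segmentation; the abstract itself does not define the metric. It also assumes the stated gains are absolute percentage points, so the baseline scores shown are only implied by the abstract, not reported in it. The function names and the 40/20 example inputs are hypothetical.

```python
def harmonic_iou(seen_miou: float, unseen_miou: float) -> float:
    """Harmonic mean of seen and unseen mIoU (both given in percent).

    Assumption: this is the definition of hIoU; the abstract does not state it.
    """
    total = seen_miou + unseen_miou
    if total == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * seen_miou * unseen_miou / total


def implied_baseline(reported: float, gain: float) -> float:
    """Baseline score implied by a reported score and an absolute gain."""
    return round(reported - gain, 2)


# (reported hIoU, gain over baseline) taken from the abstract.
reported = {
    "Pascal-Part": (25.86, 0.81),
    "ADE20K-Part": (13.09, 2.96),
}

# Implied baselines, assuming the gains are absolute percentage points.
baselines = {
    name: implied_baseline(score, gain)
    for name, (score, gain) in reported.items()
}

# Hypothetical inputs, only to show how the harmonic mean behaves:
# it is pulled toward the weaker of the two scores.
example_hiou = harmonic_iou(40.0, 20.0)

if __name__ == "__main__":
    for name, value in baselines.items():
        print(f"{name}: implied baseline hIoU = {value:.2f}%")
    print(f"hIoU(40, 20) = {example_hiou:.2f}%")
```

Because the harmonic mean penalizes imbalance between seen and unseen categories, a method cannot score well on hIoU by performing well on seen parts alone, which is presumably why it is the headline metric for the zero-shot setting.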