April 1, 2026 | Open Access

Hierarchical Compositional Alignment for Zero-Shot Part-Level Segmentation

Key points

The research aims to enhance part-level segmentation in robotic applications using visual language models.
Developed a one-stage VLM-based part segmentation method.
Implemented Hierarchy-Aware Feature Selection to analyze features at different hierarchies.
Utilized a Multi-Hierarchy Feature Adapter to bridge object-to-part feature granularity.
Created a Hierarchical Multimodal Alignment Module to improve vision-language alignment.
Achieved 25.86% hIoU on the Pascal-Part dataset in the zero-shot setting.
Achieved 13.09% hIoU on the ADE20K-Part dataset in the zero-shot setting.
Improved over the baseline by +0.81% hIoU on Pascal-Part and +2.96% hIoU on ADE20K-Part.
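
The hIoU metric is not defined on this page. In zero-shot segmentation benchmarks it commonly denotes the harmonic mean of the mean IoU over seen classes and the mean IoU over unseen classes. Assuming that convention applies here, a minimal Python sketch looks like this. The function name and the input values are illustrative and are not taken from the paper.

```python
def harmonic_iou(miou_seen: float, miou_unseen: float) -> float:
    """Harmonic mean of seen-class and unseen-class mIoU.

    Assumption: hIoU = 2 * seen * unseen / (seen + unseen), the convention
    commonly used in zero-shot segmentation. Both inputs must use the same
    unit (for example, percent).
    """
    if miou_seen < 0 or miou_unseen < 0:
        raise ValueError("mIoU values must be non-negative")
    total = miou_seen + miou_unseen
    if total == 0:
        # Both scores are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2.0 * miou_seen * miou_unseen / total


# Hypothetical inputs for illustration only (not results from the paper).
example = harmonic_iou(40.0, 20.0)
print(f"hIoU = {example:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on hIoU by performing well on seen part classes alone.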

Abstract

In fine-grained robotic tasks (e.g., grasping and assembly), precise interaction requires a detailed understanding of object components. While Visual Language Models (VLMs) excel at object-level recognition, they struggle with part-level segmentation (e.g., knife handles), which limits performance in complex scenarios. VLMs face three key challenges: (1) visual granularity mismatch, where object-level features lack part-level details; (2) semantic hierarchy gaps, where parts and objects differ significantly in semantics; (3) cross-modal bias, where CLIP’s text–image alignment favors global over local features. To address these, we propose a one-stage VLM-based part segmentation method. First, the Hierarchy-Aware Feature Selection mechanism analyzes Transformer features at different hierarchies to enhance spatial and semantic precision for part segmentation. Second, the Multi-Hierarchy Feature Adapter bridges object-to-part feature granularity via hierarchical adaptation. Finally, the Hierarchical Multimodal Alignment Module harmonizes classification accuracy and mask integrity via hierarchical vision–language alignment, mitigating the bias of CLIP’s object-level prior knowledge. Experiments show that the proposed method improves zero-shot part segmentation, achieving 25.86% on Pascal-Part and 13.09% on ADE20K-Part (gains of +0.81% hIoU and +2.96% hIoU over the baseline, respectively). This work advances robotic visual perception, with applications in intelligent manufacturing and intelligent services.
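
The third challenge in the abstract, alignment that favors global over local features, can be illustrated with a toy example. The sketch below is not the authors' Hierarchical Multimodal Alignment Module and does not call CLIP. It assumes hand-made 2-D vectors in place of real patch and text embeddings, and every name in it is hypothetical. It compares classifying each patch against the text embeddings with classifying a single pooled image feature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_patches(patch_feats, text_embeds):
    """Assign each patch the index of its most similar text embedding."""
    labels = range(len(text_embeds))
    return [max(labels, key=lambda k: cosine(p, text_embeds[k]))
            for p in patch_feats]


def classify_global(patch_feats, text_embeds):
    """Average-pool all patches, then pick the single best text embedding."""
    dim = len(patch_feats[0])
    pooled = [sum(p[d] for p in patch_feats) / len(patch_feats)
              for d in range(dim)]
    return max(range(len(text_embeds)),
               key=lambda k: cosine(pooled, text_embeds[k]))


# Toy embeddings (hypothetical): label 0 = "blade", label 1 = "handle".
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
# Three blade-like patches and one handle-like patch.
patches = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.1, 0.9]]

per_patch = classify_patches(patches, text_embeds)
global_label = classify_global(patches, text_embeds)
print(per_patch, global_label)
```

In this toy setup the pooled feature is dominated by the majority of blade-like patches, so the global decision assigns a single label and the handle patch is only recovered by the patch-level comparison.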
