What question did this study set out to answer?

The aim is to enhance zero-shot semantic segmentation in unstructured environments using EVA-CLIP model features.

June 28, 2026 | Open Access

Cross-domain zero-shot semantic segmentation for unstructured environments via EVA-CLIP model, ensemble prompt engineering, and optimized text-image matching

Key points

Aims to improve zero-shot semantic segmentation in unstructured environments using EVA-CLIP model features.
Applies deep prompt tuning to adapt the EVA-CLIP image encoder's visual features to unstructured terrain.
Develops an ensemble prompt engineering scheme tailored to unstructured environments.
Optimizes text-image matching by integrating global and local representations from the text and image encoders.
Improves mIoU by 1.2% to 43.9% over state-of-the-art methods on the Robot Unstructured Ground Driving (RUGD) benchmark.
Shows that cross-domain zero-shot performance on the Rellis-3D dataset is competitive with supervised fine-tuning approaches.
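
The mIoU figures above refer to mean intersection-over-union, the standard metric for semantic segmentation. As a minimal sketch of how that metric is commonly computed (not the authors' evaluation code; the function name `mean_iou`, the flat label lists, and the `ignore_index` parameter are assumptions made for illustration):

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class labels, one per pixel.
    Pixels whose target label equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # True positive: counts once toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # False positive for class p, false negative for class t.
            union[p] += 1
            union[t] += 1
    # Average only over classes with a non-empty union.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.5
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so their mean is 0.5. Benchmarks differ in details such as which labels are ignored, so the exact protocol used for RUGD and Rellis-3D may not match this sketch.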

Abstract

Semantic segmentation provides essential scene understanding for unmanned ground vehicles, allowing them to identify obstacles and plan paths in unstructured environments. However, existing methods for these settings typically require linear probing or fine-tuning to handle novel scenarios, and therefore lack zero-shot transferability. To address this limitation, our study introduces a framework for robust zero-shot transfer in unstructured domains that builds on the strong visual-linguistic alignment of the EVA-CLIP architecture. To improve segmentation accuracy, we first apply deep prompt tuning to adapt the visual features extracted by the EVA-CLIP image encoder to unstructured terrain. This strategy improves adaptability to irregular environments while preserving the zero-shot capability of the underlying model. In addition, we devise an ensemble prompt engineering scheme customized for unstructured settings to further improve segmentation results. The framework also optimizes the correspondence between text and images by integrating global and local representations from the respective encoders, strengthening cross-modal alignment. Empirical evaluations indicate that our method surpasses current state-of-the-art techniques, increasing mIoU by 1.2% to 43.9% on the Robot Unstructured Ground Driving (RUGD) benchmark. Furthermore, evaluations on the Rellis-3D dataset show that the model’s cross-domain zero-shot performance rivals that of supervised fine-tuning approaches, demonstrating robust generalization to previously unseen semantic classes.
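
The abstract does not give implementation details, so the following is only an illustrative sketch of two of the ideas it names: prompt ensembling and text-image matching that mixes global and local image features. Everything here is an assumption for illustration: the function names, the toy 2-D vectors standing in for encoder outputs, and the weighted-sum rule with weight `alpha`. Averaging normalized prompt embeddings is the common CLIP-style way to ensemble prompts; the paper's actual scheme and matching rule may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ensemble_embedding(prompt_embeddings):
    """Average the normalized embeddings of several prompts for one class,
    then renormalize. Inputs stand in for text-encoder outputs."""
    normalized = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[d] for e in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def segment(pixel_features, global_feature, class_embeddings, alpha=0.5):
    """Assign each pixel the class with the highest combined score.

    Hypothetical scoring rule (not taken from the paper):
        score = alpha * cos(local, text) + (1 - alpha) * cos(global, text)
    """
    global_scores = [cosine(global_feature, c) for c in class_embeddings]
    labels = []
    for feat in pixel_features:
        scores = [
            alpha * cosine(feat, c) + (1 - alpha) * g
            for c, g in zip(class_embeddings, global_scores)
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy stand-ins for embeddings of two prompts per class ("grass", "rock").
grass = ensemble_embedding([[1.0, 0.1], [0.9, -0.1]])
rock = ensemble_embedding([[0.1, 1.0], [-0.1, 0.8]])

pixels = [[2.0, 0.1], [0.2, 3.0], [1.0, 0.2]]  # per-pixel (local) features
image = [1.0, 1.0]                             # image-level (global) feature
print(segment(pixels, image, [grass, rock], alpha=0.8))  # [0, 1, 0]
```

In a real pipeline the vectors would come from the EVA-CLIP text and image encoders, and the per-pixel labels would be reshaped into a segmentation map.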
