Semantic segmentation gives unmanned ground vehicles the scene understanding they need to identify obstacles and plan paths in unstructured environments. However, existing methods for these settings typically require linear probing or fine-tuning to adapt to new scenarios, and therefore lack zero-shot transferability. To address this limitation, we propose a framework for robust zero-shot transfer in unstructured domains that builds on the strong vision-language alignment of the EVA-CLIP architecture. To improve segmentation accuracy, we first apply deep prompt tuning to adapt the feature extraction of the EVA-CLIP image encoder to unstructured terrain. This improves adaptability to irregular environments while preserving the zero-shot capability of the underlying model. We also design an ensemble prompt engineering scheme tailored to unstructured settings to further improve segmentation results. In addition, the framework strengthens the correspondence between text and images by integrating global and local representations from the respective encoders, improving cross-modal alignment. Experiments show that our method outperforms current state-of-the-art techniques, with mIoU gains ranging from 1.2% to 43.9% on the Robot Unstructured Ground Driving (RUGD) benchmark. Evaluations on the Rellis-3D dataset further show that the model's cross-domain zero-shot performance rivals that of supervised fine-tuning approaches, demonstrating robust generalization to previously unseen semantic classes.
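For readers unfamiliar with the metric reported above, mean intersection over union (mIoU) can be sketched as follows. This is a minimal illustration on flattened label lists, not the evaluation code used in the study; the function name `mean_iou`, the `ignore_index` convention, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made here for illustration.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = -1) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    `pred` and `target` are flattened per-pixel class ids. Pixels whose
    ground-truth label equals `ignore_index` are excluded (an assumed
    convention, hypothetical for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        # |A ∪ B| = |A| + |B| - |A ∩ B|
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both pred and target
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so their mean is about 0.5. Benchmark toolkits differ in how they treat absent classes and ignored labels, so reported numbers depend on those conventions.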
Zhou et al. studied this question.