Contrastive multimodal transformers for zero-shot cross-domain vision-language retrieval align heterogeneous modalities so that semantically related content can be retrieved without explicit training on the target domain. This paradigm makes retrieval models more adaptable across diverse visual and textual datasets. However, existing methods often generalize poorly across domains because of distributional shifts between the training domains and unseen ones. Their alignment quality is also limited, since visual and textual embeddings fail to capture domain-specific semantics effectively. To address these challenges, we propose the Contrastive Multimodal Transformer with Domain-Adaptive Pretraining (CMT-DAP). The framework combines multimodal transformers with a domain-adaptive contrastive learning stage, in which large-scale unlabeled image–text pairs from multiple domains are used to learn domain-invariant embeddings. This stage is designed to produce robust semantic alignment across modalities while improving zero-shot generalization. The method can be applied in areas such as medical image–report retrieval, cross-lingual multimedia search, and e-commerce product–review alignment, where it enables retrieval of accurate and semantically relevant results even when the target domain differs from the training domain. Experimental findings show that CMT-DAP outperforms existing approaches in retrieval accuracy, robustness to domain shifts, and semantic consistency, making it a promising solution for cross-domain multimodal applications.
Dewangan et al. (Thu,) studied this question.
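
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of what a contrastive alignment stage of this kind could look like, assuming a symmetric InfoNCE loss over a batch of paired image and text embeddings. The function names, the temperature value of 0.07, and the toy embeddings are illustrative assumptions and are not taken from CMT-DAP.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Assumed objective: symmetric InfoNCE. Pair k (image k, text k) is the
    # positive; every other pairing in the batch serves as a negative.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def retrieve(query_emb, candidate_embs):
    # Zero-shot retrieval: rank candidate indices by cosine similarity.
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(c)))
              for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)

# Toy 2-D embeddings (hypothetical values, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
misaligned_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(images, aligned_texts)
misaligned_loss = symmetric_contrastive_loss(images, misaligned_texts)
ranking = retrieve([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch, correctly paired embeddings give a lower loss than mismatched ones, and retrieval reduces to ranking candidates by cosine similarity in the shared space. In a domain-adaptive setting, batches would presumably mix image–text pairs from several domains so that the learned space does not depend on any single one.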