What question did this study set out to answer?

The aim is to create a unified framework for retrieving images and videos from a single text query while addressing cross-modal alignment challenges.

April 1, 2026 | Open Access

UniTriM: Unified Text–Image–Video Retrieval via Multi-Granular Alignment and Feature Disentanglement

Key Points

Introduces UniTriM, a framework that retrieves images and videos concurrently from a single text input.
Develops a self-attention-based keyframe selection strategy that converts existing text–video datasets into triplet format.
Designs a multi-granularity similarity alignment module that captures hierarchical semantics across patch–frame–video and word–triple–sentence structures.
Adds a feature disentanglement module that separates spatial-related features from text and aligns them explicitly with image representations.
Achieves state-of-the-art performance on joint retrieval tasks on MSR-VTT, MSVD, and DiDeMo.
Addresses fine-grained semantic alignment and the spatiotemporal information imbalance between static images and video-aligned text descriptions.

Abstract

With the proliferation of multimodal content on social media, creators increasingly require tools that can retrieve both images and videos relevant to a single textual query. However, existing cross-modal retrieval methods are typically confined to binary (text–image or text–video) settings and struggle with fine-grained semantic alignment and spatiotemporal information imbalance. To address these issues, we propose UniTriM, a unified framework for text–image–video joint retrieval. First, UniTriM supports concurrent retrieval of semantically relevant images and videos given one textual input. To overcome the scarcity of text–image–video triplet data, we introduce a self-attention-based keyframe selection strategy that converts existing text–video datasets into triplet format. Second, we design a multi-granularity similarity alignment module that captures hierarchical semantics by modeling patch–frame–video and word–triple–sentence structures, and that jointly optimizes intra- and cross-granularity alignments to enhance fine-grained cross-modal correspondence. Third, to alleviate the inherent spatiotemporal information imbalance between static images and video-aligned text descriptions, we introduce a feature disentanglement module that separates spatial-related features from the text and aligns them explicitly with image representations. Experiments on three benchmark datasets (MSR-VTT, MSVD, and DiDeMo) demonstrate that UniTriM achieves state-of-the-art performance on joint retrieval tasks.
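The abstract does not spell out how the self-attention-based keyframe selection scores frames. As a minimal sketch of one plausible reading, the snippet below runs scaled dot-product self-attention over per-frame feature vectors and picks the frame that receives the most attention on average; that frame could then serve as the image in a text–image–video triplet. The function name `select_keyframe`, the use of raw features as both queries and keys (no learned projections), and the column-mean scoring rule are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def select_keyframe(frame_features):
    """Return (index, scores) for the frame that receives the most self-attention.

    frame_features: list of equal-length float vectors, one per video frame.
    Assumption: queries and keys are the raw features (no learned projections),
    and a frame's score is the mean attention it receives from all frames.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    scale = math.sqrt(len(frame_features[0]))

    # Row i holds the attention distribution of frame i over all frames.
    attention = []
    for query in frame_features:
        logits = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in frame_features
        ]
        attention.append(softmax(logits))

    # Column mean: how much attention frame j receives on average.
    scores = [sum(attention[i][j] for i in range(n)) / n for j in range(n)]
    best = max(range(n), key=scores.__getitem__)
    return best, scores


# Toy example: frames 0 and 1 are near-duplicates, frame 2 is an outlier.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
best, scores = select_keyframe(frames)
print(best, [round(s, 3) for s in scores])
```

Under this scoring rule, a frame that resembles many other frames accumulates attention and is treated as representative of the clip. A real pipeline would obtain the frame features from a visual encoder and would likely use learned query and key projections.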
