With the proliferation of multimodal content on social media, creators increasingly need tools that can retrieve both images and videos relevant to a single textual query. However, existing cross-modal retrieval methods are typically confined to two-modality (text–image or text–video) settings and struggle with fine-grained semantic alignment and spatiotemporal information imbalance. To address these limitations, we propose UniTriM, a unified framework for text–image–video joint retrieval. First, UniTriM supports concurrent retrieval of semantically relevant images and videos from one textual input. To overcome the scarcity of text–image–video triplet data, we introduce a self-attention-based keyframe selection strategy that converts existing text–video datasets into triplet format. Second, we design a multi-granularity similarity alignment module that captures hierarchical semantics by modeling patch–frame–video and word–triple–sentence structures, and that jointly optimizes intra- and cross-granularity alignments to enhance fine-grained cross-modal correspondence. Third, to alleviate the inherent spatiotemporal information imbalance between static images and video-aligned text descriptions, we introduce a feature disentanglement module that separates spatial features from the text representation and aligns them explicitly with image representations. Experiments on three benchmark datasets (MSR-VTT, MSVD, and DiDeMo) demonstrate that UniTriM achieves state-of-the-art performance on joint retrieval tasks.
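To make the triplet-construction idea concrete, the following is a minimal sketch of how a self-attention-based keyframe selection step might look. It is an illustration under stated assumptions, not the UniTriM implementation: the function name `select_keyframe`, the use of parameter-free scaled dot-product attention over precomputed frame features, and the rule of scoring each frame by the average attention it receives are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def select_keyframe(frame_feats: np.ndarray) -> int:
    """Return the index of the frame that receives the most self-attention.

    frame_feats: array of shape (T, D), one feature vector per video frame.
    Assumption: frames attend to each other with plain scaled dot-product
    attention (no learned projections); the real method may differ.
    """
    _, d = frame_feats.shape
    # Pairwise frame similarities, scaled as in standard attention.
    logits = frame_feats @ frame_feats.T / np.sqrt(d)      # shape (T, T)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Score each frame by the mean attention it receives from all frames.
    scores = attn.mean(axis=0)                              # shape (T,)
    return int(scores.argmax())


# Toy video: three similar frames and one outlier frame.
frames = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
k = select_keyframe(frames)
# A (text, image, video) triplet would then pair the caption with frames[k]
# and the full clip; the caption and clip come from the text-video dataset.
```

Under this scoring rule the selected frame is one that is representative of the clip as a whole, so the outlier frame in the toy example is not chosen.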
Wang et al. (Mon,) studied this question.