Introduction Visual grounding aims to localize target objects in images from textual descriptions, with broad applications in fields such as autonomous driving and human-robot interaction. However, existing visual grounding models still face three major challenges: (1) most prior works use separate encoders to process images and text independently, which enlarges the semantic gap between visual and textual features; (2) the use of large language models leads to excessive parameter counts, making deployment on lightweight devices difficult; (3) single-level cross-modal attention mechanisms are insufficient for fully capturing interactions across modalities.
Methods To address these issues, this paper proposes a Task-aware Liquid Cross-modal Network (TLCN), which consists of four key modules: a Feature Extraction Module (FEM), a Liquid Fusion Module (LFM), a Task-aware Cross-modal Refinement Module (TCRM), and a Multilevel Grounding Module (MGM). Specifically, the FEM uses textual features to guide the extraction of visual features, thereby reducing the feature gap. The LFM employs Liquid Neural Networks (LNNs) to capture temporal dependencies and significantly reduce the number of model parameters. The TCRM deepens the textual representation via a second-level attention mechanism, while purpose-designed Conv-Trans Blocks (CTBs) are applied to the image data to extract deeper visual features. Additionally, a similarity loss function based on KL divergence is introduced to optimize cross-modal alignment.
Results The proposed model is evaluated on three widely used public benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. In addition, a specialized text localization task is designed for further evaluation. Experimental results show that TLCN achieves superior performance across all evaluated datasets and tasks.
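The abstract does not give the exact form of the KL-divergence similarity loss, so the following is only a minimal sketch of one plausible reading: normalize text-side and visual-side similarity scores with a softmax and penalize the KL divergence between the two resulting distributions. The function names (`softmax`, `kl_divergence`, `similarity_loss`), the choice of the text distribution as the reference, and the `eps` smoothing term are all assumptions made for illustration, not the authors' definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is a small smoothing constant (an assumption) that avoids log(0).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def similarity_loss(visual_scores, text_scores):
    """Hypothetical alignment loss: KL between softmax-normalized scores.

    The text-side distribution is treated as the reference here; the
    source does not say which direction the paper uses.
    """
    p = softmax(text_scores)
    q = softmax(visual_scores)
    return kl_divergence(p, q)


if __name__ == "__main__":
    # Identical score vectors are perfectly aligned, so the loss is zero.
    print(similarity_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    # Reversed score vectors disagree, so the loss is positive.
    print(similarity_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

Under this reading, the loss is zero when the two modalities rank candidate regions identically and grows as their score distributions diverge, which is the behavior an alignment term would need.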
Discussion The superior performance of TLCN supports the effectiveness of its design choices: text-guided visual extraction bridges the semantic gap, the introduction of LNNs reduces the parameter count for lightweight deployment, and the second-level attention combined with CTBs captures deeper cross-modal interactions. These findings suggest that TLCN offers a promising, efficient, and lightweight solution for visual grounding and related localization tasks.
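To give a sense of why liquid networks can be compact, here is a minimal single-neuron sketch of a liquid time-constant (LTC) style update, one common formulation of LNNs. The abstract does not state which LNN variant the LFM uses, so the update rule, the scalar state, and every parameter name below (`tau`, `A`, `w_in`, `w_rec`, `b`, `dt`) are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function, used here as the input-dependent gate."""
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, inp, dt=0.1, tau=1.0, A=1.0, w_in=1.0, w_rec=0.5, b=0.0):
    """One fused (semi-implicit) Euler step of a scalar LTC-style neuron.

    Models dx/dt = -(1/tau + f) * x + f * A, where the gate f depends on
    both the input and the current state, so the effective time constant
    changes with the input.
    """
    f = sigmoid(w_in * inp + w_rec * x + b)
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))


def run_ltc(inputs, x0=0.0, **params):
    """Unroll the neuron over an input sequence and return all states."""
    x = x0
    states = []
    for inp in inputs:
        x = ltc_step(x, inp, **params)
        states.append(x)
    return states


if __name__ == "__main__":
    # Constant input: the state moves from x0 toward a bounded equilibrium.
    states = run_ltc([1.0] * 50)
    print(states[0], states[-1])
```

In this sketch each neuron carries only a handful of scalars, and the state stays bounded between 0 and `A` when it starts in that range, which illustrates the kind of parameter economy the discussion attributes to LNNs.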
Li et al. (Fri,) studied this question.