What question did this study set out to answer?

This research aims to systematically evaluate the robustness of composed image retrieval methods against common corruptions and variations in text descriptions.

April 20, 2026 · Open Access

BenchCIR: Benchmarking robustness in composed image retrieval across modalities

Key points

Introduced three new benchmark datasets: CIRR-C, FashionIQ-C, and CIRR-D.
Evaluated robustness to 75 visual corruptions and 35 textual corruptions.
Developed BenchCIR as an open-source testbed for model evaluation.
Benchmarked ten existing composed image retrieval models.
Revealed insights into how visual and textual modalities contribute to model robustness.
Identified gaps in robustness under different types of textual variations.
Showed performance differences among various models when exposed to common corruptions.

Abstract

Composed image retrieval aims to retrieve images based on a query that consists of a reference image and text describing desired modifications to that image. It has recently attracted attention for its ability to tailor image retrieval to user intentions by combining information-rich reference images with concise natural language instructions. Despite this success, the robustness of composed image retrieval methods to either (1) common corruptions or (2) variations in the textual descriptions has never been systematically evaluated. In this paper, we perform the first robustness study of composed image retrieval, establishing three new benchmarks for a systematic evaluation of robustness to common corruptions (in both the textual and visual domains) and robustness in text understanding. For the analysis of natural image corruption, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C, for the open and fashion domains respectively, both of which feature 75 visual corruptions and 35 textual corruptions. To facilitate robust evaluation of text understanding, we introduce a new diagnostic dataset, CIRR-D, by expanding the CIRR dataset with synthetic data, specifically probing text understanding across four kinds of variation: numerical, attribute, object removal, and background. We introduce BenchCIR, a testbed for evaluating composed image retrieval model robustness with standardized evaluation protocols. By benchmarking ten published models in the testbed, we reveal insights into how the composition of visual and textual modalities affects model robustness. The code is available at https://suntongtongtong.github.io/BenchCIR/.

• First robustness study of composed image retrieval under multimodal corruption.
• Four categories of textual variation were introduced for systematic robustness analysis.
• New large-scale benchmarks: FashionIQ-C, CIRR-C, and diagnostic CIRR-D dataset.
• BenchCIR: an open-source testbed for robust evaluation of retrieval models.
• Extensive experiments reveal modality-specific contributions to model robustness.
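
To make the evaluation idea concrete, the following is a minimal sketch of how robustness to a textual corruption might be measured. It is not the BenchCIR API. The function names, the adjacent-character-swap corruption, the toy rankings, and the "relative robustness" summary (one common convention in robustness benchmarks) are all assumptions made for illustration, and the abstract does not state which metrics the testbed actually uses.

```python
import random


def swap_adjacent_chars(text, rate, seed=0):
    """Hypothetical character-level corruption: swap neighbouring letters."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most one place
        else:
            i += 1
    return "".join(chars)


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target image id appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


def relative_robustness(clean_score, corrupted_score):
    """1.0 means no degradation; lower means a larger drop relative to clean."""
    if clean_score == 0:
        raise ValueError("clean score must be non-zero")
    return 1.0 - (clean_score - corrupted_score) / clean_score


# Toy data: in a real run these rankings would come from a retrieval model
# queried with (reference image, clean text) and (reference image, corrupted text).
targets = ["img_3", "img_7", "img_1", "img_9"]
clean_rankings = [["img_3", "img_2"], ["img_7", "img_4"],
                  ["img_5", "img_1"], ["img_9", "img_0"]]
corrupted_rankings = [["img_3", "img_2"], ["img_4", "img_7"],
                      ["img_5", "img_6"], ["img_0", "img_9"]]

clean_r1 = recall_at_k(clean_rankings, targets, k=1)          # 0.75
corrupted_r1 = recall_at_k(corrupted_rankings, targets, k=1)  # 0.25
print(round(relative_robustness(clean_r1, corrupted_r1), 3))  # 0.333
print(swap_adjacent_chars("make the dress red", rate=0.3))
```

Reporting the corrupted score relative to the clean score, rather than as a raw drop, keeps models with different clean baselines comparable. That is the reason for the normalization in this sketch.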
