Composed image retrieval aims to retrieve images based on a query that consists of a reference image and text describing desired modifications to that image. It has recently attracted attention for its ability to tailor image retrieval to user intentions by combining information-rich reference images with concise natural language instructions. Despite its current success, the robustness of composed image retrieval methods to either (1) common corruptions or (2) variations in the textual descriptions has never been systematically evaluated. In this paper, we perform the first robustness study of composed image retrieval, establishing three new benchmarks for a systematic evaluation of robustness to common corruptions (in both the textual and visual domains) and robustness in text understanding. For analysis of natural image corruption, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C, for the open and fashion domains, respectively, both of which feature 75 visual corruptions and 35 textual corruptions. To facilitate robust evaluation of text understanding, we introduce a new diagnostic dataset, CIRR-D, by expanding the CIRR dataset with synthetic data, specifically probing text understanding across four kinds of variation: numerical, attribute, object removal, and background. We introduce BenchCIR, a testbed for evaluating composed image retrieval model robustness with standardized evaluation protocols. Through benchmarking ten published models in the testbed, we reveal insights into how the composition of visual and textual modalities affects model robustness. The code is available at https://suntongtongtong.github.io/BenchCIR/
• First robustness study of composed image retrieval under multimodal corruption.
• Four categories of textual variation introduced for systematic robustness analysis.
• New large-scale benchmarks: FashionIQ-C, CIRR-C, and the diagnostic CIRR-D dataset.
• BenchCIR: an open-source testbed for robust evaluation of retrieval models.
• Extensive experiments reveal modality-specific contributions to model robustness.
Sun et al. (Wed,) studied this question.