Chat-based image retrieval uses Large Language Models (LLMs) to guide user input toward more specific and precise search results: the LLM asks the user retrieval-oriented questions that elicit additional details about the target image. Despite the potential of this approach, no specialized Questioner model has been developed for this task, owing to three significant challenges: (a) the difficulty of determining the optimal questions to ask; (b) the lack of a suitable protocol for fair model comparison; and (c) the scarcity of dialog-to-image retrieval data. To address these challenges, this paper develops two fundamental principles that keep the generated questions simple and effective while enabling fair comparison and accurate estimation of data quality and model performance. A bootstrap training methodology is introduced to collect retrieval-oriented dialog data while concurrently training the Questioner and the image Retriever. Under a fair comparison protocol, extensive experiments demonstrate that the proposed method not only addresses the critical data gap but also achieves state-of-the-art results, substantially surpassing GPT-4o and GPT-4-Turbo with a fine-tuned 8B model.
Chen et al. studied this question.