Clarifying questions enable conversational search (CS) systems to resolve underspecified queries by eliciting missing information from users. However, how prompting strategies shape the quality of clarifying questions, and how such questions should be evaluated at scale, remain understudied. We present AGENT-CQ (Automatic GENeration and evaluaTion of Clarifying Questions), a framework for systematically generating and evaluating clarifying questions and simulated user responses using large language models (LLMs). To support scalable, multi-perspective evaluation, we introduce CrowdLLM, an LLM-based evaluation paradigm that simulates diverse annotator judgments through distinct evaluator personas. Our experiments span both open-domain conversational search and a regulatory question-answering setting, allowing us to examine the extent to which clarification strategies generalize across domains with different interaction constraints. Across settings, temperature-variation prompting yields higher-quality clarifying questions than both baseline prompting and human-authored questions on several dimensions of the task. In addition, LLM-generated clarifying questions lead to better downstream retrieval performance than human-authored questions in open-domain search. Together, AGENT-CQ and CrowdLLM provide a practical framework for studying and improving clarification strategies in conversational IR systems.
Clemencia Siro
Yifei Yuan
Mohammad Aliannejadi
ACM Transactions on Information Systems
University of Copenhagen
University of Amsterdam
ETH Zurich
Siro et al. studied this question.
www.synapsesocial.com/papers/69e1cf625cdc762e9d858446 — DOI: https://doi.org/10.1145/3809182