Conversational search (CS) addresses users’ information needs through multi-turn, context-aware interactions. Because user queries are often ambiguous, clarifying questions can effectively reduce uncertainty and enable a mixed-initiative conversational system. However, current datasets for clarifying questions remain limited in three respects: (1) multi-turn conversational data are underrepresented, (2) diversity is limited, and (3) they rely heavily on crowdsourcing and therefore suffer from limitations such as high annotation cost. To address these issues, we propose an LLM-based three-stage framework that builds on an existing community question answering (CQA) dataset. It comprises: (1) extracting essential information from the initial user query together with the relevant contextual information, (2) generating clarifying questions paired with corresponding answers, and (3) refining conversations to ensure coherence and a natural conversational flow. We compare our multi-stage method against a baseline that directly prompts LLMs to generate conversations in a single step, evaluating on an answer retrieval task using recall, precision, nDCG, and MAP. Results show that our three-stage generation approach consistently outperforms the baseline, particularly in recall, while also achieving competitive results on the other metrics. Human and automatic evaluations further indicate that the generated conversations are of high quality, and fine-tuning on them improves retrieval performance, highlighting the pipeline's potential.
Lu et al. studied this question.
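To make the four retrieval metrics named above concrete, the following is a minimal sketch of how recall, precision, nDCG, and average precision (whose mean over queries gives MAP) can be computed for one ranked list of candidate answers. It assumes binary relevance labels and a fixed cutoff k; the cutoff values, the relevance definition, and all function and variable names here are illustrative assumptions, not the evaluation code used in this work.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved answers that are relevant (assumes k > 0)."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant answers that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant answer occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains (an assumption): DCG divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_average_precision(
    runs: Sequence[Sequence[str]], qrels: Sequence[Set[str]]
) -> float:
    """MAP: average precision averaged over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r, q) for r, q in zip(runs, qrels)) / len(runs)
```

For example, with the hypothetical ranking `["a3", "a1", "a7", "a2"]` and relevant set `{"a1", "a2"}`, `recall_at_k(..., k=3)` counts one of the two relevant answers in the top three, while `average_precision` rewards the relevant answers for appearing at ranks 2 and 4.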