Large language models are prone to logical confusion and often fail to capture implicit relationships when handling complex reasoning tasks. To address these problems, this paper proposes and constructs a high-quality multi-type reasoning question answering dataset (MTR-QA). Civil service examination questions from the past 15 years and authoritative mock question banks were collected and sorted, then cleaned with a multi-stage text processing framework comprising text standardisation, hash deduplication, near-duplicate detection and semantic embedding similarity filtering, which effectively reduces redundancy and noise. To ensure data reliability, this paper designs a multi-model evaluation mechanism integrating GPT-4, DeepSeek-V1 and Qwen-2.5, which quantitatively evaluates the data along four dimensions: integrity (CPL), accuracy (ACC), security (SFC) and chain-of-thought quality (CoT-Q). In the end, 24,312 high-quality entries were selected and stored in JSON format, with a total size of 34.1 MB. Each sample contains six attributes (question, options, answer, chain of thought, type and difficulty level) and belongs to one of four core reasoning types: logical, semantic, mathematical and comprehensive knowledge reasoning. MTR-QA expands the coverage of reasoning types and topic breadth, providing a reliable data foundation for tasks such as large language model pre-training, supervised fine-tuning and model evaluation, and promoting improved performance of large language models in complex reasoning and question answering scenarios.
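The first three cleaning stages named above (text standardisation, hash deduplication and near-duplicate detection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The field names, the character 3-gram Jaccard measure and the 0.8 threshold are all assumptions chosen for illustration, and the semantic embedding similarity stage is omitted because the abstract does not name the embedding model.

```python
import hashlib
import re
import unicodedata


def normalise(text: str) -> str:
    """Standardise text: Unicode NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams used as a cheap near-duplicate signature."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(records, threshold=0.8):
    """Keep the first occurrence of each question.

    Exact duplicates are dropped by SHA-256 hash of the normalised text;
    near-duplicates are dropped when shingle Jaccard >= threshold
    (the threshold value is an assumption, not taken from the paper).
    """
    seen_hashes = set()
    kept, kept_signatures = [], []
    for rec in records:
        norm = normalise(rec["question"])
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        signature = shingles(norm)
        if any(jaccard(signature, s) >= threshold for s in kept_signatures):
            continue
        seen_hashes.add(digest)
        kept.append(rec)
        kept_signatures.append(signature)
    return kept


# Hypothetical records using the six attributes listed in the abstract;
# the key names and values are illustrative only.
def make_record(question, rtype):
    return {
        "question": question,
        "options": ["A", "B", "C", "D"],
        "answer": "A",
        "chain_of_thought": "...",
        "type": rtype,
        "difficulty": "medium",
    }


records = [
    make_record("If all A are B and all B are C, are all A C?", "logic"),
    make_record("If all A are B  and all B are C, ARE all A C?", "logic"),
    make_record("If all A are B and all B are C, are all A C ?", "logic"),
    make_record("What is 12 multiplied by 8?", "mathematics"),
]
kept = deduplicate(records)
```

The pairwise comparison in `deduplicate` is quadratic in the number of kept records, which is acceptable for a small demonstration; at the scale of tens of thousands of entries, an approximate scheme such as MinHash with locality-sensitive hashing would be a more practical choice.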
WANG et al. (Sun,) studied this question.