What question did this study set out to answer?

The research aims to improve question-answering capabilities within specialized domains by developing a two-stage framework for contextual inquiry.

March 2, 2026 (Open Access)

Learning to ask and answer in specialized documents: Exemplifying through modular integrated construction regulatory documents

Key Points

The framework has two stages: 'learning to ask' and 'learning to answer'.
In the first stage, fine-tuned large language models generate question-answer pairs from regulatory document contexts.
In the second stage, the study compares contextual question-answering paradigms, including fine-tuning, retrieval-augmented generation, and proprietary LLMs.
Synthetic data generation often mirrors the training distribution, so effective filtering is necessary.
With appropriate models, synthetic data retains 90–100% of the performance achieved with original data.
Domain-specific, fine-tuned models achieve the best performance, underscoring the importance of tailored adaptation.

Abstract

Large language models (LLMs) perform well in general question-answering tasks but face challenges in local contextual question-answering (CQA) within specialized domains due to the high cost of domain-specific dataset curation and unstable model performance. To address these issues, this paper proposes a two-stage framework. In the first stage, “learning to ask,” fine-tuned LLMs generate question-answer pairs from contexts, guided by contextual relevance (i.e., question-context alignment) and answer fidelity (i.e., accuracy and faithfulness of the answer). The second stage, “learning to answer,” systematically compares different CQA paradigms, including fine-tuning, retrieval-augmented generation, and proprietary LLMs. The framework is demonstrated using modular integrated construction regulatory documents. Extensive experiments yield three main insights: (1) Synthetic data generation often mirrors training distributions, necessitating effective filtering; (2) Despite inherent biases, synthetic data retains 90–100% of the performance achieved with original data and appropriate models, demonstrating its practical utility; and (3) Domain-specific, fine-tuned models achieve the best performance, underscoring the importance of tailored adaptation. This work bridges gaps in synthetic data quality assurance and domain-aware language model customization, providing practical guidelines for applications in low-resource, expertise-driven, and privacy-sensitive domains.

• Guidelines for local contextual question-answering systems in specialized domains.
• Synthetic data generation for extracting question-answer pairs from contexts.
• Introduction of quantitative metrics for assessing the quality of synthetic data.
• Comparative analysis for establishing contextual question-answering paradigms.
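
The filtering step that the first insight calls for can be illustrated with a small sketch. The abstract names the two criteria (contextual relevance and answer fidelity) but does not say how they are computed, so everything below is an assumption made for illustration: the `QAPair` type, the `overlap` score (a plain token-overlap proxy, not the paper's metric), the thresholds, and the example context, which is invented and not taken from any regulatory document.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {
    "a", "an", "the", "is", "are", "be", "it", "to", "of", "at", "in", "on",
    "what", "where", "when", "who", "which", "how", "must", "before", "each",
}


@dataclass
class QAPair:
    """One synthetic question-answer pair and the context it was generated from."""
    context: str
    question: str
    answer: str


def content_tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def overlap(candidate: str, context: str) -> float:
    """Fraction of the candidate's content tokens that also occur in the context.

    This is a crude lexical proxy chosen for the sketch, not the paper's metric.
    """
    cand = content_tokens(candidate)
    if not cand:
        return 0.0
    return len(cand & content_tokens(context)) / len(cand)


def filter_pairs(pairs, min_relevance=0.5, min_fidelity=0.8):
    """Keep pairs whose question and answer are both grounded in the context.

    The two thresholds are illustrative assumptions.
    """
    kept = []
    for pair in pairs:
        relevance = overlap(pair.question, pair.context)  # question-context alignment
        fidelity = overlap(pair.answer, pair.context)     # answer supported by context
        if relevance >= min_relevance and fidelity >= min_fidelity:
            kept.append(pair)
    return kept


# Invented example context, for illustration only.
CONTEXT = "Each module must be inspected at the factory before it is transported to the site."

pairs = [
    QAPair(CONTEXT, "Where must each module be inspected before transport?", "At the factory."),
    QAPair(CONTEXT, "What is the maximum height of a crane?", "Fifty metres."),
    QAPair(CONTEXT, "Where is each module inspected?", "At the site after delivery."),
]

kept = filter_pairs(pairs)
for pair in kept:
    print(pair.question, "->", pair.answer)
```

In this toy run only the first pair passes: the second question is unrelated to the context, and the third answer is mostly unsupported by it. A lexical proxy like this cannot detect a fluent but wrong answer that reuses context words, so a real pipeline would likely need a stronger fidelity check.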

Authors

Yinyi Wei

Xiao Li

Zhenbang Huang

Journals

Computers in Industry

Institutions

University of Hong Kong
