Contextualized language models offer new opportunities for mining materials-science information from the literature, but progress is limited by the absence of domain-specific question-answering (QA) data sets. This study addresses this gap by introducing MechQA, a data set of 202,068 question-answer pairs about mechanical properties, automatically distilled from 125,967 articles. Unlike small, manually curated QA benchmarks or approaches that rely on domain-specific pretraining, MechQA provides a large-scale, automatically generated training resource derived directly from the primary literature. It covers five fundamental mechanical properties of materials: ultimate tensile strength, yield strength, fracture strength, Young's modulus, and ductility. Manual evaluation of the data set confirmed its high quality (precision 83.76%, recall 89.09%, F1 score 86.34%). We apply MechQA to fine-tune three representative transformer models: two extractive models, BERT-base and XLNet-base, each with 110M parameters, and a generative LLaMA-3.1-Instruct model with 8B parameters, fine-tuned using low-rank adaptation (LoRA). For this application, the MechQA data set was partitioned into 181,722 training and 20,346 validation QA pairs. On the validation set, the domain-specific extractive models achieve strong Exact Match (EM) and F1 scores (BERT: 78.03% EM/84.50% F1; XLNet: 78.21% EM/84.70% F1) with improved expected calibration errors (ECE) of 7.98% and 6.25%, respectively, while the LLaMA-domain model achieves 80.48% EM/86.25% F1 with an ECE of 8.08%. Notably, the two extractive models remain competitive despite having roughly 70 times fewer parameters than the LLaMA model. These results demonstrate that automatic QA data set generation, coupled with targeted fine-tuning, provides an effective data-centric method for adapting language models to materials science.
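For readers unfamiliar with the three metrics reported above, the following is a minimal Python sketch of how EM, token-level F1, and ECE are commonly computed for QA models. It is not the authors' evaluation code: the lowercase-and-whitespace normalization, the equal-width binning, the default of 10 bins, and all function names are assumptions made for illustration.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the answers match after lowercasing and whitespace
    splitting (an assumed normalization), else 0.0."""
    return float(prediction.lower().split() == reference.lower().split())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width
    confidence bins (bin count is an assumption)."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

As a usage illustration, `token_f1("about 520 MPa", "520 MPa")` gives precision 2/3 and recall 1, hence an F1 of 0.8, while `exact_match` returns 0.0 for the same pair. The same harmonic-mean formula applied to the reported data-set precision (83.76%) and recall (89.09%) reproduces the stated F1 score of 86.34%.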
Zhang et al. (Thu,) studied this question.