What question did this study set out to answer?

The aim is to create a large-scale question-answering data set specifically for mechanical properties to improve language model performance.

March 29, 2026 | Open Access

Automatic Generation of a Mechanical Properties Question-Answering Data Set for Language Model Benchmarking: A Comparative Study of BERT, XLNet, and LLaMA Models

Key Points

Developed MechQA data set with 202,068 QA pairs from 125,967 articles.
Conducted manual evaluation of QA pairs for precision and recall.
Applied MechQA to fine-tune BERT-base, XLNet-base, and LLaMA-3.1-Instruct models.
BERT and XLNet achieved high performance on the validation set (EM: 78.03% and 78.21%, F1: 84.50% and 84.70%).
LLaMA model achieved 80.48% EM and 86.25% F1 on the validation set.
All models showed improved expected calibration error compared to baseline.
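The Exact Match (EM) and F1 figures above are the usual span-level QA metrics. The sketch below shows one common way to compute them for a single prediction, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The summary does not say which normalization the authors used, so the function names and the normalization rules here are illustrative assumptions, not the study's evaluation code.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """SQuAD-style normalization (an assumption, not taken from the paper)."""
    text = text.lower()
    # Note: this also strips decimal points, e.g. "350.5" becomes "3505".
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Data set level scores are then the mean of these per-example values, reported as percentages. For numeric answers such as property values, a normalization that preserves decimal points may be preferable to the one assumed here.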

Abstract

Contextualized language models offer new opportunities for mining materials-science information from literature, but progress is limited by the absence of domain-specific question-answering (QA) data sets. This study addresses this by introducing MechQA, a data set of 202,068 pairs of questions and answers about mechanical properties that have been automatically distilled from 125,967 articles in the literature. Unlike small manually curated QA benchmarks or approaches that rely on domain-specific pretraining, MechQA provides a large-scale, automatically generated training resource derived directly from the primary literature. It covers five fundamental mechanical properties of materials: ultimate tensile strength, yield strength, fracture strength, Young's modulus, and ductility. Manual evaluation of this data set confirmed its high quality (precision 83.76%, recall 89.09%, F1 score 86.34%). We apply MechQA to fine-tune three representative transformer models: two extractive models, BERT-base and XLNet-base, each with 110 M parameters, and a generative LLaMA-3.1-Instruct model with 8B parameters fine-tuned using low-rank adaptation (LoRA). The MechQA data set was partitioned into 181,722 training and 20,346 validation QA pairs for this application. On the validation set, domain-specific extractive models achieve strong Exact Match (EM) and F1 score performance (BERT: 78.03% EM/84.50% F1; XLNet: 78.21% EM/84.70% F1) with improved expected calibration error (ECE) of 7.98% and 6.25%, respectively, while the LLaMA-domain model achieves 80.48% EM/86.25% F1 with an ECE of 8.08%. Notably, the two extractive models exhibit competitive performance despite their significantly smaller parameter size compared to the LLaMA model. These results demonstrate that automatic QA data set generation, coupled with targeted fine-tuning, provides an effective data-centric method for domain adaptation of language models for materials science.
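For readers unfamiliar with expected calibration error (ECE), the following is a minimal sketch of the standard equal-width binning estimate: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of examples. The bin count of 10 and the way a confidence score is obtained from each model are assumptions; the abstract does not specify either.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not stated in the abstract
) -> float:
    """Equal-width-bin ECE for confidences in [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value means the model's stated confidence tracks its actual accuracy more closely; for example, four predictions all made at 0.9 confidence with three correct give an ECE of 0.15.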

Cite This Study

Zhang et al. studied this question.

synapsesocial.com/papers/69c8c22cde0f0f753b39c6d9 https://doi.org/10.1021/acs.jcim.5c02646
