What does this research mean for the field?

Formula-constrained prompt engineering significantly improves the accuracy and expert-alignment of large language models when assessing the risk of bias in randomized controlled trials. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to assess how prompt engineering influences the performance of large language models in risk of bias assessments.

May 19, 2026

Impact of prompt engineering on large language models for risk of bias assessment: a comparative study

Key Points

The study aims to assess how prompt engineering influences the performance of large language models in risk of bias assessments.
Evaluated 158 randomized controlled trials from 10 dental systematic reviews.
Compared two large language models, DeepSeek-V3 and GPT-5, across four prompting strategies.
Measured accuracy and agreement metrics according to RoB-1 domains.
Mean accuracy under blank control was 0.72 for DeepSeek-V3 and 0.65 for GPT-5.
With formula-constrained prompting, accuracy rose to 0.85 for both models (p<0.001).
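Formula-constrained prompts yielded significantly higher agreement than the other prompting strategies (p<0.001) and reasoning more closely aligned with expert reasoning, as indicated by lower MMD² values than constrained output.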
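The accuracy and agreement metrics mentioned above can be illustrated with a minimal sketch. The summary does not say which agreement statistic the study used, so Cohen's kappa is an assumption here. The function names and the example judgments are hypothetical and are not data from the study. The labels follow the usual RoB-1 judgments of low, high and unclear risk.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Share of judgments where the model matches the reference standard."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cohen_kappa(reference, predicted):
    """Chance-corrected agreement (assumed metric, not confirmed by the source)."""
    n = len(reference)
    observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Expected agreement if both raters labelled independently at their own base rates.
    expected = sum(ref_counts[label] * pred_counts[label] for label in ref_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments for one RoB-1 domain across six trials.
reference = ["low", "low", "high", "unclear", "low", "high"]
predicted = ["low", "high", "high", "unclear", "low", "low"]

print(f"accuracy = {accuracy(reference, predicted):.2f}")
print(f"kappa    = {cohen_kappa(reference, predicted):.2f}")
```

Accuracy counts raw matches, while kappa discounts the matches expected by chance, which matters when one judgment such as "low" dominates a domain.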

Abstract

Objectives To evaluate the performance of large language models (LLMs) in risk of bias assessment and to examine whether prompt engineering improves their accuracy and alignment with expert reasoning.
Methods We analysed 158 randomised controlled trials from 10 dental systematic reviews. Their risk of bias assessments were reviewed and revised to serve as the reference standard. Two LLMs (DeepSeek-V3 and GPT-5) were evaluated under four prompting strategies: direct command, command with reference, constrained output and formula-constrained output. The direct command served as the blank control group, simulating the approach commonly used by clinicians, whereas the other three groups employed different prompt engineering techniques. The performance of the LLMs across the seven domains of RoB-1 was evaluated using accuracy and agreement. The reasoning process of the LLMs was expressed in the form of syllogisms, and its similarity to expert reasoning was assessed using MMD².
Results LLMs showed limited capability in risk of bias assessment under the blank control condition, with mean accuracies of 0.72 for DeepSeek-V3 and 0.65 for GPT-5. With formula-constrained prompting, the performance of both LLMs improved significantly, and overall accuracy increased to 0.85 for both DeepSeek-V3 and GPT-5 (both vs the blank control group, p<0.001). Agreement metrics showed a similar pattern, with higher agreement under formula-constrained prompting than under the other prompting strategies (p<0.001 for both models). In addition, the syllogistic output format provided a clear representation of the reasoning process underlying risk of bias assessment. Compared with constrained output, formula-constrained prompting also produced reasoning that was more closely aligned with the reference answers, as indicated by lower MMD² values (DeepSeek-V3: 0.0765 vs 0.1239; GPT-5: 0.0548 vs 0.1068).
Conclusion Prompt engineering substantially improved the performance of LLMs in risk of bias assessment. Although LLMs cannot currently replace human reviewers, they may serve as efficient and transparent tools to support this process.
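To make the MMD² comparison concrete, the following sketch computes a squared maximum mean discrepancy between two sets of vectors. It is an illustration under stated assumptions, not the study's pipeline. The abstract does not say how the syllogisms were turned into vectors or which kernel was used, so the Gaussian kernel, the bandwidth, the biased estimator and the toy two-dimensional "embeddings" are all hypothetical choices.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors; kernel choice is an assumption."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2 * mean_kernel(xs, ys, bandwidth)
    )


# Toy vectors standing in for embedded reasoning (hypothetical values).
expert = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
model_close = [[0.1, 0.0], [0.9, 0.1], [0.0, 1.1]]
model_far = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

print(f"MMD^2 (close to expert):  {mmd_squared(expert, model_close):.4f}")
print(f"MMD^2 (far from expert):  {mmd_squared(expert, model_far):.4f}")
```

A value near zero means the two sets of vectors are distributed similarly, which is why lower MMD² is read as closer alignment with the reference reasoning.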
