Objectives To evaluate the performance of large language models (LLMs) in risk of bias assessment and to examine whether prompt engineering improves their accuracy and alignment with expert reasoning.
Methods We analysed 158 randomised controlled trials from 10 dental systematic reviews; their risk of bias assessments were reviewed and revised to serve as the reference standard. Two LLMs (DeepSeek-V3 and GPT-5) were evaluated under four prompting strategies: direct command, command with reference, constrained output and formula-constrained output. The direct command served as the blank control, simulating the approach commonly used by clinicians, whereas the other three strategies applied different forms of prompt engineering. Performance across the seven domains of RoB-1 was evaluated using accuracy and agreement. The reasoning of the LLMs was expressed as syllogisms, and its similarity to expert reasoning was assessed using MMD².
Results LLMs showed limited capability in risk of bias assessment under the blank control condition, with mean accuracies of 0.72 for DeepSeek-V3 and 0.65 for GPT-5. With formula-constrained prompting, the performance of both LLMs improved significantly, with overall accuracy rising to 0.85 for both DeepSeek-V3 and GPT-5 (both p<0.001 vs the blank control). Agreement metrics showed a similar pattern, with higher agreement under formula-constrained prompting than under the other prompting strategies (p<0.001 for both models). In addition, the syllogistic output format provided a clear representation of the reasoning underlying each risk of bias judgement. Compared with constrained output, formula-constrained prompting also produced reasoning that was more closely aligned with the reference answers, as indicated by lower MMD² values (DeepSeek-V3: 0.0765 vs 0.1239; GPT-5: 0.0548 vs 0.1068).
Conclusion Prompt engineering substantially improved the performance of LLMs in risk of bias assessment. Although LLMs cannot currently replace human reviewers, they may serve as efficient and transparent tools to support this process.
Xiong et al. (Sun,) studied this question.
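As an illustration of the similarity measure named in the Methods, the following is a minimal sketch of how an MMD² value could be computed between two sets of reasoning representations. It assumes that MMD² denotes the squared maximum mean discrepancy, that each piece of reasoning has already been converted to a numeric vector, and that a Gaussian (RBF) kernel is used. The abstract does not state the embedding method, kernel or bandwidth, so the function names, the gamma value and the toy vectors below are all hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd2_biased(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    Lower values mean the two samples are distributed more similarly.
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy vectors standing in for embedded reasoning (not study data).
expert = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
model_near = [(0.1, 0.0), (1.0, 0.1), (0.0, 0.9)]
model_far = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

near_score = mmd2_biased(expert, model_near)
far_score = mmd2_biased(expert, model_far)
```

In this sketch, the sample that lies closer to the expert vectors yields the smaller score, which is the sense in which the lower MMD² values reported in the Results indicate closer alignment with the reference reasoning.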