What question did this study set out to answer?

The study aims to evaluate the quality of multiple-choice questions reviewed by ChatGPT-4 compared to those created by medical faculty.

May 16, 2026 | Open Access

ChatGPT as a tool for reviewing multiple-choice questions in the health sector

Key Points

Cross-sectional quantitative study design
Assessment by ten external health education specialists and four study authors
Use of descriptive statistics, Wilcoxon Signed-Rank Test, and Non-Metric Multidimensional Scaling
External evaluators found no statistically significant difference in the number of criteria met between the faculty-written and ChatGPT-4-reviewed MCQs (p = 0.325)
Study authors identified a significant increase in criteria met by AI-reviewed MCQs (p < 0.001)
ChatGPT-4 improved structural clarity and adherence to item-writing principles but struggled with clinical reasoning when prompts were non-optimized.

Abstract

Artificial Intelligence (AI), particularly ChatGPT-4, offers promising applications in medical education, including multiple-choice question (MCQ) development. This study evaluated and compared the quality of 36 MCQs created by medical faculty with versions of the same questions reviewed by ChatGPT-4. A cross-sectional, quantitative approach was used. Ten external health education specialists and four study authors (internal evaluators) assessed the questions against 38 criteria. Descriptive statistics, the Wilcoxon Signed-Rank Test, and Non-Metric Multidimensional Scaling were employed. External evaluators found no statistically significant difference in the number of criteria met between versions (p = 0.325), whereas the study authors, who underwent standardization meetings, identified a statistically significant increase in the number of criteria met by ChatGPT-4-reviewed MCQs (p < 0.001). ChatGPT-4 was proficient at revising questions for greater structural clarity, objectivity, and adherence to basic item-writing principles. However, it struggled to incorporate clinical reasoning and higher-order thinking when these were lacking, particularly given the non-optimized prompt used. Despite these limitations, the AI's revisions were aligned with faculty quality standards, demonstrating its potential to complement rather than replace faculty efforts and underscoring the critical role of calibrated human expertise and effective prompt engineering.
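
The paired comparison described in the abstract (the same MCQs scored before and after ChatGPT-4 review) is the kind of design the Wilcoxon Signed-Rank Test handles. The following is a minimal sketch of that test, assuming a two-sided normal approximation with a tie correction and no continuity correction. The function name and the sample scores are hypothetical illustrations, not the study's data, and the summary does not state which software or settings the authors used.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, w_minus, n, z, p), where n counts non-zero differences.
    """
    # Paired differences; zero differences are dropped, as in the classic test.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation to the null distribution of the rank sum.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, w_minus, n, z, p


# Hypothetical number of criteria met (out of 38) for five MCQs.
original = [20, 22, 25, 18, 30]
reviewed = [24, 25, 27, 23, 29]
w_plus, w_minus, n, z, p = wilcoxon_signed_rank(original, reviewed)
print(f"W+ = {w_plus}, W- = {w_minus}, n = {n}, z = {z:.3f}, p = {p:.3f}")
```

With samples this small an exact test would normally be preferred over the normal approximation; the sketch only shows how the signed ranks are formed from paired scores.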

Iembo et al. studied this question.

synapsesocial.com/papers/6a08093ca487c87a6a40b21e https://doi.org/10.1038/s41598-026-51988-9
