Artificial Intelligence (AI), particularly ChatGPT-4, offers promising applications in medical education, including multiple-choice question (MCQ) development. This study aimed to evaluate and compare the quality of 36 MCQs created by medical faculty with versions of the same questions reviewed by ChatGPT-4. A cross-sectional, quantitative approach was used. Ten external health education specialists and four study authors (internal evaluators) assessed the questions against 38 criteria. Data were analyzed using descriptive statistics, the Wilcoxon Signed-Rank Test, and Non-Metric Multidimensional Scaling. External evaluators found no statistically significant difference in the number of criteria met between versions (p = 0.325), whereas the study authors, who underwent standardization meetings, identified a statistically significant increase in the number of criteria met by ChatGPT-4-reviewed MCQs (p < 0.001). ChatGPT-4 proved proficient at modifying questions to improve structural clarity and adherence to basic item-writing principles, resulting in questions with increased clarity and objectivity. However, it struggled to incorporate clinical reasoning and higher-order thinking when these were lacking, particularly given the non-optimized prompt used. Despite these limitations, the AI's revisions were aligned with faculty quality standards, demonstrating its potential to complement rather than replace faculty efforts and underscoring the critical role of calibrated human expertise and effective prompt engineering.
Iembo et al. studied this question.