Large language models (LLMs) are increasingly used to automate the generation of medical case-based multiple-choice questions (MCQs), but their accuracy, reliability, and educational validity are still not well understood. This comparative study examined nine LLMs under four prompting methods to evaluate LLM-produced MCQs for clinical coherence and readiness for assessment. A uniform evaluation pipeline measured automated text similarity (BLEU, ROUGE, and METEOR), structural and parsability properties, and operational effectiveness (latency, cost, and quality-efficiency ratios). Human validation was performed on the best-performing model and prompt combination (OpenBioLLM-70B with Chain-of-Thought), which demonstrated the best linguistic fidelity and clinically aligned reasoning. Two clinical experts independently reviewed 88 items using a five-domain rubric covering appropriateness, clarity, relevance, distractor quality, and cognitive level. Results showed significant variation across models and prompting strategies, with Chain-of-Thought yielding the best overall performance. OpenBioLLM-70B demonstrated the best overall balance of quality, parsability, and efficiency, achieving a prompt template quality score of 90.4, a consistency score of 88.8, a response time of 3.28 s, and a quality-per-dollar value of 134.11. The expert ratings confirmed clinical alignment, but there was consensus that distractor quality needed further improvement. These results provide evidence that, under optimal prompting conditions, LLMs can reliably support MCQ generation and provide large-scale, cost-effective support for medical assessment production.
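As a rough illustration of the kind of measures the abstract names, the following is a minimal sketch of a unigram-overlap similarity score in the style of ROUGE-1 and a simple quality-per-dollar ratio. The function names, the whitespace tokenization, and the definition of quality-per-dollar as score divided by cost are assumptions made for illustration; the abstract does not specify how the study computed these quantities.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 from clipped unigram overlap (hypothetical helper).

    Uses lowercased whitespace tokens, which is an assumption; real
    ROUGE implementations typically apply their own tokenization.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def quality_per_dollar(quality_score: float, cost_usd: float) -> float:
    """Assumed definition: quality score divided by cost in US dollars."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return quality_score / cost_usd


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the study.
    generated = "the patient has fever"
    gold = "the patient has a fever"
    print(round(rouge1_f1(generated, gold), 4))
    print(quality_per_dollar(90.0, 0.5))
```

A ratio of this kind lets models with different quality scores and costs be ranked on a single axis, though it hides whether a high value comes from strong quality or simply from low cost.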
Somaiya Al Shuraiqi
Adhari AlZaabi
Abdulrahman AAl Abdulsalam
Machine Learning and Knowledge Extraction
Sultan Qaboos University
Shuraiqi et al. studied this question.
www.synapsesocial.com/papers/698d6de45be6419ac0d53219 — DOI: https://doi.org/10.3390/make8020041