What question did this study set out to answer?

This study aims to examine the effectiveness of different prompting strategies in generating medical MCQs using large language models.

February 12, 2026 | Open Access

Prompt Engineering Strategies for Generating Medical Case-Based MCQs with Large Language Models: A Multi-Model Comparative Study

Key Points

Evaluated nine large language models using four distinct prompting methods.
Constructed a uniform evaluation pipeline for automated and human assessment.
Measured text-similarity, structural quality, and operational effectiveness.
Conducted expert reviews of MCQs using a five-domain rubric.
Performance varied significantly across models and prompting strategies.
Chain-of-Thought prompting yielded the best overall performance.
The OpenBioLLM-70B model achieved a prompt template quality score of 90.4 and a response time of 3.28 seconds.
Experts confirmed clinical alignment but noted distractor quality needed improvement.

Abstract

Large language models (LLMs) are increasingly used to automate the generation of medical case-based multiple-choice questions (MCQs), but their accuracy, reliability, and educational validity are still not well understood. This comparative study examined nine LLMs under four different prompting methods to evaluate LLM-produced MCQs for clinical coherence and readiness for assessment. A uniform evaluation pipeline was constructed to measure automated text-similarity metrics (BLEU, ROUGE, and METEOR), structural and parsability measures, and operational effectiveness (latency, cost, and quality-efficiency ratios). Human validation was performed on the best-performing model and prompt combination (OpenBioLLM-70B with Chain-of-Thought), which demonstrated the best linguistic fidelity and clinically aligned reasoning. Two clinical experts independently reviewed 88 items using a five-domain rubric covering appropriateness, clarity, relevance, distractor quality, and cognitive level. Results indicated significant variation across models and prompting strategies, with Chain-of-Thought yielding the best overall performance. The OpenBioLLM-70B model demonstrated the best overall balance of quality, parsability, and efficiency, achieving a prompt template quality score of 90.4, a consistency score of 88.8, and a response time of 3.28 s, with a quality-per-dollar value of 134.11. The expert ratings confirmed clinical alignment, but there was consensus that distractor quality needed further improvement. These results provide evidence that LLMs under optimal prompting conditions can reliably support MCQ generation and provide large-scale, cost-effective support for medical assessment production.
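To make the automated measures mentioned in the abstract more concrete, the following is a minimal Python sketch of two of them: a unigram-overlap F1 in the style of ROUGE-1, and a quality-per-dollar ratio. It is an illustration under stated assumptions, not the study's pipeline. The function names `rouge1_f1` and `quality_per_dollar` are hypothetical, the tokenization is plain lowercased whitespace splitting, and the ratio is assumed here to be a quality score divided by cost in US dollars; the abstract does not give the exact formula, and the numbers in the usage lines are made up.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 in the style of ROUGE-1.

    Assumption: lowercased whitespace tokens, no stemming or stopword handling.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def quality_per_dollar(quality_score: float, cost_usd: float) -> float:
    """Assumed definition of a quality-efficiency ratio: quality divided by cost."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return quality_score / cost_usd


# Illustrative values only; these are not figures from the study.
generated = "A 54-year-old man presents with chest pain radiating to the left arm"
reference = "A 54-year-old man presents with crushing chest pain"
print(round(rouge1_f1(generated, reference), 3))
print(quality_per_dollar(80.0, 0.5))
```

A ratio of this kind lets models with different prices be compared on one axis, but it depends on how both the quality score and the cost are measured, so values are only comparable within a single evaluation setup.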
