What question did this study set out to answer?

To assess the alignment of AI-generated multiple-choice questions (MCQs) with Bloom’s Taxonomy.

March 1, 2026 · Open Access

Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy

Key Points

To assess the alignment of AI-generated multiple-choice questions (MCQs) with Bloom’s Taxonomy.
Evaluated five LLMs, each generating 60 MCQs (300 in total) from an oral and maxillofacial anatomy textbook.
Assessed each item using a 5-point Likert scale across Bloom’s cognitive levels.
Measured inter-rater reliability with weighted Cohen’s kappa.
Analyzed model performance and inter-model differences using the Kruskal–Wallis test.
Moderate to strong inter-rater reliability (kappa = 0.74–0.86).
Median scores above 4 for most cognitive levels; the analyzing level had a median of 3.5 for ChatGPT-4o and DeepSeek R1.
Claude Sonnet 4 outperformed others at applying, analyzing, and evaluating/creating levels.
No significant difference between models at the remembering and understanding levels (p > 0.05).
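
The inter-rater reliability figure above is a weighted Cohen's kappa, which penalizes two raters less for near-misses (a 4 versus a 5) than for large disagreements (a 1 versus a 5). The following is a minimal pure-Python sketch of that statistic for 5-point Likert ratings. The summary does not say whether linear or quadratic weights were used, so the `power` parameter (1 for linear, 2 for quadratic) and its default are assumptions, as are the function name and the example ratings, which are illustrative and not data from the study.

```python
def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), power=1):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    power=1 gives linear weights, power=2 quadratic weights.
    Which scheme the study used is not stated; linear is an assumption.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Joint proportion table: observed[i][j] is the share of items that
    # rater A put in category i and rater B put in category j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagree_obs = 0.0  # weighted disagreement actually observed
    disagree_exp = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]

    if disagree_exp == 0.0:
        return float("nan")  # both raters used a single identical category
    return 1.0 - disagree_obs / disagree_exp


if __name__ == "__main__":
    # Hypothetical ratings for six items (not taken from the study).
    investigator_1 = [5, 4, 4, 3, 5, 2]
    investigator_2 = [5, 4, 3, 3, 4, 2]
    print(weighted_kappa(investigator_1, investigator_2))
```

Expressing kappa through weighted disagreement keeps the linear and quadratic variants in one loop; a value of 1 means perfect agreement and 0 means agreement no better than chance.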

Abstract

Introduction: While LLMs are used to generate medical and dental MCQs, their alignment with Bloom’s Taxonomy remains unexplored.

Materials and Methods: Five widely used LLMs were evaluated: ChatGPT-4o (OpenAI), Copilot Pro (Microsoft), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), and DeepSeek R1 (DeepSeek). Each model generated 60 MCQs (300 in total) based on content from an oral and maxillofacial anatomy textbook across the five cognitive levels of Bloom’s Taxonomy. Two independent investigators assessed each item using a 5-point Likert scale for remembering, understanding, applying, analyzing, and evaluating/creating. Inter-rater reliability was measured using weighted Cohen’s kappa. Model performance and inter-model differences were analyzed using the Kruskal–Wallis test.

Results: Inter-rater reliability was moderate to strong (kappa = 0.74–0.86). Median scores for remembering, understanding, applying, and evaluating/creating were above 4 across all LLMs, while the analyzing level scored a median of 3.5 for ChatGPT-4o and DeepSeek R1. No significant difference was found between models at the remembering and understanding levels (p > 0.05). Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively). Within-model analysis showed that only Copilot Pro and Claude Sonnet 4 consistently aligned with Bloom’s cognitive levels across all categories. In contrast, ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels (p = 0.00, 0.00, and 0.001, respectively).

Conclusions: All LLMs performed well at lower cognitive levels, while Claude Sonnet 4 achieved the highest alignment at higher-order levels.
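
The inter-model comparison described in the abstract relies on the Kruskal–Wallis test, a rank-based test suited to ordinal Likert scores. As a rough illustration of how such a comparison works, here is a minimal pure-Python sketch that computes the tie-corrected H statistic and a chi-square approximation to the p-value. The abstract does not name the software used, so the function names, the chi-square approximation, and the example scores are all assumptions for illustration, not the authors' procedure or data.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_wallis(*groups):
    """Return (H, p) with tie correction; p uses the chi-square approximation."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n_total = sum(len(g) for g in groups)
    counts = Counter(v for g in groups for v in g)

    # Tied values share the average of the ranks they occupy.
    rank_of = {}
    start = 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = start + (t - 1) / 2.0
        start += t

    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)

    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1.0 - ties / (n_total ** 3 - n_total)
    if correction == 0.0:
        raise ValueError("all values are identical; H is undefined")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error
    return h, chi2_sf(h, len(groups) - 1)


if __name__ == "__main__":
    # Hypothetical Likert scores for three models (not data from the study).
    model_a = [5, 5, 4, 5, 4]
    model_b = [4, 3, 4, 3, 4]
    model_c = [3, 4, 3, 3, 2]
    print(kruskal_wallis(model_a, model_b, model_c))
```

Likert data produce many ties, which is why the sketch applies the tie correction; without it, H would be understated. A significant result only says that at least one group differs, so identifying which model differs requires a follow-up pairwise comparison.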

Authors

Trang Thi Nguyen

Nguyen, Linh, T K

Hà Thị Nguyệt

Journals

PLoS ONE
