October 18, 2025 | Open Access

Evaluating LLM Assisted Qualitative Analysis in Medical Education Research: A Comparison of Human and AI-Generated Thematic Coding (Preprint)

Key Points

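GPT-4o achieved 96% mean agreement with human codes in the deductive condition (mean κ 0.71), showing that it can apply an existing human codebook with high reliability.
In the inductive condition, GPT-4o generated 137 initial codes, of which 31% agreed with human codes and 42% were judged Not Reasonable, indicating both strengths and weaknesses of LLMs for code generation.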
A roadmap proposes combining human and AI efforts for qualitative analysis, emphasizing human context and AI scalability.
With targeted human oversight, LLMs reliably applied an existing codebook, pointing to their potential for scalable qualitative methods.

Abstract

BACKGROUND While LLM-assisted qualitative analysis could improve the efficiency and scalability of feedback-driven curricular refinement in medical education, how best to leverage LLMs for qualitative analysis while ensuring quality outputs remains an open question. Prior work has demonstrated the feasibility of using LLMs for inductive and deductive coding tasks, but more needs to be known about how LLM-assisted thematic coding can best be deployed in a medical education context to maximize its strengths and guard against its weaknesses.

OBJECTIVE The objective of our study was to evaluate LLM performance in inductive code generation and in deductive application of a human codebook, using a student focus-group transcript, and to use the findings to propose a framework for collaborating with AI in qualitative analysis.

METHODS The qualitative data for this study consisted of a 1-hour focus group with four second-year medical students discussing a required AI-driven clinical-scenario tool (2-Sigma). Three human coders conducted an inductive thematic analysis. For the same transcript, GPT-4o generated inductive codes and applied the human codebook deductively. The researchers compared the alignment between the AI inductive codes and the human consensus codebook using three categories: Agreement, Reasonable Alternative, and Not Reasonable. Interrater reliability of AI deductive coding was evaluated using percent agreement and Cohen's κ, with textual audits of discrepancies, including "misses" (failure to apply an appropriate code) and "misfires" (application of an inappropriate code). All analyses took place between February and July 2025.

RESULTS In the inductive condition, GPT-4o generated 137 initial codes, of which 31% (n=43) demonstrated Agreement with human codes, 26% (n=36) represented Reasonable Alternatives, and 42% (n=58) were classified as Not Reasonable. In the deductive condition, the mean percent agreement for AI application of human codes was 96% (SD 4%, range 79–100%) and the mean κ was 0.71 (SD 0.26, range 0–1.00). Of all coding decisions, 57 (2%) were misfires and 28 (1%) were misses; common patterns included over-interpretation of tone, failure to recognize ideas that continued across excerpts, and difficulty distinguishing hypothetical from experienced features. Based on our findings, we suggest a roadmap that retains human interpretive control while leveraging AI scalability: humans first develop a contextually grounded codebook through inductive analysis, then use AI both as a creative partner to surface alternative codes and as a tool to apply the validated codebook across the dataset.

CONCLUSIONS With targeted human oversight, an LLM can reliably apply an existing codebook and propose additional inductive codes, offering a scalable adjunct for qualitative analysis in medical education.
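
The reliability measures named in the abstract (percent agreement, Cohen's κ, misses, and misfires) can be illustrated with a minimal Python sketch. It assumes each code is scored as a binary decision per excerpt (1 = applied, 0 = not applied), which the abstract does not spell out; the function names and the toy data are hypothetical and are not the study's data or the authors' implementation.

```python
from typing import Sequence, Tuple


def percent_agreement(human: Sequence[int], ai: Sequence[int]) -> float:
    """Fraction of excerpts on which both coders made the same decision."""
    if len(human) != len(ai) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(h == a for h, a in zip(human, ai)) / len(human)


def cohen_kappa(human: Sequence[int], ai: Sequence[int]) -> float:
    """Cohen's kappa for two raters making binary (0/1) decisions."""
    n = len(human)
    p_o = percent_agreement(human, ai)  # observed agreement
    p_h = sum(human) / n                # how often the human applies the code
    p_a = sum(ai) / n                   # how often the AI applies the code
    # Agreement expected by chance if the two raters were independent.
    p_e = p_h * p_a + (1 - p_h) * (1 - p_a)
    if p_e == 1.0:
        # Both raters gave the same constant answer; kappa is undefined (0/0).
        # Returning 1.0 is a convention assumed here, not taken from the paper.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def misses_and_misfires(human: Sequence[int], ai: Sequence[int]) -> Tuple[int, int]:
    """Count misses (human applied, AI did not) and misfires (AI applied, human did not)."""
    misses = sum(1 for h, a in zip(human, ai) if h == 1 and a == 0)
    misfires = sum(1 for h, a in zip(human, ai) if h == 0 and a == 1)
    return misses, misfires


# Illustrative toy data for ONE code over 50 excerpts (not from the study):
# the human applies the code to 4 excerpts; the AI matches 3 of those,
# misses 1, and misfires once on an excerpt the human left uncoded.
human = [1, 1, 1, 1] + [0] * 46
ai = [1, 1, 1, 0] + [1] + [0] * 45

print(f"percent agreement: {percent_agreement(human, ai):.2f}")  # 0.96
print(f"Cohen's kappa:     {cohen_kappa(human, ai):.2f}")        # 0.73
print(f"misses, misfires:  {misses_and_misfires(human, ai)}")    # (1, 1)
```

In this toy case a code that is applied rarely yields high percent agreement largely by chance (most excerpts are uncoded by both raters), so κ comes out well below the raw agreement. Under the same formula, κ is 0 whenever one rater never applies a code that the other applies at least once, which is one way a κ of 0 can coexist with high percent agreement.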
