What question did this study set out to answer?

The aim is to compare the performance of various large language models on pharmacy specialty certification examination practice questions.

May 1, 2026

Rise of the Machines: Comparing Performance of Artificial Intelligence Large Language Models on Pharmacy Specialty Certification Examination Practice Questions.

Key Points

The study evaluated 15 LLMs on 145 publicly available Board of Pharmacy Specialties (BPS) certification practice questions spanning 14 specialty domains.
Questions were entered using a standardized prompt without additional prompt engineering, and model responses were scored against BPS-posted answer keys.
Differences in accuracy among LLMs were tested using Cochran's Q, followed by Bonferroni-adjusted McNemar pairwise comparisons.
Mean accuracy across all LLMs was 86.2%, ranging from 79.3% (Perplexity AI) to 91.7% (Microsoft Copilot (GPT-5)).
After Bonferroni adjustment, Microsoft Copilot (GPT-5) outperformed Perplexity AI (p < 0.001) and an earlier version of Microsoft Copilot (GPT-4.1) (p < 0.001).
Specialty-level heterogeneity was generally limited, with significant model differences observed in Solid Organ Transplantation Pharmacy and Nuclear Pharmacy.
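
The testing procedure named in the key points (an omnibus Cochran's Q test followed by Bonferroni-adjusted McNemar comparisons) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 6-item, 3-model score table is toy data rather than study data, the function names are hypothetical, and the exact (binomial) form of McNemar's test is used because the source does not say which variant the authors applied.

```python
from itertools import combinations
from math import comb


def cochran_q(table):
    """Cochran's Q for a table where table[i][j] is 1 if model j
    answered item i correctly and 0 otherwise. Returns (Q, df)."""
    k = len(table[0])                                   # number of models
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)                                 # grand total of correct answers
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when every item is all-correct or all-wrong")
    return numerator / denominator, k - 1


def mcnemar_exact_p(a, b):
    """Two-sided exact McNemar p-value for two paired 0/1 score lists."""
    a_only = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    b_only = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = a_only + b_only                                 # discordant pairs only
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): rows are items, columns are three models.
table = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
]

q_stat, df = cochran_q(table)                           # Q = 7.6 with df = 2

# Bonferroni: divide alpha by the number of pairwise comparisons.
pairs = list(combinations(range(len(table[0])), 2))
alpha_adjusted = 0.05 / len(pairs)

pairwise_p = {
    (i, j): mcnemar_exact_p([row[i] for row in table], [row[j] for row in table])
    for i, j in pairs
}
significant = {pair: p < alpha_adjusted for pair, p in pairwise_p.items()}
```

With 15 models the omnibus test has 14 degrees of freedom, matching the reported df. If all 105 model pairs were compared (an assumption; the source does not state the number of comparisons), the Bonferroni-adjusted threshold at alpha = 0.05 would be roughly 0.00048.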

Abstract

BACKGROUND: Large language models (LLMs) are increasingly used for clinical information retrieval and decision support, yet comparative performance on pharmacy board examination-style content across specialties remains incompletely characterized.

METHODS: We evaluated 15 LLMs using 145 publicly available Board of Pharmacy Specialties (BPS) certification practice questions spanning 14 specialty domains. LLMs were assessed using their default user-facing settings, and questions were entered using a standardized prompt without additional prompt engineering. Model responses were scored against BPS-posted answer keys. Overall and specialty-level accuracy were summarized descriptively. Differences among LLMs were tested using Cochran's Q with Bonferroni-adjusted McNemar pairwise comparisons when appropriate.

RESULTS: Across all LLMs, mean accuracy was 86.2% (standard deviation [SD], 3.5%), corresponding to an average of 125/145 items answered correctly. Accuracy ranged from 79.3% (95% confidence interval [CI], 72.6%-86.0%) for Perplexity AI to 91.7% (95% CI, 87.2%-96.3%) for Microsoft Copilot (GPT-5). Overall performance differed significantly across LLMs (Cochran's Q = 46.262; df = 14; p < 0.001). After Bonferroni adjustment, Microsoft Copilot (GPT-5), Google Gemini 2.5 Flash, and OpenAI o3 (Reasoning) outperformed Perplexity AI (p < 0.001). Microsoft Copilot (GPT-5) also outperformed an earlier version of Microsoft Copilot (GPT-4.1) (p < 0.001). Specialty-level heterogeneity was generally limited, with significant model differences observed in Solid Organ Transplantation Pharmacy and Nuclear Pharmacy.

CONCLUSIONS: LLMs demonstrated high accuracy on BPS certification practice questions, with limited variability across LLMs and select specialty domains. These findings support continued evaluation of LLMs for potential use in pharmacy practice and clinical decision support, emphasizing the need for domain-specific validation and ongoing monitoring as LLMs evolve.
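
As a rough check on how the reported accuracies relate to item counts, the sketch below converts a count of correct answers into a percentage with an approximate 95% confidence interval. The counts 133/145 and 115/145 are inferred from the reported 91.7% and 79.3%; they are not stated in the source. The interval method is also not stated, so the normal-approximation (Wald) interval used here is an assumption, and its bounds may differ slightly from the published ones. The function name `wald_ci` is hypothetical.

```python
from math import sqrt


def wald_ci(correct, total, z=1.96):
    """Proportion correct with a normal-approximation (Wald) interval.

    Returns (p, lower, upper) as fractions between 0 and 1.
    """
    p = correct / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts inferred from the reported percentages (an assumption).
copilot = wald_ci(133, 145)      # reported accuracy: 91.7%
perplexity = wald_ci(115, 145)   # reported accuracy: 79.3%

for name, (p, lo, hi) in [("Copilot (GPT-5)", copilot), ("Perplexity AI", perplexity)]:
    print(f"{name}: {p:.1%} (approx. 95% CI {lo:.1%} to {hi:.1%})")
```

Other interval methods (for example Wilson or t-based intervals) give somewhat different bounds at this sample size, which is one reason a recomputed interval need not match the published figures exactly.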


Cite This Study

Collins et al. studied this question.

synapsesocial.com/papers/69f443cb967e944ac5566ea1 https://doi.org/10.1002/jac5.70215
