BACKGROUND: Large language models (LLMs) are increasingly used for clinical information retrieval and decision support, yet their comparative performance on pharmacy board examination-style content across specialties remains incompletely characterized. METHODS: We evaluated 15 LLMs using 145 publicly available Board of Pharmacy Specialties (BPS) certification practice questions spanning 14 specialty domains. LLMs were assessed using their default user-facing settings, and questions were entered using a standardized prompt without additional prompt engineering. Model responses were scored against BPS-posted answer keys. Overall and specialty-level accuracy were summarized descriptively. Differences among LLMs were tested using Cochran's Q test, with Bonferroni-adjusted pairwise McNemar comparisons when appropriate. RESULTS: Across all LLMs, mean accuracy was 86.2% (standard deviation [SD], 3.5%), corresponding to an average of 125/145 items answered correctly. Accuracy ranged from 79.3% (95% confidence interval [CI], 72.6%-86.0%) for Perplexity AI to 91.7% (95% CI, 87.2%-96.3%) for Microsoft Copilot (GPT-5). Overall performance differed significantly across LLMs (Cochran's Q = 46.262; df = 14; p < 0.001). After Bonferroni adjustment, Microsoft Copilot (GPT-5), Google Gemini 2.5 Flash, and OpenAI o3 (Reasoning) each outperformed Perplexity AI (p < 0.001). Microsoft Copilot (GPT-5) also outperformed an earlier version of Microsoft Copilot (GPT-4.1) (p < 0.001). Specialty-level heterogeneity was generally limited, with significant model differences observed in Solid Organ Transplantation Pharmacy and Nuclear Pharmacy. CONCLUSIONS: LLMs demonstrated high accuracy on BPS certification practice questions, with limited variability among LLMs overall and within select specialty domains. These findings support continued evaluation of LLMs for potential use in pharmacy practice and clinical decision support, while underscoring the need for domain-specific validation and ongoing monitoring as LLMs evolve.
Collins et al. (Fri,) studied this question.
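The analysis named in METHODS (Cochran's Q as an omnibus test across models, then Bonferroni-adjusted pairwise McNemar comparisons) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the model labels, and the 8-item toy score matrix are invented for illustration, and the exact (binomial) form of McNemar's test is an assumption, since the abstract does not say which variant was used. Only the Q statistic and its degrees of freedom are computed; a p-value would come from a chi-square distribution with those degrees of freedom.

```python
from itertools import combinations
from math import comb


def cochran_q(rows):
    """Cochran's Q for rows of 0/1 scores (one row per item, one column per model)."""
    k = len(rows[0])                                   # number of models
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n_total = sum(row_totals)                          # total correct answers
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n_total ** 2)
    # Denominator is zero only if every item is answered identically by all models.
    denominator = k * n_total - sum(t * t for t in row_totals)
    return numerator / denominator, k - 1              # (Q, degrees of freedom)


def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value based on the discordant items of two models."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b
    if n == 0:
        return 1.0                                     # no discordant items
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pairwise_bonferroni(scores):
    """Bonferroni-adjusted McNemar p-values for every pair of models."""
    pairs = list(combinations(scores, 2))
    m = len(pairs)                                     # number of comparisons
    return {(u, v): min(1.0, m * mcnemar_exact(scores[u], scores[v]))
            for u, v in pairs}


# Invented toy data: 1 = correct, 0 = incorrect, for 8 hypothetical items.
scores = {
    "model_a": [1, 1, 1, 1, 1, 1, 1, 0],
    "model_b": [1, 1, 1, 0, 1, 1, 0, 0],
    "model_c": [1, 0, 0, 0, 1, 0, 0, 0],
}
rows = list(zip(*scores.values()))                     # one tuple per item
q, df = cochran_q(rows)
adjusted = pairwise_bonferroni(scores)
print(f"Q = {q:.3f}, df = {df}")
for pair, p in adjusted.items():
    print(pair, round(p, 4))
```

With 15 models, comparing every pair would mean 105 McNemar tests, so a Bonferroni correction at a nominal 0.05 level would require an unadjusted p-value below roughly 0.00048. Whether the study compared all pairs is not stated in the abstract.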