What question did this study set out to answer?

This study evaluated whether frontier large language models can accurately recognize the clinical manifestations of Category A bioterrorism agents.

March 18, 2026 | Open Access

Frontier large language models and clinical recognition of Category A bioterrorism agents: a cross-sectional analysis

Key Points

This study evaluated whether frontier large language models can accurately recognize the clinical manifestations of Category A bioterrorism agents.
Eighteen clinical vignettes were constructed: five depicting Category A agents and thirteen depicting different but clinically overlapping conditions.
Four large language models were each prompted, in new chat sessions, to provide five ranked differential diagnoses for every vignette.
Diagnostic accuracy across models was compared using Cochran's Q test, with McNemar's exact tests for pairwise comparisons.
Claude Opus 4.5 and Gemini 3 Pro achieved 100.0% accuracy in identifying the most likely diagnoses for all vignettes.
ChatGPT 5.1 and Grok 4.1 correctly identified the most likely diagnoses in 88.9% of cases.
In the Category A agent subgroup, recognition accuracy ranged from 60.0% to 100.0%.

Abstract

Frontier large language models (LLMs) may affect biosecurity both by lowering barriers to biological weapons acquisition and by strengthening clinical preparedness. Physicians' timely recognition of the clinical manifestations of Category A bioterrorism agents is crucial but challenging, given their overlap with more common conditions. This study evaluated whether frontier LLMs can accurately recognise the clinical manifestations of Category A bioterrorism agents when presented with clinical vignettes. Eighteen expert-validated vignettes were constructed, five depicting Category A agents (anthrax, botulism, plague, smallpox, and viral haemorrhagic fever) and thirteen depicting different but clinically overlapping diagnoses. On 10 December 2025, four frontier LLMs (ChatGPT 5.1, Claude Opus 4.5, Gemini 3 Pro, and Grok 4.1) were each prompted, in new chat sessions, to provide exactly five differential diagnoses ranked from most to least likely for every vignette. The primary outcome was whether the correct diagnosis was identified as most likely; the secondary outcome was whether, if not ranked first, it appeared anywhere in the top five. Diagnostic accuracy across models was compared using Cochran's Q test, with McNemar's exact tests for pairwise comparisons. All the models generated responses for all vignettes. For the primary outcome, Claude Opus 4.5 and Gemini 3 Pro correctly identified the most likely diagnosis in all 18 cases (100.0%), whereas ChatGPT 5.1 and Grok 4.1 did so in 16 of 18 cases (88.9%); differences were not statistically significant (Q = 4.00, df = 3, p = 0.26). In the Category A agent subgroup, primary-outcome accuracy ranged from 60.0% to 100.0%. When incorrect, ChatGPT 5.1 consistently included the correct diagnosis within its differentials, whereas Grok 4.1 did not. 
Under idealised vignette conditions, frontier LLMs thus demonstrated high accuracy in recognising Category A bioterrorism syndromes, suggesting potential utility as diagnostic support and educational tools for strengthening clinical recognition and, with appropriate governance, health security.
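
The reported omnibus statistic can be checked with a short sketch. The abstract gives only per-model totals (18, 18, 16 and 16 correct out of 18), so the per-vignette error pattern below is an assumption and not data from the paper: it supposes that ChatGPT 5.1 and Grok 4.1 each missed two vignettes and that their misses did not overlap. Under that assumption the calculation gives Q = 4.00 with df = 3 and p of about 0.26, matching the reported values. The helper names (`cochran_q`, `chi2_sf_df3`, `mcnemar_exact`) are illustrative, and the p-value helper uses a closed form that is valid only for three degrees of freedom.

```python
import math

def cochran_q(table):
    """Cochran's Q for rows of 0/1 outcomes (one row per vignette, one column per model)."""
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator, k - 1

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for two paired lists of 0/1 outcomes."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(n01, n10) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Columns: ChatGPT 5.1, Claude Opus 4.5, Gemini 3 Pro, Grok 4.1 (1 = correct top diagnosis).
# ASSUMED error pattern: which vignettes each model missed is hypothetical.
table = (
    [[0, 1, 1, 1]] * 2     # two vignettes missed by ChatGPT 5.1 only
    + [[1, 1, 1, 0]] * 2   # two different vignettes missed by Grok 4.1 only
    + [[1, 1, 1, 1]] * 14  # remaining vignettes correct for all four models
)

q, df = cochran_q(table)
p = chi2_sf_df3(q)
print(f"Q = {q:.2f}, df = {df}, p = {p:.2f}")  # Q = 4.00, df = 3, p = 0.26

chatgpt = [row[0] for row in table]
claude = [row[1] for row in table]
print(f"McNemar exact, ChatGPT vs Claude: p = {mcnemar_exact(chatgpt, claude):.2f}")
```

With only two discordant vignettes in a pairwise comparison, the exact McNemar p-value cannot fall below 0.5, which is consistent with the non-significant result reported for a sample of this size.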

Richard Armitage studied this question.

synapsesocial.com/papers/69ba41e04e9516ffd37a1c5d
https://doi.org/10.1080/23779497.2026.2643956
