Frontier large language models (LLMs) may affect biosecurity both by lowering barriers to biological weapons acquisition and by strengthening clinical preparedness. Physicians' timely recognition of the clinical manifestations of Category A bioterrorism agents is crucial but challenging, given their overlap with more common conditions. This study evaluated whether frontier LLMs can accurately recognise the clinical manifestations of Category A bioterrorism agents when presented with clinical vignettes. Eighteen expert-validated vignettes were constructed, five depicting Category A agents (anthrax, botulism, plague, smallpox, and viral haemorrhagic fever) and thirteen depicting different but clinically overlapping diagnoses. On 10 December 2025, four frontier LLMs (ChatGPT 5.1, Claude Opus 4.5, Gemini 3 Pro, and Grok 4.1) were each prompted, in new chat sessions, to provide exactly five differential diagnoses ranked from most to least likely for every vignette. The primary outcome was whether the correct diagnosis was identified as most likely; the secondary outcome was whether, if not ranked first, it appeared anywhere in the top five. Diagnostic accuracy across models was compared using Cochran's Q test, with McNemar's exact tests for pairwise comparisons. All the models generated responses for all vignettes. For the primary outcome, Claude Opus 4.5 and Gemini 3 Pro correctly identified the most likely diagnosis in all 18 cases (100.0%), whereas ChatGPT 5.1 and Grok 4.1 did so in 16 of 18 cases (88.9%); differences were not statistically significant (Q = 4.00, df = 3, p = 0.26). In the Category A agent subgroup, primary-outcome accuracy ranged from 60.0% to 100.0%. When incorrect, ChatGPT 5.1 consistently included the correct diagnosis within its differentials, whereas Grok 4.1 did not. 
Under idealised vignette conditions, frontier LLMs thus demonstrated high accuracy in recognising Category A bioterrorism syndromes, suggesting potential utility as diagnostic support and educational tools for strengthening clinical recognition and, with appropriate governance, health security.
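The reported test statistics can be reproduced from the headline counts alone. The sketch below is illustrative and is not the author's analysis code. It rebuilds an 18 x 4 correct/incorrect table, computes Cochran's Q by hand, and runs exact McNemar tests on two model pairs. The abstract does not say which vignettes were missed, so the row positions in `MISSES` are hypothetical. The sketch assumes that ChatGPT 5.1 and Grok 4.1 erred on four different vignettes, because that arrangement yields Q = 4.00; if both had missed the same two vignettes, Q would be 6.00. The helper names (`cochran_q`, `chi2_sf_df3`, `mcnemar_exact`) are likewise invented for this example.

```python
import math

MODELS = ["ChatGPT 5.1", "Claude Opus 4.5", "Gemini 3 Pro", "Grok 4.1"]
N_VIGNETTES = 18

# ASSUMPTION: the row indices below are hypothetical placeholders. Only the
# counts (2 misses each, on non-overlapping vignettes) are implied by the text.
MISSES = {"ChatGPT 5.1": {0, 1}, "Grok 4.1": {2, 3}}

# table[i][j] = 1 if model j ranked the correct diagnosis first on vignette i.
table = [
    [0 if i in MISSES.get(model, set()) else 1 for model in MODELS]
    for i in range(N_VIGNETTES)
]


def cochran_q(rows):
    """Return (Q, df) for a list of binary rows (subjects x treatments)."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2)
    # The denominator is zero only if every row is all-0 or all-1.
    denominator = k * total - sum(r * r for r in row_totals)
    return numerator / denominator, k - 1


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    t = x / 2
    return math.erfc(math.sqrt(t)) + 2 * math.sqrt(t / math.pi) * math.exp(-t)


def mcnemar_exact(rows, a, b):
    """Two-sided exact (binomial) McNemar p-value for columns a and b."""
    only_a = sum(1 for r in rows if r[a] == 1 and r[b] == 0)
    only_b = sum(1 for r in rows if r[a] == 0 and r[b] == 1)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


q, df = cochran_q(table)
p = chi2_sf_df3(q)  # valid here because df == 3 (four models)
print(f"Cochran's Q = {q:.2f}, df = {df}, p = {p:.2f}")
print("ChatGPT 5.1 vs Claude Opus 4.5:", mcnemar_exact(table, 0, 1))
print("ChatGPT 5.1 vs Grok 4.1:", mcnemar_exact(table, 0, 3))
```

With only two to four discordant vignettes per pair, the exact McNemar p-values cannot fall below 0.5. This illustrates how little power 18 vignettes give for detecting differences between models.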
Richard Armitage studied this question.