What question did this study set out to answer?

This evaluation tests whether prompt-level safety instructions validated on synthetic benchmarks remain effective when clinical large language models (LLMs) are run on real electronic health record (EHR) data.

May 19, 2026 · Open Access

Synthetic benchmark validation does not transfer to real-data behavior for the tested prompt-level safety operationalizations in clinical LLMs: a multi-model multi-institution evaluation

Key Points

Evaluated four prompt-level safety instruction sets across four model families on synthetic vignettes and real ICU summaries (N=400 synthetic; N=400 real).
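Conducted approximately 20,700 pipeline trials and 1,550 post-hoc severity assessments; the primary outcome was Emergent Misinformation Rate (EMR; distinct from Electronic Medical Record).
Applied Bonferroni correction across six real-EHR comparisons (alpha=0.0083) to control false positives.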
The WHO-derived prompt-level safety instruction set reduced EMR by 48% on synthetic data for GPT-4o-mini but increased EMR by 32.4% on real MIMIC-IV data (n=200, d=+0.58, p<0.001).
Similar EMR increases were observed in Claude Sonnet 4.6 (+9.9%, d=+0.34, p=0.0006) and GPT-OSS-120b (+8.6%, d=+0.49, p<0.001).
No operationalization significantly reduced EMR across the six real-EHR comparisons conducted.
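
The Bonferroni step mentioned above can be illustrated with a short sketch. It assumes a family-wise alpha of 0.05, which is not stated explicitly in the study but is consistent with the reported per-comparison threshold of 0.0083 across six comparisons. The function names are hypothetical, and only the two p-values reported exactly in the abstract are used; results given as "p<0.001" are upper bounds and are left out.

```python
# Minimal sketch of a Bonferroni-adjusted significance check.
# ASSUMPTION: family-wise alpha of 0.05 (implied by 0.0083 over six
# comparisons, not stated explicitly in the study).
FAMILY_ALPHA = 0.05
N_COMPARISONS = 6


def bonferroni_threshold(family_alpha: float, n_comparisons: int) -> float:
    """Return the per-comparison alpha after Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return family_alpha / n_comparisons


def survives_correction(p_value: float,
                        family_alpha: float = FAMILY_ALPHA,
                        n_comparisons: int = N_COMPARISONS) -> bool:
    """True if p_value falls below the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_threshold(family_alpha, n_comparisons)


# Only the p-values the abstract reports exactly.
reported_p_values = {
    "Claude Sonnet 4.6, MIMIC-IV": 0.0006,
    "Llama 4, eICU": 0.044,
}

if __name__ == "__main__":
    threshold = bonferroni_threshold(FAMILY_ALPHA, N_COMPARISONS)
    print(f"Per-comparison threshold: {threshold:.4f}")
    for label, p in reported_p_values.items():
        verdict = "survives" if survives_correction(p) else "does not survive"
        print(f"{label}: p={p} {verdict} correction")
```

Under these assumptions the Claude Sonnet 4.6 result clears the adjusted threshold, while the Llama 4 eICU result (p=0.044) would pass an uncorrected 0.05 cutoff but not the corrected one, which matches how the abstract describes it.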

Abstract

Background: Organizations evaluating multi-agent LLM systems for clinical use rely on prompt-level safety instructions to constrain these systems. These instructions are typically validated on synthetic benchmarks. Whether such validation transfers to real EHR data has not been tested at scale.

Methods: I evaluated four prompt-level safety instruction sets (NIST AI RMF, CHAI Blueprint, WHO AI Ethics, GAIF-4) across four model families (GPT-4o-mini, Llama 4 Maverick, Claude Sonnet 4.6, GPT-OSS-120b) on 400 synthetic vignettes and 400 ICU summaries (200 MIMIC-IV; 200 from eICU CRD). Total experimental scale: approximately 20,700 pipeline trials plus 1,550 post-hoc severity assessments. Primary outcome: Emergent Misinformation Rate (EMR; distinct from Electronic Medical Record). LLM-extraction validation directly tested and did not support the verbosity confound hypothesis (d=+1.20, p<0.001). Bonferroni applied across six real-EHR comparisons (alpha=0.0083).

Results: On synthetic data, the WHO-derived operationalization tested reduced GPT-4o-mini EMR by 48%. On real MIMIC-IV (n=200), the same operationalization significantly increased EMR for GPT-4o-mini (+32.4%, d=+0.58, p<0.001), Claude Sonnet 4.6 (+9.9%, d=+0.34, p=0.0006), and GPT-OSS-120b (+8.6%, d=+0.49, p<0.001); Llama 4 showed no significant change. The GPT-4o-mini reversal replicated on eICU (+41.6%, d=+0.68, p<0.001); the Llama 4 eICU result was directional only (+10.3%, p=0.044, did not survive Bonferroni correction). No operationalization tested produced significant EMR reduction in any of the six real-EHR comparisons.

Conclusions: Synthetic-validated prompt-level safety instructions either reversed direction (GPT-4o-mini under the WHO-derived operationalization tested, replicated on eICU) or substantially overestimated benefit on real ICU data (Claude, GPT-OSS-120b) for the specific operationalizations tested; no model showed governance benefit on real data across the six real-EHR comparisons. Healthcare organizations evaluating or piloting prompt-level safety instructions on multi-agent clinical LLM systems should empirically validate these operationalizations on real clinical data with model-specific testing before deployment.

Independent research. Does not represent the views, policies, or endorsement of Blue Shield of California. No proprietary data used. The author developed GAIF-4, one of the four prompt-level safety instruction sets evaluated; this is disclosed as a competing interest.
