Background: Organizations evaluating multi-agent LLM systems for clinical use rely on prompt-level safety instructions to constrain them. These instructions are typically validated on synthetic benchmarks, and whether that validation transfers to real electronic health record (EHR) data has not been tested at scale.
Methods: I evaluated four prompt-level safety instruction sets (NIST AI RMF, CHAI Blueprint, WHO AI Ethics, GAIF-4) across four model families (GPT-4o-mini, Llama 4 Maverick, Claude Sonnet 4.6, GPT-OSS-120b) on 400 synthetic vignettes and 400 ICU summaries (200 from MIMIC-IV; 200 from eICU CRD). Total experimental scale: approximately 20,700 pipeline trials plus 1,550 post-hoc severity assessments. Primary outcome: Emergent Misinformation Rate (EMR; distinct from Electronic Medical Record). An LLM-extraction validation directly tested the verbosity confound hypothesis and did not support it (d=+1.20, p<0.001). Bonferroni correction was applied across the six real-EHR comparisons (alpha=0.0083).
Results: On synthetic data, the WHO-derived operationalization tested reduced GPT-4o-mini EMR by 48%. On real MIMIC-IV data (n=200), the same operationalization significantly increased EMR for GPT-4o-mini (+32.4%, d=+0.58, p<0.001), Claude Sonnet 4.6 (+9.9%, d=+0.34, p=0.0006), and GPT-OSS-120b (+8.6%, d=+0.49, p<0.001); Llama 4 showed no significant change. The GPT-4o-mini reversal replicated on eICU (+41.6%, d=+0.68, p<0.001); the Llama 4 eICU result was directional only (+10.3%, p=0.044; did not survive Bonferroni correction). No operationalization tested produced a significant EMR reduction in any of the six real-EHR comparisons.
Conclusions: For the specific operationalizations tested, synthetic-validated prompt-level safety instructions either reversed direction on real ICU data (GPT-4o-mini under the WHO-derived operationalization, replicated on eICU) or substantially overestimated benefit (Claude Sonnet 4.6, GPT-OSS-120b); no model showed governance benefit in any of the six real-EHR comparisons.
Healthcare organizations evaluating or piloting prompt-level safety instructions on multi-agent clinical LLM systems should empirically validate these operationalizations on real clinical data, with model-specific testing, before deployment.
This is independent research and does not represent the views, policies, or endorsement of Blue Shield of California. No proprietary data were used. The author developed GAIF-4, one of the four prompt-level safety instruction sets evaluated; this is disclosed as a competing interest.
Aman Sharma (Sat,) studied this question.
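The Bonferroni threshold quoted in the Methods (alpha=0.0083 across six real-EHR comparisons) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only, not the author's analysis code: it assumes a family-wise alpha of 0.05 (implied by 0.0083 × 6 but not stated in the abstract), and the names `bonferroni_threshold` and `survives` are hypothetical. Only the two exact p-values given in the abstract are used; results reported as p<0.001 fall below the threshold by definition.

```python
# Illustrative sketch of the Bonferroni check described in the abstract.
# ASSUMPTION: the family-wise alpha is 0.05 (inferred from 0.0083 * 6).
FAMILY_ALPHA = 0.05
N_COMPARISONS = 6  # six real-EHR comparisons, as stated in the abstract


def bonferroni_threshold(family_alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return family_alpha / n_comparisons


def survives(p_value: float, threshold: float) -> bool:
    """True if a p-value is below the corrected threshold."""
    return p_value < threshold


threshold = bonferroni_threshold(FAMILY_ALPHA, N_COMPARISONS)

# Exact p-values quoted in the abstract; labels are for display only.
reported = {
    "Claude Sonnet 4.6, MIMIC-IV": 0.0006,
    "Llama 4, eICU": 0.044,
}

print(f"corrected alpha = {threshold:.4f}")
for label, p in reported.items():
    verdict = "survives" if survives(p, threshold) else "does not survive"
    print(f"{label}: p={p} {verdict} correction")
```

Under these assumptions the corrected threshold rounds to 0.0083, the Claude Sonnet 4.6 result (p=0.0006) clears it, and the Llama 4 eICU result (p=0.044) does not, which matches the abstract's description of that result as directional only.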