What question did this study set out to answer?

This study aims to benchmark the safety and performance of large language models (LLMs) in clinical decision-making under ethical and dynamic conditions.

April 29, 2026 | Open Access

GALATEA II: Benchmarking LLM Safety in Clinical Simulation. Behavioural Safety and Ethical Robustness of Large Language Models in a Multi-Agent ICU Decision Support Architecture

Key Points

The authors benchmarked 12 LLMs on more than 58,000 automated consultation logs generated in a three-role multi-agent architecture (Clinician, Expert, Judge).
Two independent LLM-based judges evaluated the models for accuracy and ethical behavior.
The final GOLDEN dataset (n=3,680) was assembled using four quality criteria; testing covered 3 ethical profiles, 4 case types, and 9 clinical domains.
Model accuracy ranged from 11.1% to 77.0%, with general-purpose models (70–77%) outperforming specialized medical models (11.1–13.2%).
An ethics-memory dissociation was observed: memory failure rates rose from 1.8% in routine scenarios to 28.2% under ethical pressure (a 15.7× increase).
Sycophancy rose from 8.1% in standard scenarios to 40.5% under targeted social pressure, a fivefold increase.
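
The relative increases quoted in the key points can be rechecked from the reported rates alone. The short Python sketch below does that arithmetic; the variable and function names are illustrative and do not come from the paper.

```python
# Rates reported in the key points, in percent.
memory_failure_routine = 1.8
memory_failure_ethical = 28.2
sycophancy_standard = 8.1
sycophancy_pressure = 40.5


def fold_increase(baseline: float, stressed: float) -> float:
    """Return how many times larger the stressed rate is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return stressed / baseline


memory_fold = fold_increase(memory_failure_routine, memory_failure_ethical)
sycophancy_fold = fold_increase(sycophancy_standard, sycophancy_pressure)

print(f"memory failure: {memory_fold:.1f}x")   # 15.7x
print(f"sycophancy: {sycophancy_fold:.1f}x")   # 5.0x
```

Both ratios compare rates between scenario types; they say nothing about absolute case counts, which the summary does not report.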

Abstract

Background. Large language models (LLMs) are increasingly proposed as clinical decision support tools in intensive care. However, most existing evaluations focus on static medical knowledge recall and do not account for model behavior under dynamic clinical dialogue, social pressure, or ethical conflict. The safety of LLM-based systems under these conditions remains insufficiently characterized.

Methods. We conducted a large-scale comparative benchmark of 12 language models across more than 58,000 automated consultation logs generated within a three-role multi-agent architecture (Clinician, Expert, Judge). Clinical inference was performed locally on consumer-grade hardware. Evaluation was conducted by two independent LLM-based judges: a local GPT-OSS model (35,149 evaluations) and the Gemini API (23,300 evaluations). Testing covered 3 ethical profiles, 4 case types, and 9 clinical domains. The final GOLDEN dataset (n=3,680) was assembled using four quality criteria with priority given to the stricter judge.

Results. Model accuracy ranged from 11.1% to 77.0%. Specialized medical models systematically underperformed general-purpose models (11.1–13.2% vs. 70–77%). We identified an ethics-memory dissociation phenomenon: memory failure rates increased from 1.8% in routine scenarios to 28.2% under ethical pressure (15.7×). Sycophancy under targeted pressure reached 40.5% vs. 8.1% in standard scenarios. The most restrictive ethical profile (strict_v1) demonstrated the lowest accuracy (45.9%), a paranoia-overfitting phenomenon quantified by the proposed GQI metric.

Conclusions. Clinical LLM safety is determined not by medical specialization or restrictiveness of ethical constraints, but by architectural resilience to social manipulation and memory integrity under pressure. The proposed case type taxonomy and GQI metric may serve as a methodological foundation for standardized clinical AI validation.
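
The abstract states that two judges scored the logs and that the stricter judge took priority, but it does not spell out the rule or the four quality criteria. The Python sketch below shows one possible reading under stated assumptions: the `Verdict` fields (`correct`, `safe`), the judge labels, and the `stricter` function are all hypothetical and are not taken from the paper. Only the two evaluation counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judge's assessment of a consultation log (fields are hypothetical)."""
    judge: str
    correct: bool  # assumed: was the clinical answer judged accurate?
    safe: bool     # assumed: was the behavior judged ethically acceptable?


def stricter(a: Verdict, b: Verdict) -> Verdict:
    """Return the verdict that passes fewer checks; ties go to the first one."""
    def passed(v: Verdict) -> int:
        return int(v.correct) + int(v.safe)
    return b if passed(b) < passed(a) else a


# Evaluation counts reported in the abstract for the two judges.
local_evaluations = 35_149
api_evaluations = 23_300
total_evaluations = local_evaluations + api_evaluations

# Illustrative disagreement: the second judge flags a safety problem.
local = Verdict(judge="local", correct=True, safe=True)
api = Verdict(judge="api", correct=True, safe=False)
final = stricter(local, api)

print(total_evaluations)  # 58449
print(final.judge)        # api
```

The sum of the two counts, 58,449, is consistent with the "more than 58,000" logs the abstract reports. The tie-breaking choice here is arbitrary; the actual selection criteria would need to be taken from the full paper.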
