What question did this study set out to answer?

This review aims to analyze the risks and mitigation strategies related to hallucinations in large language models used in medical contexts.

April 11, 2026

Hallucinations of Large Language Models in Medical Environments: A Systematic Review of Risks, Detection, and Mitigation

Key Points

Conducted a systematic literature review of relevant studies published from 2023 to 2025.
Evaluated risks, detection methods, and mitigation strategies for hallucinations in medical settings.
Developed a novel evaluative framework, CR², to guide risk-aware adoption of LLMs.
Confirmed that hallucination is inherent in autoregressive text generation under uncertainty.
Found that hybrid control integrating multiple strategies is crucial for trustworthy deployment of LLMs.
Identified open challenges, including the need for multilingual generalizability and operational governance.

Abstract

Large language models (LLMs) are demonstrating transformative potential in medical informatics, assisting with tasks ranging from diagnostic reasoning to patient communication. However, their propensity to generate confident yet unfounded outputs, termed hallucinations, poses significant risks to patient safety and clinical accountability. This paper presents a systematic literature review of research from 2023 to 2025, analyzing the risks, benchmarks, detection paradigms, and mitigation strategies associated with medical hallucinations. We synthesize our findings into a novel evaluative framework, CR² (Capability × Reliability × Cost × Clinical Risk), designed to guide risk-aware adoption. Our analysis confirms that hallucination is a structural property of autoregressive text generation under uncertainty. Consequently, we argue that hybrid control (integrating retrieval grounding, verification mechanisms, calibrated generation, and human oversight) constitutes the most credible path toward trustworthy deployment. The review concludes by identifying critical open challenges, including the need for harm-weighted evaluation, multilingual generalisability, and operational governance mechanisms.
