What question did this study set out to answer?

This study aims to assess and synthesize the existing bias evaluations in studies utilizing large language models in health care settings.

June 17, 2026

Large language model biases in health care: a scoping review and call for an integrated assessment framework

Key Points

This study aims to assess and synthesize the existing bias evaluations in studies utilizing large language models in health care settings.
The authors conducted a scoping review following PRISMA guidelines, querying PubMed and Scopus.
Two annotators screened titles, abstracts, and full texts, calibrating their assessments throughout the process.
Data on LLM characteristics, natural language processing tasks, and types of biases were extracted and summarized.
Of 1585 records retrieved, 76 studies met the eligibility criteria for full review, and 59 of these reported identifying bias.
Identified major types of biases included behavioral output bias, predictive outcome bias, and representational bias.
The authors introduce an integrated framework that combines accuracy and parity benchmarks for more robust bias assessment.

Abstract

OBJECTIVES: To conduct a scoping review of bias assessment in studies applying large language models (LLMs) to health data and to synthesize their prevailing conceptualization of bias.

MATERIALS AND METHODS: Following PRISMA guidelines, we queried PubMed and Scopus. Two annotators screened titles, abstracts, and full texts for eligibility, calibrating their assessments throughout the process. For included studies, we extracted and summarized data on LLMs (name and version, development domain, open- or closed-sourced status, and commercial or academic origin), natural language processing tasks (task formulation, gold-standard dataset, evaluation metrics, prompting or fine-tuning strategies), and biases (type, assessment, and bias summary).

RESULTS: Of the 1585 records retrieved, 76 papers met the eligibility criteria for full review. Among these, 59 reported identifying bias. Three major conceptualizations of bias emerged: behavioral output bias (nonstereotyping and stereotyping), predictive outcome bias, and representational bias. Studies generally adopted an observational approach (measuring bias using the existing dataset) or an experimental approach (altering prompts, eg, with different demographic information, and comparing outputs).

DISCUSSION AND CONCLUSION: Behavioral output bias and predictive outcome bias, both of which emphasize parity, dominate existing studies. Whether evaluated against external accuracy or internal equality benchmarks, these approaches often assume that equal performance across groups is inherently desirable. Treating all disparities as bias risks conflating poor model behavior with real-world disparities, and researchers should remain aware of potential tradeoffs between parity and accuracy objectives. We introduce an integrated framework that combines parity and accuracy benchmarks and encourages transparent, context-aware interpretation of group differences.
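The distinction between parity and accuracy benchmarks can be illustrated with a small sketch. The code below is not taken from the study; the function names (`group_report`, `parity_gap`), the record layout, and the toy data are all hypothetical. It assumes binary predictions and gold labels, and it reports per-group accuracy (an accuracy benchmark) alongside the gap in positive-output rate between groups (a parity benchmark).

```python
from collections import defaultdict


def group_report(records):
    """Summarize accuracy and positive-output rate per demographic group.

    `records` is an iterable of (group, prediction, gold_label) tuples,
    a hypothetical layout chosen for this illustration.
    """
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "positive": 0})
    for group, pred, gold in records:
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(pred == gold)   # agreement with the gold standard
        s["positive"] += int(pred == 1)     # how often the model outputs 1
    return {
        g: {
            "accuracy": s["correct"] / s["n"],
            "positive_rate": s["positive"] / s["n"],
        }
        for g, s in stats.items()
    }


def parity_gap(report, key):
    """Largest between-group difference for one metric (0 means parity)."""
    values = [metrics[key] for metrics in report.values()]
    return max(values) - min(values)


# Toy data (invented): the gold labels themselves differ between groups.
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
    ("B", 0, 0), ("B", 0, 0), ("B", 1, 1), ("B", 0, 0),
]

report = group_report(records)
for group, metrics in sorted(report.items()):
    print(group, metrics)
print("positive-rate gap:", parity_gap(report, "positive_rate"))
print("accuracy gap:", parity_gap(report, "accuracy"))
```

In this toy data the model outputs a positive label more often for group A than for group B, yet it is fully accurate for group B, whose gold labels are mostly negative. A parity-only reading would flag the output gap as bias, while reading both metrics together suggests that part of the gap reflects a difference in the underlying data, which is the kind of context-aware interpretation the abstract calls for.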
