OBJECTIVES: To conduct a scoping review of bias assessment in studies applying large language models (LLMs) to health data and to synthesize their prevailing conceptualizations of bias. MATERIALS AND METHODS: Following PRISMA guidelines, we queried PubMed and Scopus. Two annotators screened titles, abstracts, and full texts for eligibility, calibrating their assessments throughout the process. For included studies, we extracted and summarized data on LLMs (name and version, development domain, open- or closed-source status, and commercial or academic origin), natural language processing tasks (task formulation, gold-standard dataset, evaluation metrics, prompting or fine-tuning strategies), and biases (type, assessment method, and summary of findings). RESULTS: Of the 1585 records retrieved, 76 papers met the eligibility criteria for full review. Among these, 59 reported identifying bias. Three major conceptualizations of bias emerged: behavioral output bias (nonstereotyping and stereotyping), predictive outcome bias, and representational bias. Studies generally adopted either an observational approach (measuring bias using an existing dataset) or an experimental approach (altering prompts, eg, with different demographic information, and comparing outputs). DISCUSSION AND CONCLUSION: Behavioral output bias and predictive outcome bias, both of which emphasize parity, dominate existing studies. Whether evaluated against external accuracy benchmarks or internal equality benchmarks, these approaches often assume that equal performance across groups is inherently desirable. Treating all disparities as bias risks conflating poor model behavior with real-world disparities, and researchers should remain aware of potential tradeoffs between parity and accuracy objectives. We introduce an integrated framework that combines parity and accuracy benchmarks and encourages transparent, context-aware interpretation of group differences.
He et al. studied this question.
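The observational and experimental approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the method of any reviewed study: the function names, the toy records, the prompt template, and `toy_model` (a rule-based stand-in for an LLM call) are all hypothetical and chosen only for illustration. The observational function reports per-group accuracy alongside the parity gap, so that a group difference can be read against an accuracy benchmark rather than treated as bias by default.

```python
from collections import defaultdict


def group_accuracy(records):
    """Observational approach: per-group accuracy on an existing labeled dataset.

    records: iterable of (group, prediction, gold_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, pred, gold in records:
        total[group] += 1
        correct[group] += int(pred == gold)
    return {g: correct[g] / total[g] for g in total}


def parity_gap(per_group):
    """Largest difference in a metric between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


def counterfactual_flip_rate(model, template, groups, cases):
    """Experimental approach: vary only the demographic descriptor in the
    prompt and count the cases where the output differs across groups."""
    flips = 0
    for case in cases:
        outputs = {model(template.format(group=g, case=case)) for g in groups}
        flips += int(len(outputs) > 1)
    return flips / len(cases)


def toy_model(prompt):
    """Hypothetical stand-in for an LLM call; deliberately group-sensitive."""
    if "chest pain" in prompt and "group A" in prompt:
        return "urgent"
    return "routine"


# Toy data, invented for illustration only.
RECORDS = [
    ("A", "pos", "pos"), ("A", "neg", "neg"), ("A", "pos", "neg"), ("A", "neg", "neg"),
    ("B", "pos", "pos"), ("B", "neg", "pos"), ("B", "neg", "neg"), ("B", "pos", "neg"),
]
TEMPLATE = "A patient from {group} reports {case}. Triage level?"

per_group = group_accuracy(RECORDS)
gap = parity_gap(per_group)
flip_rate = counterfactual_flip_rate(
    toy_model, TEMPLATE, ["group A", "group B"], ["chest pain", "headache"]
)
```

In this toy setup, `per_group` and `gap` correspond to the observational reading and `flip_rate` to the experimental one. Whether a nonzero gap reflects poor model behavior or a real-world disparity still requires context-aware interpretation.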