Large language models (LLMs) are transforming health care by supporting a range of administrative and clinical tasks; however, recent studies have raised concerns about their potential to exacerbate existing health care inequities. Traditional algorithmic auditing approaches fall short in addressing the unique challenges posed by LLMs, which process complex text-based inputs and generate human-like outputs. In this perspective, we examine current approaches for evaluating LLM bias in clinical settings, identifying key gaps in existing audit methodologies. We propose comprehensive guidelines for categorizing and detecting biases in LLM applications and illustrate their application through two real-world deployed systems — in-basket patient response drafting and mental health chatbots. Finally, we offer concrete recommendations for advancing LLM bias evaluation in a rapidly evolving technological landscape.
Irene Y. Chen
Emily Alsentzer
NEJM AI
Stanford University
University of California, Berkeley
University of California, San Francisco
Chen et al. studied this question.
www.synapsesocial.com/papers/68c1d7ee54b1d3bfb60f9fd6 — DOI: https://doi.org/10.1056/aip2500015