Objectives. To investigate the extent to which Large Language Models (LLMs) exhibit social bias based on protected patient attributes and to determine how design choices, such as architecture and prompting strategies, influence these observed biases in clinical decision support.

Methods. We evaluated eight popular LLMs, including general-purpose and clinically trained models, across three standardized question-answering datasets using clinical vignettes. We employed red-teaming strategies to analyze the impact of demographics on LLM outputs and compared various prompting techniques, including Zero-shot and Chain of Thought.

Results. Our experiments reveal various disparities across protected groups. Notably, larger models were not necessarily less biased, and medical fine-tuning did not consistently outperform general-purpose models. Furthermore, specific prompt phrasing significantly influenced bias patterns, whereas reflection-type approaches like Chain of Thought effectively reduced biased outcomes.

Conclusions. LLMs demonstrate significant social biases in clinical scenarios that are influenced by model architecture and prompt engineering. These findings highlight the critical need for rigorous evaluation and enhancement of LLMs before their integration into clinical decision support systems. Consistent with prior studies, we call for additional scrutiny to ensure equity in AI-driven healthcare applications.

All code and data are available at https://github.com/healthylaife/FairCDSLLM. Doi: 10.
Poulain et al. studied this question.