Abstract

Objective: To evaluate whether large language models (LLMs) exhibit implicit gender, racial, and ethnic biases when used to provide psychiatric diagnoses, and to understand how such biases may affect the accuracy and fairness of AI-assisted clinical decision support.

Patients and Methods: We conducted a large-scale audit of six LLMs, including general-purpose and medical-specific models, using 97 DSM-5 psychiatric training cases. The audit ran from October 1, 2023, to June 23, 2025. Each case was systematically altered to suggest different gender and racial/ethnic identities, spanning 39 demographic groups, by changing names, pronouns, and descriptors while keeping clinical symptoms constant. We assessed diagnostic accuracy, additional or missed diagnoses, and the language of diagnostic reasoning. GPT-4o emerged as the most accurate model and was selected for deeper analysis.

Results: GPT-4o identified at least one correct diagnosis in 82.8% of cases but often added diagnoses not in the ground truth (70.3% of cases), suggesting a tendency to overdiagnose. Accuracy varied by gender, with higher performance for female patients and lower performance for non-binary individuals. Although overall accuracy did not differ significantly by race/ethnicity, biased diagnostic patterns emerged: for example, cultural bereavement and antisocial behavior were diagnosed exclusively in patients of color, and terms like "disruptive" were used more frequently for Black men.

Conclusion: Our findings demonstrate that LLMs reproduce and reinforce clinical biases even when symptoms are held constant. AI-based tools must be audited not only for accuracy but also for bias in both diagnoses and explanatory language, especially when used in high-stakes mental health contexts.
Pendse et al. (Sun,) studied this question.
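The two headline metrics in the abstract (whether at least one ground-truth diagnosis was identified, and whether any diagnosis outside the ground truth was added) can be sketched per demographic group as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the function names `score_case` and `rates_by_group`, the group labels, and the use of case-insensitive exact string matching between diagnoses are all assumptions made for the example.

```python
from collections import defaultdict


def score_case(predicted, ground_truth):
    """Score one model response against one case (hypothetical helper).

    Returns (hit, extra):
      hit   -- True if at least one predicted diagnosis is in the ground truth
      extra -- True if at least one predicted diagnosis is NOT in the ground truth
    Matching is case-insensitive exact string matching, which is an assumption;
    a real audit would likely need clinical review or code-based matching.
    """
    pred = {d.strip().lower() for d in predicted}
    truth = {d.strip().lower() for d in ground_truth}
    hit = bool(pred & truth)
    extra = bool(pred - truth)
    return hit, extra


def rates_by_group(records):
    """Aggregate hit and extra-diagnosis rates per demographic group.

    Each record is a dict with keys "group", "predicted", "ground_truth"
    (an assumed layout for this sketch).
    """
    counts = defaultdict(lambda: {"n": 0, "hit": 0, "extra": 0})
    for r in records:
        hit, extra = score_case(r["predicted"], r["ground_truth"])
        c = counts[r["group"]]
        c["n"] += 1
        c["hit"] += hit      # bool adds as 0 or 1
        c["extra"] += extra
    return {
        g: {"accuracy": c["hit"] / c["n"], "extra_rate": c["extra"] / c["n"]}
        for g, c in counts.items()
    }


# Toy, made-up records purely to show the call shape (not study data).
records = [
    {"group": "A",
     "predicted": ["Major Depressive Disorder", "Generalized Anxiety Disorder"],
     "ground_truth": ["major depressive disorder"]},
    {"group": "A",
     "predicted": ["Bipolar I Disorder"],
     "ground_truth": ["Schizophrenia"]},
    {"group": "B",
     "predicted": ["Schizophrenia"],
     "ground_truth": ["Schizophrenia"]},
]
rates = rates_by_group(records)
```

Because every demographic variant of a case shares the same symptoms and ground truth, differences in these rates between groups reflect only the changed names, pronouns, and descriptors. Whether a difference is statistically meaningful would need a separate significance test, which this sketch does not attempt.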