What question did this study set out to answer?

This research evaluates implicit biases in large language models (LLMs) during psychiatric diagnoses, aiming to assess their impact on accuracy and fairness in clinical settings.

March 10, 2026 | Open Access

Implicit Gender, Racial, and Ethnic Biases in Large Language Models: An Audit Study of Automated Psychiatric Diagnoses

Key Points

Conducted a large-scale audit of six language models including general-purpose and medical-specific models.
Used 97 DSM-5 psychiatric training cases altered for gender and racial/ethnic identities across 39 demographic groups.
Assessed diagnostic accuracy, additional or missed diagnoses, and diagnostic reasoning language.
Identified GPT-4o as the most accurate model and selected it for deeper analysis.
GPT-4o provided at least one correct diagnosis in 82.8% of cases but added non-ground-truth diagnoses in 70.3% of cases, indicating overdiagnosis.
Accuracy varied by gender: it was higher for female patients and lower for non-binary individuals.
Overall accuracy did not differ significantly by race/ethnicity, but biased diagnostic patterns emerged; for example, cultural bereavement and antisocial behavior were diagnosed exclusively in patients of color.
Terms like "disruptive" were used more frequently for Black men.
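
The case-alteration step described above (changing names, pronouns, and descriptors while holding symptoms constant) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the vignette text, the `Persona` fields, the three example personas, and the helper names `render` and `make_variants` are all hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical demographic variant (all values are illustrative)."""
    name: str
    descriptor: str  # e.g. "Black man"
    subj: str        # subject pronoun, e.g. "he"
    poss: str        # possessive pronoun, e.g. "his"


# Hypothetical vignette, not one of the study's cases. Past-tense verbs are
# used so the text stays grammatical with singular "they".
TEMPLATE = (
    "{name} is a 34-year-old {descriptor}. {Subj} reported low mood and "
    "poor sleep for six weeks and said {poss} appetite had declined."
)

# Illustrative personas only; the study covered 39 demographic groups.
PERSONAS = [
    Persona("Emily", "White woman", "she", "her"),
    Persona("Marcus", "Black man", "he", "his"),
    Persona("Alex", "non-binary Asian person", "they", "their"),
]


def render(template: str, persona: Persona) -> str:
    """Fill only the identity slots; the symptom wording is left untouched."""
    return template.format(
        name=persona.name,
        descriptor=persona.descriptor,
        Subj=persona.subj.capitalize(),
        poss=persona.poss,
    )


def make_variants(template: str, personas: list) -> dict:
    """Return one rendered vignette per persona, keyed by descriptor."""
    return {p.descriptor: render(template, p) for p in personas}
```

Because the identity cues live in explicit slots, any difference in a model's output across variants can be attributed to those cues rather than to changes in the clinical presentation. Free-text cases would need a more careful rewriting step than this template approach.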

Abstract

Objective: To evaluate whether large language models (LLMs) exhibit implicit gender, racial, and ethnic biases when used to provide psychiatric diagnoses, and to understand how such biases may affect the accuracy and fairness of AI-assisted clinical decision support.

Patients and Methods: We conducted a large-scale audit of six LLMs, including general-purpose and medical-specific models, using 97 DSM-5 psychiatric training cases. The audit was conducted between October 1, 2023, and June 23, 2025. Cases were systematically altered to suggest different gender and racial/ethnic identities across 39 demographic groups by changing names, pronouns, and descriptors while keeping clinical symptoms constant. We assessed diagnostic accuracy, additional or missed diagnoses, and the language of diagnostic reasoning. GPT-4o emerged as the most accurate model and was selected for deeper analysis.

Results: GPT-4o identified at least one correct diagnosis in 82.8% of cases but often added non-ground-truth diagnoses (70.3% of cases), suggesting a tendency to overdiagnose. Accuracy varied by gender, with higher performance for female patients and lower performance for non-binary individuals. Although overall accuracy did not differ significantly by race/ethnicity, biased diagnostic patterns emerged: for example, cultural bereavement and antisocial behavior were diagnosed exclusively in patients of color, and terms like "disruptive" were used more frequently for Black men.

Conclusion: Our findings demonstrate that LLMs reproduce and reinforce clinical biases even when symptoms are constant. AI-based tools must be audited not only for accuracy but also for bias in both diagnoses and explanatory language, especially when used in high-stakes mental health contexts.
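
The two headline rates in the Results (at least one correct diagnosis, and at least one non-ground-truth diagnosis) could be computed along the following lines. This is a sketch under assumptions rather than the authors' scoring method: it treats diagnoses as exact-match labels, which is a simplification, and the `CaseResult` record and the function names are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One model response for one demographic variant (hypothetical schema)."""
    group: str                 # demographic group label
    truth: frozenset[str]      # ground-truth diagnoses for the case
    predicted: frozenset[str]  # diagnoses returned by the model


def has_correct(result: CaseResult) -> bool:
    """True if the model named at least one ground-truth diagnosis."""
    return bool(result.truth & result.predicted)


def has_extra(result: CaseResult) -> bool:
    """True if the model named any diagnosis outside the ground truth."""
    return bool(result.predicted - result.truth)


def rates(results: list[CaseResult]) -> dict[str, float]:
    """Share of responses with a correct hit and with an extra diagnosis."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    return {
        "at_least_one_correct": sum(has_correct(r) for r in results) / n,
        "any_extra_diagnosis": sum(has_extra(r) for r in results) / n,
    }


def rates_by_group(results: list[CaseResult]) -> dict[str, dict[str, float]]:
    """Compute the same two rates separately for each demographic group."""
    grouped: dict[str, list[CaseResult]] = defaultdict(list)
    for r in results:
        grouped[r.group].append(r)
    return {group: rates(items) for group, items in grouped.items()}
```

Comparing the per-group output of `rates_by_group` against the overall rates is one simple way to surface accuracy gaps of the kind reported for gender. Testing whether a gap is statistically significant would require a separate analysis.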
