What is the clinical evidence from this study?

Study design: Other. Population: Psychiatric disorders (Depression, PTSD) (n=306). Intervention: Med-PaLM 2 vs. human clinical raters. Primary outcome: Prediction of depression scores (PHQ-8); predicted scores did not differ significantly from human clinical raters (p=0.23).

August 3, 2023. Open Access

The Capability of Large Language Models to Measure Psychiatric Functioning

Key Result

Med-PaLM 2 predicted depression scores from clinical interviews with 80-84% accuracy, yielding scores statistically indistinguishable from human clinical raters.

Structured PICO

Does Med-PaLM 2 accurately predict psychiatric functioning and diagnoses from clinical interviews and case descriptions compared to human raters?

Population

n=145 depression assessments (PHQ-8) and n=115 PTSD assessments (PCL-C) from the DAIC-WOZ corpus, plus n=46 clinical case studies from DSM-5 Clinical Cases across high prevalence/high comorbidity disorders.

Intervention

Med-PaLM 2 (a large language model trained on medical knowledge) prompted to extract estimated clinical scores and diagnoses from patient interviews and clinical descriptions.

Comparator

Human clinical raters and ground truth DSM-5 diagnoses.

Outcome

Accuracy of predicting depression (PHQ-8) and PTSD (PCL-C) scores and clinical cutoffs, and accuracy of diagnostic categorization.
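As an illustration of what "accuracy at a clinical cutoff" can mean, the following minimal sketch compares model-estimated and rater-assigned PHQ-8 scores after thresholding both at a cutoff. The function name `cutoff_accuracy`, the example scores, and the cutoff value of 10 (a commonly used PHQ-8 threshold) are assumptions for illustration; this summary does not state the cutoff or the scoring procedure the study used.

```python
from typing import Sequence

# Assumption: 10 is a commonly used PHQ-8 threshold; the summary does not
# state which cutoff the study applied.
PHQ8_CUTOFF = 10


def cutoff_accuracy(
    predicted: Sequence[int],
    reference: Sequence[int],
    cutoff: int = PHQ8_CUTOFF,
) -> float:
    """Fraction of cases where both scores fall on the same side of the cutoff."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one pair of scores is required")
    agreements = sum(
        (p >= cutoff) == (r >= cutoff) for p, r in zip(predicted, reference)
    )
    return agreements / len(predicted)


# Made-up scores for illustration only; these are NOT data from the study.
model_scores = [4, 12, 9, 15, 2, 11]
rater_scores = [5, 10, 11, 14, 3, 8]

accuracy = cutoff_accuracy(model_scores, rater_scores)
print(f"Agreement at cutoff {PHQ8_CUTOFF}: {accuracy:.2f}")
```

Thresholding both scores before comparing them measures agreement on the clinical decision rather than on the exact score, so two raters can differ by a few points and still count as agreeing.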

Med-PaLM 2 demonstrates the emergent capability to accurately assess depression severity from clinical interviews, though performance varies across other psychiatric conditions like PTSD.

Main Result

Absolute Event Rate: 8.5% vs 7.94%

p-value: 0.23

Limitations

Relatively small datasets and limited use cases
Limited to English only
Demographically narrow data sources for testing
Inconsistent performance in identifying comorbidities and diagnostic modifiers

Abstract

The current work investigates the capability of large language models (LLMs) that are explicitly trained on large corpora of medical knowledge (Med-PaLM 2) to predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression and n = 115 PTSD assessments and n = 46 clinical case studies across high prevalence/high comorbidity disorders (depressive, anxiety, psychotic, trauma and stress, and addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions, with the strongest performance being the prediction of depression scores based on standardized assessments (accuracy range = 0.80 to 0.84), which were statistically indistinguishable from human clinical raters, t(1,144) = 1.20; p = 0.23. Results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.
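The reported test statistic and p-value can be checked against each other with a short calculation. The sketch below assumes that the 144 in the reported statistic denotes degrees of freedom (145 depression assessments minus one), which this summary does not confirm, and it uses a normal approximation to the t distribution, which is close at that many degrees of freedom. The helper name `two_sided_p_from_t` is hypothetical.

```python
from statistics import NormalDist

T_STAT = 1.20          # t statistic reported in the abstract
N_ASSESSMENTS = 145    # depression assessments reported in the abstract
DF = N_ASSESSMENTS - 1  # assumption: degrees of freedom for a paired comparison


def two_sided_p_from_t(t_stat: float) -> float:
    """Two-sided p-value using a normal approximation to the t distribution.

    The approximation is reasonable only when the degrees of freedom are
    large; an exact value would require the t distribution itself.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


p_approx = two_sided_p_from_t(T_STAT)
print(f"df = {DF}, approximate two-sided p = {p_approx:.2f}")
```

Under these assumptions the approximate p-value rounds to the reported 0.23, so the two figures in the abstract are consistent with each other.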
