September 5, 2025

AI-Powered Triage of Suicidal Ideation in Adolescents: A Comparative Evaluation of Large Language Models Using Synthetic Clinical Vignettes (Preprint)

Key Points

Risk classification accuracy varied across the three models: GPT-4o scored highest at 82.5%, followed by Claude 3.5 Sonnet at 75.0% and Llama-3.1-70B at 67.5%.
F1-scores showed that the models struggled to identify higher-risk categories correctly, particularly for nuanced presentations of intent.
The models were tested on 40 synthetic clinical vignettes, each with a gold-standard risk level based on the C-SSRS framework and set by two board-certified child and adolescent psychiatrists.
Action plans often omitted inquiries about access to lethal means and ignored protective factors, indicating that LLMs are not yet suitable for autonomous clinical decision-making.

Abstract

BACKGROUND Suicide risk assessment is essential but often limited by time, scalability, and subjective judgment. Large language models (LLMs) show promise in supporting psychiatric decision-making, yet their safety, accuracy, and reliability, especially in crisis contexts, remain underexplored.

OBJECTIVE To evaluate the performance of leading LLMs in classifying suicide risk and generating clinically appropriate action plans for adolescent psychiatric cases presented through synthetic clinical vignettes.

METHODS We developed 40 synthetic clinical vignettes depicting adolescents with varying levels of suicide risk, structured according to established clinical formulation principles. A gold standard for risk level, based on the Columbia-Suicide Severity Rating Scale (C-SSRS) framework, and corresponding clinical actions was established for each vignette by a panel of two board-certified child and adolescent psychiatrists. Three LLMs (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B) were prompted using a structured chain-of-thought methodology to classify risk and propose a detailed action plan. Performance was assessed using quantitative classification metrics (accuracy, precision, recall, F1-score) and qualitative thematic analysis of the generated action plans.

RESULTS Quantitative analysis of risk classification revealed variable performance. GPT-4o achieved the highest accuracy (82.5%), followed by Claude 3.5 Sonnet (75.0%) and Llama-3.1-70B (67.5%). F1-scores demonstrated challenges in correctly identifying higher-risk categories, particularly for nuanced presentations of intent. Qualitative thematic analysis of the action plans identified consistent adherence to basic safety protocols (e.g., recommending emergency evaluation for high-risk cases). However, significant and critical failures were pervasive, including the omission of crucial inquiries about access to lethal means, failure to incorporate protective factors into planning, and the generation of clinically inappropriate therapeutic reassurance in a triage context.

CONCLUSIONS While LLMs demonstrate a nascent ability to process clinical information for suicide risk assessment, significant deficits in clinical reasoning and safety planning persist. Their performance on idealized synthetic data suggests these models are not yet suitable for autonomous clinical decision-making. These findings underscore the imperative for rigorous, clinically grounded evaluation frameworks and the development of human-in-the-loop systems to ensure patient safety in any future deployment.
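As a rough illustration of the quantitative metrics named in the methods (accuracy, precision, recall, F1-score), the sketch below computes them per risk category from paired gold-standard and model labels. It is not the authors' analysis code: the function names, the three-level label set (`low`, `moderate`, `high`), and the toy labels are all hypothetical. The helper `implied_correct` assumes that each reported accuracy is simply the fraction of the 40 vignettes classified correctly, in which case 82.5%, 75.0%, and 67.5% would correspond to 33, 30, and 27 vignettes.

```python
# Minimal sketch of multi-class classification metrics (not the study's code).
# The label set and the toy data below are hypothetical assumptions.
RISK_LEVELS = ["low", "moderate", "high"]


def classification_metrics(gold, predicted, labels):
    """Return overall accuracy and per-class precision, recall, and F1."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    per_class = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class


def implied_correct(accuracy_pct, n_vignettes=40):
    """Correct classifications implied by an accuracy percentage,
    assuming accuracy = correct / n_vignettes."""
    return round(accuracy_pct / 100 * n_vignettes)


# Toy example: eight invented vignettes, not data from the study.
gold = ["low", "low", "moderate", "moderate", "high", "high", "high", "low"]
pred = ["low", "low", "moderate", "high", "high", "moderate", "high", "low"]

accuracy, per_class = classification_metrics(gold, pred, RISK_LEVELS)
print(f"accuracy = {accuracy:.3f}")
for label in RISK_LEVELS:
    m = per_class[label]
    print(f"{label}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} f1={m['f1']:.2f}")

for pct in (82.5, 75.0, 67.5):
    print(f"{pct}% of 40 vignettes -> {implied_correct(pct)} correct")
```

Per-class F1 matters here because overall accuracy can look acceptable even when the high-risk category, which is the one that matters most in triage, is frequently misclassified.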

Authors

Masab Mansoor

Baylor Jack and Jane Hamilton Heart and Vascular Hospital
