BACKGROUND Suicide risk assessment is essential but often limited by time, scalability, and subjective judgment. Large language models (LLMs) show promise in supporting psychiatric decision-making, yet their safety, accuracy, and reliability, especially in crisis contexts, remain underexplored.

OBJECTIVE To evaluate the performance of leading LLMs in classifying suicide risk and generating clinically appropriate action plans for adolescent psychiatric cases presented as synthetic clinical vignettes.

METHODS We developed 40 synthetic clinical vignettes depicting adolescents with varying levels of suicide risk, structured according to established clinical formulation principles. For each vignette, a panel of two board-certified child and adolescent psychiatrists established a gold standard for risk level, based on the Columbia-Suicide Severity Rating Scale (C-SSRS) framework, and for the corresponding clinical actions. Three LLMs (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B) were prompted using a structured chain-of-thought methodology to classify risk and propose a detailed action plan. Performance was assessed using quantitative classification metrics (accuracy, precision, recall, F1-score) and qualitative thematic analysis of the generated action plans.

RESULTS Risk classification performance varied across models. GPT-4o achieved the highest accuracy (82.5%), followed by Claude 3.5 Sonnet (75.0%) and Llama-3.1-70B (67.5%). F1-scores showed that the models had difficulty correctly identifying higher-risk categories, particularly for nuanced presentations of intent. Qualitative thematic analysis of the action plans identified consistent adherence to basic safety protocols (e.g., recommending emergency evaluation for high-risk cases). However, critical failures were pervasive, including the omission of crucial inquiries about access to lethal means, failure to incorporate protective factors into planning, and the generation of clinically inappropriate therapeutic reassurance in a triage context.

CONCLUSIONS While LLMs demonstrate a nascent ability to process clinical information for suicide risk assessment, significant deficits in clinical reasoning and safety planning persist. Their performance on idealized synthetic data suggests these models are not yet suitable for autonomous clinical decision-making. These findings underscore the need for rigorous, clinically grounded evaluation frameworks and for human-in-the-loop systems to ensure patient safety in any future deployment.
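
The classification metrics named in METHODS (accuracy, precision, recall, F1-score) can be illustrated with a minimal Python sketch. This is not the study's evaluation code: the function name `per_class_metrics`, the three risk labels, and the toy label lists are all hypothetical and chosen only to show how the per-class scores are derived. The one figure tied to the abstract is the arithmetic at the end: with 40 vignettes, an accuracy of 82.5% corresponds to 33 correct classifications.

```python
from collections import Counter


def per_class_metrics(gold, pred):
    """Return overall accuracy and {label: (precision, recall, f1)}."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted p, but it was wrong
            fn[g] += 1          # class g was missed

    accuracy = sum(tp.values()) / len(gold)

    metrics = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return accuracy, metrics


# Hypothetical toy labels for illustration only, NOT data from the study.
gold = ["low", "low", "moderate", "moderate", "high", "high", "high", "low"]
pred = ["low", "low", "moderate", "high", "high", "moderate", "high", "low"]

accuracy, metrics = per_class_metrics(gold, pred)

# Arithmetic from the abstract: 82.5% accuracy on 40 vignettes.
correct_at_82_5 = round(0.825 * 40)
```

In this toy example overall accuracy is 0.75, yet the per-class F1 for "high" is only about 0.67, which shows how a single accuracy figure can mask weaker performance on the higher-risk categories.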
Masab Mansoor
Baylor Jack and Jane Hamilton Heart and Vascular Hospital
Masab Mansoor (Tue,) studied this question.
synapsesocial.com/papers/68bb3a2b2b87ece8dc954a62 — DOI: https://doi.org/10.2196/preprints.82288