May 14, 2026 · Open Access

Evaluating Conversational AI Recognition of Clinically Meaningful State Transitions in Suicide-Risk Dialogue: A Multi-Tier Rubric Approach with Independent Mechanical Severity Triangulation

Key Points

The study aims to evaluate how well conversational AI can recognize clinically meaningful state transitions in suicide-risk dialogue.
Introduces a multi-tier evaluation methodology combining stage-aware scenario specification with per-turn reference annotation.
Applies a three-tier scoring rubric that assesses AI responses at the turn, run, and expert-review levels.
Computes a Mechanical Severity Score (MSS) with a deterministic rule-based procedure, independent of the holistic AI-derived run-level concern rating, to surface evaluator drift.
Depending on rubric version, 49% to 77% of marker turns received responses scoring at or below the inadequate threshold for transition recognition.
The combined inadequate-response rate was 57%.
The study highlights significant gaps in AI's ability to recognize nuanced indicators of suicide risk.

Abstract

Conversational artificial intelligence (AI) systems are increasingly deployed in contexts where users disclose, directly or indirectly, signals of suicide risk. Existing safety evaluation approaches typically rely on single-evaluator AI scoring, content-moderation classifiers focused on explicit unsafe text, or aggregate rubrics that do not measure recognition of clinically meaningful state transitions in indirect, compressed, denied, or socially smoothed language. This paper introduces a multi-tier clinical evaluation methodology combining (i) stage-aware scenario specification with per-turn reference annotation, (ii) a three-tier scoring rubric operating at turn, run, and expert-review levels, (iii) an independent Mechanical Severity Score (MSS) computed by a deterministic rule-based procedure from structured per-turn observation fields, executed independently of the holistic AI-derived run-level concern rating and providing a triangulating signal that surfaces evaluator drift, and (iv) an integrated routing and review pipeline with a model-blinded human expert review interface. The methodology is informed by multiple theoretical anchors including the Salient Distress Model of Suicide, the Narrative-Crisis Model, the Three-Step Theory of Suicide, and the Collaborative Assessment and Management of Suicidality (CAMS) framework. Across 2,644 reference-annotated marker turns within 1,759 multi-turn AI evaluation runs (1,679 complete) spanning 31+ active models on 53 scenarios, 49% to 77% of marker turns received responses scoring at or below the inadequate threshold for transition recognition, depending on rubric version (combined inadequate-response rate 57%).

Keywords: AI safety, clinical evaluation, suicide-risk detection, conversational AI, large language models, clinical rubric, evaluator drift, mechanical severity score, Salient Distress Model
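The triangulation idea in the abstract, a deterministic score computed from structured per-turn fields and compared against a holistic run-level rating, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the field names (`marker_present`, `transition_acknowledged`, `risk_inquiry_made`), the penalty weights, the normalization to a 0 to 1 scale, and the `tolerance` threshold are all hypothetical. The paper's actual MSS rules may differ substantially.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TurnObservation:
    """Structured per-turn observation fields (hypothetical names)."""
    marker_present: bool            # reference annotation marks a state transition here
    transition_acknowledged: bool   # the AI response recognized the transition
    risk_inquiry_made: bool         # the AI response asked about risk or safety


# Hypothetical penalty weights; the paper's real rules are not given in the abstract.
MISSED_TRANSITION_PENALTY = 2
NO_INQUIRY_PENALTY = 1
MAX_TURN_PENALTY = MISSED_TRANSITION_PENALTY + NO_INQUIRY_PENALTY


def turn_severity(obs: TurnObservation) -> int:
    """Deterministic per-turn penalty; non-marker turns contribute nothing."""
    if not obs.marker_present:
        return 0
    penalty = 0
    if not obs.transition_acknowledged:
        penalty += MISSED_TRANSITION_PENALTY
    if not obs.risk_inquiry_made:
        penalty += NO_INQUIRY_PENALTY
    return penalty


def mechanical_severity_score(turns: Iterable[TurnObservation]) -> float:
    """Mean penalty over marker turns, normalized to the range 0 to 1."""
    marker_turns = [t for t in turns if t.marker_present]
    if not marker_turns:
        return 0.0
    total = sum(turn_severity(t) for t in marker_turns)
    return total / (MAX_TURN_PENALTY * len(marker_turns))


def drift_flag(mss: float, holistic_rating: float, tolerance: float = 0.25) -> bool:
    """Flag a run when the rule-based score and the holistic rating
    (assumed here to share a 0 to 1 scale) disagree by more than `tolerance`."""
    return abs(mss - holistic_rating) > tolerance


# Illustrative run with two marker turns and one ordinary turn.
run = [
    TurnObservation(marker_present=True, transition_acknowledged=False, risk_inquiry_made=False),
    TurnObservation(marker_present=False, transition_acknowledged=False, risk_inquiry_made=False),
    TurnObservation(marker_present=True, transition_acknowledged=True, risk_inquiry_made=False),
]
mss = mechanical_severity_score(run)
needs_review = drift_flag(mss, holistic_rating=0.2)
```

Because the score depends only on the structured fields, re-running it on the same observations always yields the same value, so a large gap between it and the holistic rating points to the evaluator rather than to the rule. In this sketch such runs would be candidates for routing to human expert review.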

Authors

Laura L. Walsh

Walsh University

Institutions

Walsh University

Metacomp Technologies (United States)

Walsh College
