Conversational artificial intelligence (AI) systems are increasingly deployed in contexts where users disclose, directly or indirectly, signals of suicide risk. Existing safety evaluation approaches typically rely on single-evaluator AI scoring, content-moderation classifiers focused on explicit unsafe text, or aggregate rubrics that do not measure recognition of clinically meaningful state transitions in indirect, compressed, denied, or socially smoothed language. This paper introduces a multi-tier clinical evaluation methodology combining (i) stage-aware scenario specification with per-turn reference annotation, (ii) a three-tier scoring rubric operating at turn, run, and expert-review levels, (iii) an independent Mechanical Severity Score (MSS) computed by a deterministic rule-based procedure from structured per-turn observation fields, executed independently of the holistic AI-derived run-level concern rating and providing a triangulating signal that surfaces evaluator drift, and (iv) an integrated routing and review pipeline with a model-blinded human expert review interface. The methodology is informed by multiple theoretical anchors including the Salient Distress Model of Suicide, the Narrative-Crisis Model, the Three-Step Theory of Suicide, and the Collaborative Assessment and Management of Suicidality (CAMS) framework. Across 2,644 reference-annotated marker turns within 1,759 multi-turn AI evaluation runs (1,679 complete) spanning 31+ active models on 53 scenarios, 49% to 77% of marker turns received responses scoring at or below the inadequate threshold for transition recognition, depending on rubric version (combined inadequate-response rate 57%).
Keywords: AI safety, clinical evaluation, suicide-risk detection, conversational AI, large language models, clinical rubric, evaluator drift, mechanical severity score, Salient Distress Model
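The triangulation idea in (iii) can be illustrated with a minimal sketch. It is not the paper's actual MSS procedure: the observation fields, rule weights, and cutoffs below are hypothetical placeholders, chosen only to show how a deterministic score computed from structured per-turn fields can be compared against an independently produced holistic rating to surface disagreement.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TurnObservation:
    # Hypothetical structured per-turn fields; the paper's real schema is not given here.
    recognized_transition: bool
    asked_about_safety: bool
    minimized_disclosure: bool


def mechanical_severity(turns: Iterable[TurnObservation]) -> int:
    """Deterministic rule-based score: the same observations always give the same value.

    The weights are illustrative assumptions, not the published rules.
    """
    score = 0
    for turn in turns:
        if not turn.recognized_transition:
            score += 2
        if not turn.asked_about_safety:
            score += 1
        if turn.minimized_disclosure:
            score += 2
    return score


def flag_for_review(mss: int, holistic_rating: int,
                    mss_cutoff: int = 3, rating_cutoff: int = 2) -> bool:
    """Flag a run when the two independently derived signals disagree.

    Both cutoffs are assumed values used only for this illustration.
    """
    mss_concerned = mss >= mss_cutoff
    holistic_concerned = holistic_rating >= rating_cutoff
    return mss_concerned != holistic_concerned


run = [
    TurnObservation(recognized_transition=False, asked_about_safety=False,
                    minimized_disclosure=True),
    TurnObservation(recognized_transition=True, asked_about_safety=True,
                    minimized_disclosure=False),
]
mss = mechanical_severity(run)
needs_review = flag_for_review(mss, holistic_rating=0)
```

Because the rule-based score never consults the holistic rating, a run where the two disagree is a candidate for the kind of expert review the pipeline routes to; agreement between them is weak evidence that the evaluator has not drifted.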
Laura L. Walsh
Walsh University
Metacomp Technologies (United States)
Walsh College
Laura L. Walsh studied this question.
synapsesocial.com/papers/6a056714a550a87e60a1f013 — DOI: https://doi.org/10.5281/zenodo.20147000