What does this research mean for the field?

The automated VERA-MH safety evaluation, which uses a large language model judge, aligns strongly with expert clinical consensus when rating AI chatbot responses to suicide risk. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This study aims to validate the VERA-MH safety evaluation for detecting suicide risk in AI chatbots by comparing it to expert human clinician ratings.

June 10, 2026 | Open Access

AI Chatbot Suicide Risk Detection and Response: Human Validation of the Open-Source VERA-MH Safety Evaluation (Preprint)

Key points

This study aims to validate the VERA-MH safety evaluation for detecting suicide risk in AI chatbots by comparing it to expert human clinician ratings.
Simulated conversations were created between LLM-based user-agents and AI chatbots.
Licensed mental health clinicians rated these conversations based on a scoring rubric from VERA-MH for safety levels.
Alignment of ratings was examined among clinicians, clinician consensus, and LLM judges.
Clinicians showed strong inter-rater reliability in safety ratings (IRR: 0.77).
The LLM judge's ratings aligned well with clinician consensus (IRR: 0.81).
Results indicate a need for ongoing evaluation of AI chatbot safety and the reliability of VERA-MH.

Abstract

BACKGROUND: Millions of people now use leading generative AI tools (chatbots) for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The field currently lacks a validated, automated benchmark for determining AI chatbot safety in mental health, including for users at risk of suicide. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet this urgent need.

OBJECTIVE: This human validation study examines alignment of the VERA-MH safety evaluation for AI chatbot suicide risk detection and response with safety ratings by expert human clinicians.

METHODS: We simulated a large set of conversations between large language model (LLM)-based users ("user-agents") spanning a wide range of suicide risk levels and disclosure styles and general-purpose AI chatbots. Licensed mental health clinicians from Spring Health used a scoring rubric developed for VERA-MH to independently rate the simulated conversations for safe and unsafe chatbot behaviors. An LLM-based evaluator (the "judge") used the same scoring rubric to evaluate the same set of conversations. We then examined rating alignment across (a) individual clinicians, (b) clinician consensus and the LLM judge, and (c) different judge LLMs. We also examined clinicians' ratings of user-agent realism, suicide risk, and disclosure.

RESULTS: Clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR]: 0.77), thus establishing a reliable clinical consensus reference. The LLM judge was strongly aligned with this clinical consensus reference (IRR: 0.81) when using the same scoring rubric. Ratings were stable across judge LLMs and evaluations. Clinicians' ratings of user-agent realism and how well the intended user-agent suicide risk and disclosure styles were reflected in the simulated conversations were mixed.

CONCLUSIONS: For the potential mental health benefits of AI chatbots to be realized, attention to safety is paramount. Findings support the reliability of VERA-MH: an open-source, fully automated AI safety evaluation for suicide risk detection and response. These results reflect an earlier version of the benchmark, and as VERA-MH continues to evolve, external validation of updated versions will be an important next step. Future research directions include VERA-MH generalizability and robustness, as well as expanding to target other key areas of AI safety for mental health.
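
The abstract reports chance-corrected inter-rater reliability but does not say here which statistic was used. As a rough illustration of what "chance-corrected" means, the sketch below computes Cohen's kappa between two raters. The choice of Cohen's kappa, the function name `cohen_kappa`, the binary "safe"/"unsafe" labels, and the example ratings are all assumptions made for illustration; none of them are taken from the study or from the VERA-MH code.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels, invented for illustration only (not data from the study).
clinician_consensus = ["safe", "safe", "unsafe", "safe", "unsafe", "unsafe", "safe", "safe"]
llm_judge = ["safe", "safe", "unsafe", "unsafe", "unsafe", "unsafe", "safe", "safe"]

kappa = cohen_kappa(clinician_consensus, llm_judge)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.75"
```

In this toy example the raters agree on 7 of 8 conversations (0.875), but half of that agreement would be expected by chance given how often each rater uses each label, so the corrected value is lower than the raw agreement rate.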

Authors

Kate H Bentley

Springer Nature (Germany)

Luca Belli

Springer Nature (Germany)

Adam M. Chekroud

Springer Nature (Germany)

Journals

JMIR AI

Institutions

Harvard University

University of California, Berkeley

Yale University
