DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models
Key Points
The best submission achieved a DCHR of 0.50, indicating a moderate level of agreement in the LLMs' depression detection outputs.
Grounding the prompt-engineering methodology in BDI-II criteria enabled structured assessment of conversational cues.
The best submission ranked second on the official leaderboard with an ADODL of 0.89, indicating strong model consistency and internal agreement.
Evaluation focused on cross-model agreement because ground-truth labels were unavailable for comparison.
Abstract
This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled analysis of the conversational cues that influenced symptom predictions. Our best submission, which ranked second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.
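As a rough illustration of how cross-model agreement on structured BDI-II outputs could be measured without ground truth, the following minimal Python sketch parses one JSON output per model and compares totals pairwise. The JSON schema (an "items" object mapping 21 item names to scores of 0 to 3) and the two agreement measures are assumptions made for illustration, not the team's actual format or the official eRisk metrics. The severity bands are the standard BDI-II cut-offs.

```python
import json
from itertools import combinations

# Standard BDI-II severity bands for the 0-63 total score.
BANDS = [(13, "minimal"), (19, "mild"), (28, "moderate"), (63, "severe")]
MAX_TOTAL = 63  # 21 items, each scored 0-3


def total_score(raw_json: str) -> int:
    """Sum the 21 item scores from one model's JSON output.

    Assumes a hypothetical schema: {"items": {"<item name>": <0-3>, ...}}.
    """
    items = json.loads(raw_json)["items"]
    scores = list(items.values())
    if len(scores) != 21 or any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("expected 21 item scores, each in the range 0-3")
    return sum(scores)


def severity(total: int) -> str:
    """Map a BDI-II total score to its severity band."""
    for upper, label in BANDS:
        if total <= upper:
            return label
    raise ValueError("total score out of range")


def pairwise_agreement(totals: dict) -> dict:
    """Average two illustrative agreement measures over all model pairs.

    mean_closeness: 1 minus the absolute score difference divided by 63.
    band_agreement: fraction of pairs assigned the same severity band.
    """
    pairs = list(combinations(sorted(totals), 2))
    if not pairs:
        raise ValueError("need at least two models to compare")
    closeness = [
        (MAX_TOTAL - abs(totals[a] - totals[b])) / MAX_TOTAL for a, b in pairs
    ]
    same_band = [severity(totals[a]) == severity(totals[b]) for a, b in pairs]
    return {
        "mean_closeness": sum(closeness) / len(pairs),
        "band_agreement": sum(same_band) / len(pairs),
    }
```

Given a mapping from (hypothetical) model names to raw JSON strings, one would first build `totals = {name: total_score(raw) for name, raw in outputs.items()}` and then call `pairwise_agreement(totals)`. High closeness with low band agreement would suggest that the models produce similar totals that straddle a severity cut-off.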