BACKGROUND Assessing chatbot responses across three risk domains (medical, ethical, and legal) is essential to ensuring the safe use of AI in healthcare. Although advances in LLMs have brought significant improvements in evaluation on question-answer datasets drawn from multiple-choice medical exams, existing systems have three limitations: they use general-purpose LLMs without specialized domain knowledge, they rely on standardized instructions without integrating real-world information, and they apply ensemble methods such as majority voting that fail to resolve disagreement among agents. These limitations lead to misclassification and make risk assessment difficult.
OBJECTIVE This study aims to design, develop, and evaluate a synergistic approach for assessing risks associated with chatbot responses using multi-assessment and multi-professional agents.
METHODS We designed and developed a multi-assessment, multi-professional agent approach. Initial Assessment (MA1) internalizes three roles and provides an initial risk estimation. Verification Assessment (MA2) incorporates a multi-professional agent for each risk domain (medical, ethical, legal). Final Assessment (MA3) aims to reach a final consensus based on the previous assessments (MA1 and MA2). MA1 and MA3 each use one LLM. The proposed approach was evaluated on different systems (baseline, enhanced prompt, embedding-based search, and RAG) using metrics such as macro F1-score, joint accuracy, and delta (Δ).
RESULTS The proposed approach demonstrates a significant improvement over existing systems in assessing the risk of chatbot responses, with a 0.25 increase in the ethical risk domain and a 0.10 increase in the legal risk domain. This indicates that applying the proposed approach in systems with external knowledge helps improve risk estimation. The medical domain remains a challenge, showing only a slight improvement of 0.07.
CONCLUSIONS A multi-assessment and multi-professional agent approach is effective for estimating risk in chatbot responses. These findings highlight the potential of the approach and of developing a specialized LLM for more robust and contextually grounded risk estimation.
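The evaluation metrics named in METHODS can be illustrated with a minimal sketch. The abstract does not define them, so the following are assumptions: joint accuracy is taken to be the fraction of responses whose medical, ethical, and legal labels are all predicted correctly, and delta (Δ) is taken to be a system's score minus the baseline score. The function names, the label values, and the toy data are hypothetical and are not taken from the study.

```python
# Assumed risk domains, as listed in the abstract.
DOMAINS = ("medical", "ethical", "legal")


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def joint_accuracy(gold, pred):
    """ASSUMPTION: a response counts as correct only if all three domains match."""
    hits = sum(all(g[d] == p[d] for d in DOMAINS) for g, p in zip(gold, pred))
    return hits / len(gold)


def delta(score, baseline_score):
    """ASSUMPTION: delta is the signed difference from the baseline score."""
    return score - baseline_score


# Hypothetical toy data: one risk label per domain for each chatbot response.
gold = [
    {"medical": "high", "ethical": "low", "legal": "low"},
    {"medical": "low", "ethical": "high", "legal": "low"},
]
pred = [
    {"medical": "high", "ethical": "low", "legal": "low"},
    {"medical": "low", "ethical": "low", "legal": "low"},
]

ethical_f1 = macro_f1([g["ethical"] for g in gold], [p["ethical"] for p in pred])
joint = joint_accuracy(gold, pred)
```

Under these assumptions, a reported increase such as 0.25 would correspond to `delta(system_score, baseline_score)` for a given domain; the study's actual definitions may differ.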
Tamayo et al. (Tue,) studied this question.