What question did this study set out to answer?

To evaluate the performance of Agentic AI in multilingual risk classification for medical chatbots.

April 1, 2026 | Open Access

IMNTPU at NTCIR-18 MedNLP-CHAT Task: Evaluating Agentic AI for Multilingual Risk Assessment in Medical Chatbots

Key Points

Multilingual evaluation of Agentic AI for chatbot risk classification.
Integration of fine-tuned small models and optimized few-shot prompting with GPT-4o.
Majority and trust-weighted voting for multi-agent aggregation.
Agentic AI improves decision consistency, especially in subjective tasks such as ethical risk assessment.
Gains are limited in structured domains such as medical and legal risk assessment.
Japanese systems show the most stable performance.
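
The two aggregation schemes named in the key points can be illustrated with a minimal sketch. This is not the team's implementation: the function names, the labels `"risk"` and `"no_risk"`, and the trust weights are hypothetical, and the summary does not say how the paper derives its trust weights.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def trust_weighted_vote(labels, trust):
    """Add each agent's trust weight to its chosen label; return the heaviest label."""
    scores = defaultdict(float)
    for label, weight in zip(labels, trust):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical votes from three agents, with assumed trust weights.
votes = ["risk", "no_risk", "no_risk"]
trust = [0.9, 0.3, 0.4]

majority_label = majority_vote(votes)
weighted_label = trust_weighted_vote(votes, trust)
```

With these made-up numbers the two schemes disagree: two of three agents say `"no_risk"`, but the single highly trusted agent outweighs them (0.9 against 0.7), so the weighted vote returns `"risk"`.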

Abstract

The IMNTPU team presents a multilingual evaluation of Agentic AI for chatbot risk classification in the NTCIR-18 MedNLP-CHAT task. Our framework integrates fine-tuned small models, optimized few-shot prompting with GPT-4o, and multi-agent aggregation via majority and trust-weighted voting. Results show that Agentic AI enhances decision consistency, especially in subjective tasks like ethical risk, but yields limited gains in structured domains such as medical and legal assessment. Language-specific outcomes reveal that annotation quality and linguistic complexity jointly affect model performance, with Japanese systems showing the most stability. Confidence analysis highlights a decoupling between model certainty and accuracy, underscoring the need for adaptive trust and calibration strategies. Building on these insights, we propose a Trust-Guided Agentic AI architecture featuring self-consistency filtering, dynamic trust updating, and Chain-of-Thought prompting to further improve reliability in safety-critical AI systems.
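
One common way to quantify a gap between model certainty and accuracy is expected calibration error (ECE). The sketch below is only an illustration of that general metric. The abstract does not state which confidence analysis the authors ran, and the function name, bin count, and sample values are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Size-weighted mean of |average confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: very confident, yet only half are correct.
confs = [0.95, 0.95, 0.95, 0.95]
hits = [True, False, True, False]
ece = expected_calibration_error(confs, hits)
```

A value near zero means stated confidence tracks accuracy. Here the model claims 95% confidence while being right half the time, so the score is about 0.45. That is the kind of mismatch that would argue against using raw confidence as a trust weight.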
