What question did this study set out to answer?

This research aims to classify risks associated with chatbot responses to patient inquiries, focusing on sentence-level analysis.

April 1, 2026 | Open Access

TMULLA at the NTCIR-18 MedNLP-CHAT Task

Key Points

Implemented automatic sentence segmentation
Conducted contextual risk annotation
Utilized threshold-based classification with traditional NLP models
Addressed risks based on specific sentences instead of entire responses
System did not perform competitively, particularly in ethical and legal risk classification
Single-model approach failed to distinguish between medical, ethical, and legal risk factors
Dataset constraints and class imbalance (fewer than 30 positive samples per risk category) limited generalization
Sentence-level analysis enhanced granularity but complicated handling of cross-sentence risks

Abstract

The NTCIR-18 MedNLP-CHAT RISK task evaluates the potential medical, ethical, and legal risks posed by chatbot-generated responses to patient inquiries. This study investigates a sentence-level risk classification approach to identify specific sentences within chatbot responses that contribute to risk assessment rather than treating entire responses as monolithic risk units. Our methodology involved automatic sentence segmentation, contextual risk annotation, and threshold-based classification, leveraging traditional natural language processing (NLP) models instead of large language models (LLMs) to ensure interpretability and stability. Despite the conceptual validity of our approach, our system did not perform competitively, particularly in ethical and legal risk classification. A key limitation was using a single model for all risk types, which failed to capture the nuanced distinctions between medical, ethical, and legal risk factors. Additionally, dataset constraints and class imbalance (fewer than 30 positive samples per risk category) limited model generalization. While sentence-level annotation improved granularity, it introduced challenges in handling cross-sentence risk dependencies, where risks emerge from multi-sentence interactions rather than isolated statements. Our findings highlight the need for more advanced risk classification frameworks, incorporating sequence-aware models, domain-specific fine-tuning, and context-sensitive risk evaluation. We also discuss the cultural relativity of risk perception, emphasizing that risk assessments should account for jurisdictional differences in medical, legal, and ethical norms. Future research should explore hybrid NLP architectures, data augmentation techniques, and adaptive risk modeling to enhance chatbot safety and reliability in medical AI applications.
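
The pipeline described in the abstract (sentence segmentation, per-sentence scoring, threshold-based classification) can be illustrated with a minimal sketch. This is not the authors' system: the cue words, weights, bias, and threshold below are hypothetical stand-ins for a trained traditional model such as logistic regression over bag-of-words features, and the regex-based splitter is an assumed simplification of automatic sentence segmentation.

```python
import math
import re

# Hypothetical cue weights standing in for a trained traditional model.
# These values are illustrative assumptions, not taken from the paper.
CUE_WEIGHTS = {
    "stop": 1.5,
    "dose": 1.2,
    "without": 0.8,
    "doctor": -0.6,
    "consult": -1.0,
}
BIAS = -1.0        # assumed intercept
THRESHOLD = 0.5    # assumed decision threshold


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def risk_score(sentence):
    """Logistic score in (0, 1) computed from the summed cue weights."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    z = BIAS + sum(CUE_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def classify_response(text, threshold=THRESHOLD):
    """Flag each sentence whose score meets the threshold.

    The response counts as risky if at least one sentence is flagged.
    """
    scored = [(s, risk_score(s)) for s in split_sentences(text)]
    flagged = [(s, p) for s, p in scored if p >= threshold]
    return {"risky": bool(flagged), "flagged": flagged}


if __name__ == "__main__":
    reply = (
        "You can stop taking the medication without asking anyone. "
        "Please consult your doctor if symptoms persist."
    )
    result = classify_response(reply)
    for sentence, score in result["flagged"]:
        print(f"{score:.2f}  {sentence}")
```

Because each sentence is scored in isolation, a sketch like this cannot detect risks that only emerge from the interaction of several sentences, which is the cross-sentence dependency limitation the abstract points out.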
