The NTCIR-18 MedNLP-CHAT RISK task evaluates the potential medical, ethical, and legal risks posed by chatbot-generated responses to patient inquiries. This study investigates a sentence-level risk classification approach to identify specific sentences within chatbot responses that contribute to risk assessment rather than treating entire responses as monolithic risk units. Our methodology involved automatic sentence segmentation, contextual risk annotation, and threshold-based classification, leveraging traditional natural language processing (NLP) models instead of large language models (LLMs) to ensure interpretability and stability. Despite the conceptual validity of our approach, our system did not perform competitively, particularly in ethical and legal risk classification. A key limitation was using a single model for all risk types, which failed to capture the nuanced distinctions between medical, ethical, and legal risk factors. Additionally, dataset constraints and class imbalance (fewer than 30 positive samples per risk category) limited model generalization. While sentence-level annotation improved granularity, it introduced challenges in handling cross-sentence risk dependencies, where risks emerge from multi-sentence interactions rather than isolated statements. Our findings highlight the need for more advanced risk classification frameworks, incorporating sequence-aware models, domain-specific fine-tuning, and context-sensitive risk evaluation. We also discuss the cultural relativity of risk perception, emphasizing that risk assessments should account for jurisdictional differences in medical, legal, and ethical norms. Future research should explore hybrid NLP architectures, data augmentation techniques, and adaptive risk modeling to enhance chatbot safety and reliability in medical AI applications.
Shao et al. studied this question.