The comprehensibility and human interpretation of classification models are crucial in many applications, such as decision support systems and knowledge discovery, where explanations drive action. However, class label noise, which is widespread in real-life data, can significantly affect both the performance and the interpretability of data models. This study addresses the problem of interpretability robustness by examining the impact of class label noise on rule-learning models, which are used extensively for discovering transparent, human-readable interpretations of hidden data patterns and decision logic. Our empirical results demonstrate that while model performance may remain stable under increasing label noise, the consistency of the learned rules suffers significantly. These results reveal a novel and critical phenomenon, interpretation drift, in which model explanations change substantially under label noise even when predictive performance remains stable. This phenomenon can directly impact AI-informed decisions but is not detectable through conventional performance metrics, and it therefore poses significant risks in real-world applications reliant on AI explanations. Our findings emphasize the need for standardized, interpretability-aware robustness metrics in the development of trustworthy explainable AI.
Raikovskaia et al. studied this question.
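
The comparison the abstract describes, tracking predictive performance and rule consistency side by side as label noise grows, could be set up roughly as in the sketch below. It is an illustration only and not the authors' method. The toy rule learner (one "feature = value, then class" rule per observed feature value), the Jaccard overlap used as a consistency score, and all function names are assumptions introduced here. For brevity, accuracy is scored against the clean labels of the same rows used for training.

```python
import random

def flip_labels(labels, rate, classes, rng):
    """Replace each label, with probability `rate`, by a different class."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

def learn_rules(rows, labels):
    """Toy learner (hypothetical): map each (feature index, value) condition
    to its majority class and the fraction of matching rows in that class."""
    counts = {}
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            by_class = counts.setdefault((j, v), {})
            by_class[y] = by_class.get(y, 0) + 1
    rules = {}
    for cond, by_class in counts.items():
        # Iterating in sorted order makes ties resolve to the smallest class.
        best = max(sorted(by_class), key=lambda c: by_class[c])
        rules[cond] = (best, by_class[best] / sum(by_class.values()))
    return rules

def predict(rules, row, default):
    """Predict with the matching rule that has the highest confidence."""
    matches = [rules[(j, v)] for j, v in enumerate(row) if (j, v) in rules]
    if not matches:
        return default
    return max(matches, key=lambda m: m[1])[0]

def accuracy(rules, rows, labels, default=0):
    hits = sum(predict(rules, r, default) == y for r, y in zip(rows, labels))
    return hits / len(labels)

def rule_set(rules):
    """Strip confidences so rules can be compared as (feature, value, class)."""
    return {(j, v, c) for (j, v), (c, _) in rules.items()}

def jaccard(a, b):
    """Overlap of two rule sets: 1.0 means identical, 0.0 means disjoint."""
    return 1.0 if not (a or b) else len(a & b) / len(a | b)

def drift_report(rows, labels, noise_rates, seed=0):
    """For each noise rate, return (rate, accuracy on clean labels,
    rule overlap with the model trained on clean labels)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    clean = rule_set(learn_rules(rows, labels))
    report = []
    for rate in noise_rates:
        noisy = learn_rules(rows, flip_labels(labels, rate, classes, rng))
        report.append((rate, accuracy(noisy, rows, labels),
                       jaccard(clean, rule_set(noisy))))
    return report

if __name__ == "__main__":
    # Synthetic data (assumption): the class equals feature 0,
    # and features 1 and 2 are uninformative.
    rng = random.Random(42)
    rows = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(500)]
    labels = [r[0] for r in rows]
    for rate, acc, overlap in drift_report(rows, labels, [0.0, 0.1, 0.2, 0.3]):
        print(f"noise={rate:.1f}  accuracy={acc:.3f}  rule overlap={overlap:.3f}")
```

Reporting the two numbers together is the point of the design. An accuracy column that stays flat while the overlap column falls would correspond to the situation the abstract calls interpretation drift, which accuracy alone would not reveal.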