ABSTRACT A common approach to quantifying neural text classifier interpretability is to calculate faithfulness metrics based on iteratively masking salient input tokens and measuring changes in the model prediction. We propose that this property is better described as “sensitivity to iterative masking,” and highlight pitfalls in using this measure for comparing text classifier interpretability. We show that iterative masking produces large variation in faithfulness scores between otherwise comparable Transformer encoder text classifiers. We then demonstrate that iteratively masked samples produce embeddings outside the distribution seen during training, resulting in unpredictable behavior. We further explore task‐specific considerations that undermine principled comparison of interpretability using iterative masking, including an underlying similarity to salience‐based adversarial attacks. A model's initialization‐specific behavior with respect to iterative masking can persist through fine‐tuning: for example, models specifically fine‐tuned from the fairness‐optimized foundation model FairBERTa consistently score lower on faithfulness than models fine‐tuned from RoBERTa, despite comparable classification performance. The opposite tendency is observed between BERT‐CDA and BERT, demonstrating an inconsistent impact from fairness optimization on faithfulness scores that is likely overwhelmed by initialization‐specific behavior. Our findings give insight into how these behaviors affect neural text classifiers, and provide guidance on how sensitivity to iterative masking should be interpreted.
Crothers et al. studied this question.
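The iterative masking procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy classifier, its weights, the leave-one-out salience estimate, and every name below (`TOY_WEIGHTS`, `predict_proba`, `salience`, `iterative_masking_curve`, `sensitivity_score`) are hypothetical stand-ins chosen for illustration. The sketch repeatedly masks the currently most salient token and records how the predicted probability changes, then summarizes the curve as a mean probability drop.

```python
import math

MASK = "[MASK]"

# Hypothetical toy weights standing in for a fine-tuned text classifier.
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -1.5, "bad": -2.0}


def predict_proba(tokens):
    """Toy positive-class probability: logistic function of summed token weights."""
    score = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def salience(tokens):
    """Leave-one-out salience (an assumed attribution method):
    the drop in probability when a single token is masked."""
    base = predict_proba(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        if tok == MASK:
            # Already-masked positions are never selected again.
            scores.append(float("-inf"))
            continue
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        scores.append(base - predict_proba(masked))
    return scores


def iterative_masking_curve(tokens, steps):
    """Mask the most salient remaining token at each step and
    record the probability after every masking operation."""
    tokens = list(tokens)
    curve = [predict_proba(tokens)]
    for _ in range(min(steps, len(tokens))):
        scores = salience(tokens)  # salience is recomputed after each mask
        idx = max(range(len(tokens)), key=scores.__getitem__)
        tokens[idx] = MASK
        curve.append(predict_proba(tokens))
    return curve


def sensitivity_score(curve):
    """Mean drop from the original probability across masking steps."""
    base = curve[0]
    drops = [base - p for p in curve[1:]]
    return sum(drops) / len(drops) if drops else 0.0


example = ["a", "great", "and", "good", "film"]
curve = iterative_masking_curve(example, steps=2)
score = sensitivity_score(curve)
```

In this toy setting the probability falls monotonically as "great" and then "good" are masked. With a real Transformer encoder, the masked inputs may fall outside the training distribution, which is the reason the abstract cautions against reading such scores as a direct measure of interpretability.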