Abstract

Background: Deep learning, particularly encoder-only transformer architectures, has demonstrated excellent performance in biomedical literature classification, facilitating evidence-based medicine and knowledge synthesis. However, the opacity of these models’ decision-making processes limits their clinical interpretability, trustworthiness, and widespread adoption. Traditional explainable artificial intelligence methods, such as Shapley Additive Explanations (SHAP) and integrated gradients (IG), address this issue but often incur substantial computational overhead for text classification. Generative large language models, acting as autonomous agents, may offer a novel approach to generating interpretable, context-aware explanations.

Objective: As a proof of concept, this study aimed to investigate the effectiveness of GPT-4o as a standalone, end-to-end perturbation-based explainer for a BioLinkBERT text classifier. We compared its explanations against the SHAP partition explainer and IG as established baselines in terms of explanation faithfulness and semantic alignment.

Methods: A stratified sample of 200 studies from the McMaster Premium Literature Service (PLUS) and Clinical Hedges databases was classified by a fine-tuned BioLinkBERT model for methodological rigor. An equal number of studies was sampled from each decile of the probability predicted by BioLinkBERT, which deliberately over-represented difficult, low-confidence predictions to test the explainers rigorously. GPT-4o, SHAP, and IG generated token-level feature attributions across a feature space of 80,901 tokens. GPT-based explanations were derived through an iterative masking perturbation workflow under 2 prompting schemes (token indices vs explicit subword tokens). Explanations were evaluated using a rank-based, modified area over the perturbation curve (AOPC), pairwise correlation analyses, and qualitative assessment of feature importance.
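As an illustration of the faithfulness metric named in the Methods, the following is a minimal sketch of a conventional AOPC: tokens are masked in descending order of attribution, and the resulting drops in predicted probability are averaged. It is not the rank-based, modified variant used in the study, whose exact definition is not given here. The function `aopc`, the `[MASK]` placeholder, and the toy classifier `toy_predict` are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Optional, Sequence


def aopc(
    tokens: Sequence[str],
    scores: Sequence[float],
    predict: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
    max_k: Optional[int] = None,
) -> float:
    """Average drop in predicted probability as top-ranked tokens are masked.

    Conventional AOPC (assumption: NOT the study's modified, rank-based form).
    """
    # Rank token positions from most to least important according to the explainer.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    if max_k is None:
        max_k = len(tokens)

    base = predict(list(tokens))      # probability on the unperturbed input
    perturbed = list(tokens)
    drops = []
    for i in order[:max_k]:
        perturbed[i] = mask_token     # cumulative masking: earlier masks stay in place
        drops.append(base - predict(perturbed))
    return sum(drops) / len(drops) if drops else 0.0


def toy_predict(tokens: List[str]) -> float:
    """Hypothetical stand-in classifier: rewards two rigor-related words."""
    return 0.1 + 0.4 * ("randomized" in tokens) + 0.4 * ("blind" in tokens)


tokens = ["a", "randomized", "blind", "trial"]
faithful = aopc(tokens, [0.0, 0.9, 0.8, 0.1], toy_predict)    # ranks the key words first
unfaithful = aopc(tokens, [0.9, 0.1, 0.0, 0.8], toy_predict)  # ranks the key words last
print(faithful, unfaithful)
```

In this toy setting, an explainer that ranks the decisive tokens first yields a larger AOPC than one that ranks them last, which is the sense in which a higher AOPC indicates a more faithful explanation.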
Results: Among the 200 studies, 80,901 tokens were included, and feature attributions were generated by the 4 explainers (6369 unique tokens). SHAP (AOPC 0.222, 95% CI 0.200-0.244) and IG (AOPC 0.225, 95% CI 0.202-0.247) provided consistent explanations, effectively identifying tokens relevant to study rigor (eg, “randomized” and “blind”). In contrast, despite evaluating a larger perturbation space, the GPT-4o prompting schemes did not achieve comparable faithfulness (AOPC 0.025-0.029) and produced divergent token attributions. Correlation analysis demonstrated moderate alignment between SHAP and IG (Pearson r=0.367), whereas GPT-4o exhibited limited correlation (Pearson r≤0.032) with the established baselines. Sensitivity analyses restricted to correctly classified instances yielded similar trends. Additionally, the iterative application programming interface calls required for GPT-4o made it substantially more computationally intensive and costly to execute, whereas IG was the most time-efficient.

Conclusions: Despite their advanced contextual capabilities, current generative large language models are limited when deployed as standalone perturbation explainers. The findings indicate that GPT-4o struggles to accurately synthesize mathematical feature importance through iterative masking and lacks the reliability of traditional explainable artificial intelligence frameworks. Future research could build upon this work and investigate specialized prompt engineering, whole-word recombination strategies, and hybrid frameworks.
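The pairwise correlation analysis reported in the Results can be illustrated with a small sketch that computes Pearson r between two explainers' attribution vectors over the same tokens. This assumes that the attributions are already aligned token by token. The function `pearson_r` and the numeric vectors are hypothetical and are not data from the study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length attribution vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, left unnormalized because the factor cancels.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-token attributions from two explainers for one abstract.
shap_scores = [0.50, 0.10, -0.20, 0.00]
ig_scores = [0.40, 0.20, -0.10, 0.05]
print(pearson_r(shap_scores, ig_scores))
```

A value near 1 would indicate that the two explainers rank and weight tokens similarly, while a value near 0, as reported for GPT-4o against the baselines, indicates little linear agreement.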
Zhou et al. (Wed,) studied this question.