Explainable AI (XAI) in clinician-facing clinical decision support (CDS) is increasingly promoted to enhance transparency, yet prior evidence suggests that explanations do not consistently improve clinical decision-making and may occasionally exacerbate errors. This critical systematic review and evidence map aimed to (i) synthesize human-centered evaluations of explainable clinician-facing CDS, and (ii) construct an evidence map linking explanation types, clinical tasks, evaluation settings, and outcome directions for decision quality, reliance calibration, and usability. Database searches were conducted in PubMed, Google Scholar, and Semantic Scholar through January 2026. Studies were included if they empirically evaluated an AI-based CDS system with an explanation condition, involved clinicians or trainees performing clinical decision tasks, and reported human-centered outcomes. Thirty-four studies met inclusion criteria. Data extraction, critical appraisal, and evidence mapping were performed, with effect directions coded as positive, mixed/null, or negative across outcome families. Included studies disproportionately used vignette, reader, or simulation paradigms rather than workflow-embedded deployments. Across larger controlled experiments, explanations frequently increased perceived trust and acceptance but did not reliably improve decision quality. In several large studies, explanations worsened diagnostic accuracy when AI advice was incorrect or biased. The most promising signals for reliance calibration were concentrated in counterfactual and retrieval-based explanations, which reduced over-reliance on incorrect AI outputs. In contrast, generic feature-attribution displays (e.g., SHAP) showed limited incremental benefit beyond AI advice alone. Some studies reported increased cognitive load and task time with explanations, particularly when these were dense or poorly integrated.
Explanations in clinician-facing CDS often increase perceived trust and acceptance without reliably improving decision quality, and they can amplify harm when AI advice is incorrect or biased. Future evaluations should prioritize appropriate-reliance metrics stratified by AI correctness, incorporate objective workload and attention measures, and test explanation interfaces in workflow-realistic settings.
Haddadian et al. studied this question.
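To make the recommendation on appropriate-reliance metrics stratified by AI correctness concrete, the following is a minimal sketch of one way such a metric could be computed from trial-level data. It is illustrative only and is not the review's analysis method. The `Trial` record, the condition labels, and the two rate definitions (over-reliance as the share of incorrect-AI trials in which the clinician followed the advice, under-reliance as the share of correct-AI trials in which the clinician rejected it) are assumptions chosen for illustration, and the input is toy data, not results from any included study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One clinician decision on one case (hypothetical record layout)."""
    condition: str     # study arm, e.g. "ai_only" or "ai_plus_explanation"
    ai_correct: bool   # whether the AI advice was correct for this case
    followed_ai: bool  # whether the clinician's final answer matched the advice


def reliance_by_correctness(trials):
    """Return over- and under-reliance rates per condition.

    Assumed definitions:
      over_reliance  = followed AI / all trials where the AI was wrong
      under_reliance = rejected AI / all trials where the AI was right
    A rate is None when its stratum has no trials.
    """
    counts = defaultdict(
        lambda: {"n_wrong": 0, "followed_wrong": 0, "n_right": 0, "rejected_right": 0}
    )
    for t in trials:
        c = counts[t.condition]
        if t.ai_correct:
            c["n_right"] += 1
            c["rejected_right"] += int(not t.followed_ai)
        else:
            c["n_wrong"] += 1
            c["followed_wrong"] += int(t.followed_ai)

    summary = {}
    for condition, c in counts.items():
        summary[condition] = {
            "over_reliance": c["followed_wrong"] / c["n_wrong"] if c["n_wrong"] else None,
            "under_reliance": c["rejected_right"] / c["n_right"] if c["n_right"] else None,
        }
    return summary


# Toy input for illustration only; these numbers are made up.
toy_trials = (
    [Trial("ai_only", False, True)] * 3
    + [Trial("ai_only", False, False)] * 1
    + [Trial("ai_only", True, True)] * 3
    + [Trial("ai_only", True, False)] * 1
    + [Trial("ai_plus_explanation", False, True)] * 2
    + [Trial("ai_plus_explanation", False, False)] * 2
    + [Trial("ai_plus_explanation", True, True)] * 3
    + [Trial("ai_plus_explanation", True, False)] * 1
)

summary = reliance_by_correctness(toy_trials)
for condition, rates in summary.items():
    print(condition, rates)
```

Reporting the two strata separately, rather than a single pooled accuracy or agreement rate, is what allows an evaluation to distinguish an explanation that improves calibration from one that simply raises agreement with the AI regardless of its correctness.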