What question did this study set out to answer?

This study set out to make the evaluation and explanation of multivariate time-series anomaly detection stricter and more verifiable.

June 4, 2026 | Open Access

A Practical Framework for Event-Level Evaluation and Verifiable Counterfactual Explanation in Multivariate Time-Series Anomaly Detection

Key Points

This research aims to improve the evaluation and explanation methods for multivariate time-series anomaly detection.
Used DCdetector as the main case study for event-level evaluation and explanation, rather than proposing a new detector architecture.
Conducted experiments on the SMAP, MSL, and HAI 21.03 datasets using full-coverage score export and standard event-level control metrics.
Ranked variables by counterfactual repair effect to assess their diagnostic value.
Point-adjusted scores were often much higher than the corresponding stricter event-level measurements.
Event-aware refinement improved event recovery and reduced detection delay in several settings, though its effect depended on the dataset and calibration.
Quantitative evidence indicated that the ranked variables were diagnostically informative.
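
To make the gap between point-adjusted and event-level scoring concrete, the following is a minimal sketch on toy data. It assumes the common point-adjustment convention (if any point inside a labeled anomaly segment is flagged, the whole segment is counted as detected). The helper names `events`, `point_adjust`, `f1`, and `event_recall_and_delay` are hypothetical, and the event recall and delay shown here are simple illustrative measures, not necessarily the event-level control metrics used in the study.

```python
from typing import List, Optional, Tuple


def events(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each run of 1s in labels."""
    spans, start = [], None
    for i, y in enumerate(labels):
        if y and start is None:
            start = i
        elif not y and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def point_adjust(pred: List[int], labels: List[int]) -> List[int]:
    """Mark a whole labeled segment as detected if any point in it was flagged."""
    adjusted = list(pred)
    for s, e in events(labels):
        if any(pred[s:e]):
            adjusted[s:e] = [1] * (e - s)
    return adjusted


def f1(pred: List[int], labels: List[int]) -> float:
    """Point-wise F1 score."""
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def event_recall_and_delay(
    pred: List[int], labels: List[int]
) -> Tuple[float, Optional[float]]:
    """Fraction of labeled events hit at least once, and mean delay to first hit."""
    spans = events(labels)
    delays = [
        next(i - s for i in range(s, e) if pred[i])
        for s, e in spans
        if any(pred[s:e])
    ]
    recall = len(delays) / len(spans) if spans else 0.0
    mean_delay = sum(delays) / len(delays) if delays else None
    return recall, mean_delay


# Toy data: two labeled events; the detector fires once, late in the first event.
labels = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

raw_f1 = f1(pred, labels)
pa_f1 = f1(point_adjust(pred, labels), labels)
recall, delay = event_recall_and_delay(pred, labels)
print(f"raw F1={raw_f1:.3f}  point-adjusted F1={pa_f1:.3f}")
print(f"event recall={recall:.2f}  mean delay={delay}")
```

In this toy case a single late alarm lifts the point-adjusted F1 well above the raw point-wise F1, while the event-level view still records that one of the two events was missed and that the first was caught four steps after onset.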

Abstract

Multivariate time-series anomaly detection is often evaluated with point-adjusted metrics, which can overstate practical performance when alarms are judged at the event level. Explanation results are also frequently reported as descriptive attributions without directly testing whether selected variables are useful for diagnosis. This study revisits these issues through unified event-level evaluation and repair-based explanation, using DCdetector as the main case study rather than proposing a new detector architecture. Experiments on SMAP, MSL, and HAI 21.03 use full-coverage score export and standard event-level control metrics. The results show that point-adjusted scores can be much higher than stricter event-level measurements. Event-aware refinement changes the detection trade-off by improving event recovery and reducing delay in several settings, but its effect is dataset- and calibration-dependent. For explanation, variables are ranked by exact marginal counterfactual repair effect and evaluated by whether repair reduces anomaly scores more than random or heuristic alternatives. The results provide quantitative evidence that the ranked variables are diagnostically informative, while exact marginal verification is computationally expensive and better suited to offline alarm review and post hoc diagnosis than latency-critical deployment. Auxiliary checks with TranAD, Anomaly-Transformer, and DADA support the plausibility of the main observations, but the evidence remains detector-conditioned rather than a fully backbone-agnostic benchmark. Overall, this work provides a stricter and more verifiable protocol for evaluating anomaly detection, event-aware refinement, and explanation quality in multivariate time-series monitoring.
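
The repair-based ranking described in the abstract can be illustrated with a small sketch. It assumes that "repairing" a variable means replacing its observed value with a reference (normal) value, and that the marginal effect is the resulting drop in anomaly score. The names `repair_effects`, `rank_variables`, and `toy_score` are hypothetical, and the squared-deviation scorer is a stand-in for a real detector such as DCdetector; the study's exact procedure may differ.

```python
from typing import Callable, List, Sequence


def repair_effects(
    x: Sequence[float],
    reference: Sequence[float],
    score: Callable[[Sequence[float]], float],
) -> List[float]:
    """Score drop obtained by repairing each variable on its own.

    Assumption: a repair replaces variable j with its reference value.
    """
    base = score(x)
    effects = []
    for j in range(len(x)):
        repaired = list(x)
        repaired[j] = reference[j]  # counterfactual: variable j looks normal
        effects.append(base - score(repaired))
    return effects


def rank_variables(effects: Sequence[float]) -> List[int]:
    """Variable indices ordered from largest to smallest repair effect."""
    return sorted(range(len(effects)), key=lambda j: effects[j], reverse=True)


# Hypothetical reference profile and toy anomaly scorer (not a real detector).
reference = [0.0, 0.0, 0.0, 0.0]


def toy_score(x: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, reference))


x = [0.1, 3.0, -0.2, 1.0]  # variable 1 deviates most from the reference
effects = repair_effects(x, reference, toy_score)
ranking = rank_variables(effects)
print("repair effects:", [round(e, 2) for e in effects])
print("ranking:", ranking)
```

In this sketch every variable requires its own scorer call on top of the baseline call, so the cost grows with the number of variables per alarm. That is consistent with the abstract's remark that exact marginal verification suits offline alarm review better than latency-critical deployment. Checking a ranking against random or heuristic alternatives would amount to repairing the top-ranked variables and comparing the score drop with the drop from the alternative selection.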
