What question did this study set out to answer?

This research aims to evaluate the predictive performance and physical consistency of deep learning models for wildfire danger prediction.

May 2, 2026 · Open Access

Assessing and explaining temporal deep learning models for wildfire danger prediction

Key Points

Assessed seven temporal deep learning models against random forest and XGBoost for next-day wildfire prediction.
Applied explainable AI methods to interpret model attributions and evaluate alignment with fire science.
Conducted case studies on two fire events in Spain to analyze model predictions.
All deep learning models outperformed the random forest and XGBoost baselines, with Transformer models achieving the highest predictive accuracy (F1-score 0.81).
Of 19 expected fire-driver relationships, the RF and XGBoost models captured 13 and 12, respectively, while the deep learning models captured at most 11.
Case studies showed that differences in driver representation can lead to divergent predictions: the Transformer correctly predicted a heatwave-driven fire but missed a lightning-caused one.
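
The relationship counts above are easier to compare as shares of the 19 expected fire-driver relationships stated in the abstract. A minimal sketch of that arithmetic follows; the dictionary keys are illustrative labels, and the DL entry is the reported upper bound ("at most 11"), not the result of any single named model.

```python
# Counts as reported in the summary; 19 is the number of expected
# fire-driver relationships given in the abstract.
EXPECTED_RELATIONSHIPS = 19

captured = {
    "RF": 13,
    "XGB": 12,
    "DL (upper bound)": 11,  # "at most 11" across the deep learning models
}

# Share of expected relationships each model family reproduces.
shares = {name: count / EXPECTED_RELATIONSHIPS for name, count in captured.items()}

for name, share in shares.items():
    print(f"{name}: {captured[name]}/{EXPECTED_RELATIONSHIPS} = {share:.0%}")
```

On these counts, the tree-based baselines reproduce roughly two thirds of the expected relationships, and the deep learning models fewer than 60%.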

Abstract

Modern methods for wildfire danger prediction are critical for mitigating the detrimental impacts of fires on ecosystems, public health, and the economy. While machine learning has emerged as a powerful approach to model the complex interactions driving wildfire risk, its ‘black-box’ nature creates a trade-off between predictive skill and the physical plausibility and interpretability required for trustworthy risk assessments. In this study, we systematically assess the predictive performance and physical consistency of seven temporal deep learning (DL) models against two decision tree-based baselines, random forest (RF) and XGBoost (XGB), for next-day wildfire danger prediction in the Mediterranean. We apply explainable AI (xAI) methods to interpret model attributions and assess their broad alignment with established fire science. Results show that all DL models outperform the RF and XGB baselines. Transformer models achieve the highest predictive accuracy (F₁-score 0.81), significantly outperforming the RF baseline (post-hoc Dunn test, p < 10⁻⁵) by effectively capturing long-range temporal dependencies. However, xAI analyses reveal a key trade-off: despite their higher predictive performance, DL models exhibit lower physical consistency in their averaged driver relationships. Specifically, when evaluated against 19 expected fire-driver relationships, RF and XGB correctly capture 13 and 12 relationships, respectively, whereas DL models capture at most 11. We further investigate how Transformers generate individual wildfire danger predictions through case studies of two similar large fire events in Spain, one correctly predicted (true positive) and one missed (false negative). The analysis demonstrates how differences in driver representation can lead to divergent predictions, such as correctly identifying a heatwave-driven event but missing a lightning-induced ignition.
Together, these investigations provide a structured evaluation of a wide range of DL models in terms of their predictive accuracy and physical consistency, offering guidance for future wildfire danger forecasting in fire-prone regions, such as the Mediterranean.
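
For readers less familiar with the headline metric, the F₁-score is the harmonic mean of precision and recall for the positive (fire) class. The sketch below shows the standard definition only; the function name and the confusion counts are hypothetical and are not taken from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (fire) class."""
    if tp == 0:
        # No true positives: precision and recall are zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of predicted fires that were real
    recall = tp / (tp + fn)     # share of real fires that were predicted
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; not the study's data.
example = f1_score(tp=8, fp=2, fn=2)
print(round(example, 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F₁-score by trading many missed fires for few false alarms, or the reverse.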
