What question did this study set out to answer?

The aim is to establish a standardized benchmark workflow for evaluating statistical downscaling methods in climate applications.

March 7, 2026 | Open Access

A Workflow for Benchmarking Added Value by New Statistical Downscaling Methods

Key Points

Developed a transparent workflow for benchmarking downscaling methods.
Outlined a protocol including reference model, calibration criteria, and skill diagnostics.
Conducted tests on daily rainfall series and extremes in northern Serbia using automated and manual calibration approaches.
Machine learning (ML) methods ranked first in 42% of all diagnostic tests, with their clearest gains for extreme rainfall.
Automatically calibrated SDSM ranked first in 33% of tests, and manually calibrated SDSM in 25%.
The findings suggest ML methods add value compared to the standard benchmark model.

Abstract

The past 30 years have witnessed a surge in the number of statistical downscaling techniques and applications. However, an absence of standardized approaches across studies has resulted in a bewildering array of methods that likely obstruct the effective use of downscaling in climate risk management. We address these challenges by demonstrating a transparent workflow for benchmarking downscaling methods. This incorporates a protocol for outlining the reference model, calibration criteria, skill diagnostics, and assessment metrics. When downscaling daily rainfall series and extremes in northern Serbia, we find that an automated calibration of our chosen benchmark model (SDSM) generally outperforms manual calibration for skill diagnostics encompassing rainfall occurrence, variability, and extremes. Additionally, we assess the added value of machine learning (ML) methods relative to the same benchmark. Our findings reveal that these advanced techniques perform better than the benchmark when downscaling extreme rainfall, but less well for rainfall occurrence. Overall, the ML downscaling “won” 42% of our diagnostic tests, the automated SDSM 33% of the tests, and the manually calibrated SDSM ranked first for 25% of the tests. This means that the ML methods do add value relative to the benchmark model (here, SDSM). These findings underscore the utility of our workflow, which also enabled us to identify specific avenues for enhancing the tested ML models.
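The "won X% of diagnostic tests" summary in the abstract can be read as a simple tally of which method ranks first on each skill diagnostic. The sketch below is only an illustration of that idea, not the authors' code: the diagnostic names, the method labels, and every score are hypothetical placeholders, and it assumes each diagnostic is an error measure where lower is better.

```python
from collections import Counter


def tally_wins(scores, lower_is_better=True):
    """Return, for each method, the share of diagnostics on which it ranks first.

    `scores` maps diagnostic name -> {method name -> score}.
    Ties go to the method listed first within a diagnostic.
    """
    pick = min if lower_is_better else max
    wins = Counter()
    methods = []
    for by_method in scores.values():
        for method in by_method:
            if method not in methods:
                methods.append(method)
        # The winner is the method with the best score for this diagnostic.
        wins[pick(by_method, key=by_method.get)] += 1
    total = len(scores)
    # Methods that never rank first are reported with a share of 0.0.
    return {method: wins[method] / total for method in methods}


# Hypothetical placeholder error scores (NOT values from the study).
scores = {
    "wet_day_frequency": {"SDSM_manual": 0.10, "SDSM_auto": 0.08, "ML": 0.12},
    "daily_variance": {"SDSM_manual": 0.30, "SDSM_auto": 0.25, "ML": 0.28},
    "annual_max_rainfall": {"SDSM_manual": 0.40, "SDSM_auto": 0.35, "ML": 0.20},
    "p95_wet_day_amount": {"SDSM_manual": 0.22, "SDSM_auto": 0.21, "ML": 0.15},
}

shares = tally_wins(scores)
for method, share in shares.items():
    print(f"{method}: ranked first in {share:.0%} of diagnostics")
```

A first-place tally like this discards how large the margin is on each diagnostic, so it is best read alongside the underlying scores rather than in place of them.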
