This multi-source dataset was compiled to support research on anomaly-based leak detection in urban water distribution networks (WDNs). It contains one year of hourly data collected from a Slovak water utility, combining supervisory control and data acquisition (SCADA) measurements (flow and pressure), energy consumption variables (kilowatts, kW; kilovolt-ampere reactive, kvar), and environmental indicators such as groundwater level, temperature, and humidity. All features were transformed into standardized anomaly scores on a 0–100 scale using Elastic Machine Learning and Isolation Forest methods. Confirmed leak records from the utility's operational information system were mapped to binary labels using a ±7-day temporal window around each recorded leak. Feature selection retained 18 variables based on their statistical association with the leak labels, measured with the Goodman–Kruskal gamma coefficient. The dataset can be used for benchmarking anomaly detection and prediction models, evaluating lead-time sensitivity, and developing data-driven early-warning systems. It is also suitable for studies on spatially segmented WDN analysis and is publicly available to support reproducible research.
Bábela et al. studied this question.
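The labeling and feature-screening steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `label_hours` and `goodman_kruskal_gamma`, the toy timestamps and scores, the inclusive treatment of the window boundary, and the choice to ignore tied pairs are all assumptions made for illustration. The gamma coefficient is computed in its standard form, (C − D) / (C + D), where C and D are the numbers of concordant and discordant pairs.

```python
from datetime import datetime, timedelta
from itertools import combinations


def label_hours(timestamps, leak_times, window_days=7):
    """Return 1 for each timestamp within +/- window_days of any leak, else 0.

    Assumption: the window boundary is inclusive.
    """
    window = timedelta(days=window_days)
    return [
        int(any(abs(t - leak) <= window for leak in leak_times))
        for t in timestamps
    ]


def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D), with tied pairs ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    # With no untied pairs the coefficient is undefined; 0.0 is returned
    # here as a convention of this sketch.
    return (concordant - discordant) / total if total else 0.0


# Toy data (hypothetical): six observations spaced five days apart,
# one confirmed leak, and made-up anomaly scores on the 0-100 scale.
timestamps = [datetime(2024, 1, d) for d in (1, 6, 11, 16, 21, 26)]
leak_times = [datetime(2024, 1, 15)]
scores = [5, 10, 60, 80, 15, 20]

labels = label_hours(timestamps, leak_times)
gamma = goodman_kruskal_gamma(scores, labels)
print(labels, round(gamma, 3))
```

In this toy case, eight of the nine untied pairs are concordant, so gamma is 7/9. The pairwise loop is quadratic in the number of observations, which is fine for a sketch but would likely need a contingency-table formulation for a full year of hourly data.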