As renewable energy penetration rises, the inherent intermittency of photovoltaic (PV) generation poses severe challenges to grid stability. Consequently, precise PV system modeling and power forecasting have emerged as critical mitigation technologies. However, existing research predominantly focuses on algorithmic and dispatch optimization and frequently overlooks data quality, which determines the upper bound of model performance. To address data fragmentation and inconsistency in the PV domain, this paper presents a systematic review of publicly available benchmark datasets. We group these resources into three categories: 1) meteorological datasets, classified by data generation mechanism into real, synthetic, mixed, and reanalysis types; 2) PV generation datasets, organized by temporal resolution; and 3) static system datasets, subdivided into plant-level geospatial data and module-level physical parameters. By analyzing the characteristics of these resources, this review maps datasets to specific application scenarios, including machine learning modeling, physical simulation, and optimal dispatch. Ultimately, this work provides researchers with a practical guide and empirical evidence for robust data selection and model construction.
Zhao et al. (Sun,) studied this question.