ABSTRACT Power system load forecasting is essential for modern power grids because it directly influences operational efficiency, resource scheduling, and energy management. Traditional forecasting approaches, which rely on statistical analysis and handcrafted mathematical models, often struggle to capture the nonlinear, high‐dimensional, and dynamically evolving patterns in real‐world load data. To address these limitations, this study proposes a forecasting framework built on a large language model (LLM) and enhanced by a dynamic knowledge distillation mechanism. The framework first uses a cross‐attention–based feature fusion module to integrate historical load data with auxiliary contextual variables. A pretrained GPT‐2 model is then fine‐tuned to extract temporal dependencies and serves as the teacher network. To reduce computational cost and improve deployability, a dynamic knowledge distillation strategy guides a lightweight student transformer model during training. Experimental results show that the proposed method achieves higher forecasting accuracy than representative state‐of‐the‐art models, while the distilled student network significantly reduces computational load, making the approach suitable for practical real‐time applications.
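The abstract does not state how the distillation is made "dynamic", so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a regression setting in which the student is trained on a weighted sum of two mean-squared-error terms (one against the ground-truth load, one against the teacher's forecast), and it assumes a hypothetical linear schedule that shifts weight from the teacher term to the ground-truth term over training. The names `distill_weight` and `distillation_loss` and the default values `start=0.9` and `end=0.1` are illustrative.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def distill_weight(epoch, total_epochs, start=0.9, end=0.1):
    """Hypothetical schedule (an assumption, not from the paper):
    the weight on the teacher term moves linearly from `start` to `end`."""
    if total_epochs <= 1:
        return end
    frac = epoch / (total_epochs - 1)
    return start + (end - start) * frac


def distillation_loss(student_pred, teacher_pred, target, weight):
    """Weighted sum of a teacher-matching term and a ground-truth term."""
    soft = mse(student_pred, teacher_pred)  # student vs. teacher forecast
    hard = mse(student_pred, target)        # student vs. observed load
    return weight * soft + (1.0 - weight) * hard


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the study.
    target = [10.0, 12.0, 11.0]
    teacher = [10.2, 11.8, 11.1]
    student = [9.5, 12.5, 10.0]
    for epoch in (0, 5, 9):
        w = distill_weight(epoch, total_epochs=10)
        print(epoch, round(w, 3), distillation_loss(student, teacher, target, w))
```

Under this assumed schedule, early epochs lean on the teacher's forecasts and later epochs lean on the observed load; other choices, such as weighting by the teacher's per-sample error, would fit the same loss function.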
Zong et al. studied this question.