What question did this study set out to answer?

To benchmark the effectiveness of large language models in classifying wind turbine maintenance logs for improved operational analysis.

March 10, 2026 | Open Access

A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

Key Points

Benchmarked the effectiveness of large language models in classifying wind turbine maintenance logs for improved operational analysis
Developed a reproducible framework for evaluating large language models
Systematically assessed proprietary and open-source models
Quantified performance in terms of reliability, operational efficiency, and model calibration
Identified consensus and performance hierarchy based on classification accuracy
No model achieved perfect accuracy, highlighting the role of human oversight
Models showed higher accuracy in identifying objective components than interpretive actions
Established a performance hierarchy indicating varying model reliability and confidence scores

Abstract

Effective operation and maintenance (O&M) is critical to reducing the levelised cost of energy (LCOE) from wind power, yet the unstructured, free‐text nature of turbine maintenance logs presents a significant barrier to automated analysis. We address this by presenting a novel, reproducible framework for benchmarking large language models (LLMs) on classifying these complex industrial records, which is publicly available as an open‐source tool. We demonstrate the framework's utility by systematically evaluating a diverse suite of state‐of‐the‐art proprietary and open‐source LLMs, providing a practical case study for the foundational assessment of their trade‐offs in reliability, operational efficiency and model calibration. Benchmarked against a qualitatively selected high‐performing reasoning model, our study demonstrates how the tool quantifies a clear performance hierarchy to help analysts identify top models exhibiting high alignment and trustworthy confidence scores. Furthermore, we show that classification performance depends heavily on semantic ambiguity, with all models displaying higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and calibration varies dramatically, we conclude the most responsible application is a human‐in‐the‐loop system where LLMs act as assistants to accelerate and standardise data labelling, enhancing O&M data quality and downstream reliability analysis.
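The two quantities the abstract emphasises, alignment with a reference model and calibration of confidence scores, can be illustrated with a minimal Python sketch. This is not the study's framework: the function names, the example labels, and the choice of expected calibration error (ECE) with equal-width bins are all assumptions made here for illustration, since the abstract does not specify which metrics or label sets were used.

```python
from typing import Sequence


def agreement_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of logs where a model's label matches the reference model's label."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (an assumed metric, not confirmed by the source).

    Each prediction is placed in a confidence bin; the ECE is the
    size-weighted mean gap between average confidence and accuracy per bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical component labels for four maintenance logs.
reference_labels = ["gearbox", "blade", "blade", "blade"]
model_labels = ["gearbox", "blade", "generator", "blade"]
model_confidences = [0.95, 0.95, 0.95, 0.95]

is_correct = [p == r for p, r in zip(model_labels, reference_labels)]
print("agreement:", agreement_rate(model_labels, reference_labels))
print("ECE:", expected_calibration_error(model_confidences, is_correct))
```

A model that agrees often with the reference but reports confidence far above its agreement rate would score well on the first measure and poorly on the second, which is the kind of trade-off the abstract says motivates human review.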
