OBJECTIVE: To evaluate and compare the performance of large language models (LLMs) in identifying contributing factors (CFs) underlying patient safety incident investigations. MATERIALS AND METHODS: Four open-source, lightweight LLMs (BERT, LLaMA2, GPT2, and Phi-2) were applied to classify CFs across 6 sociotechnical system-levels encompassing 12 categories (eg, person, task, and organizational factors). Reports of real-world patient safety investigations from public health systems were extracted and labelled by domain experts (n reports/CFs = 300/1338). The labelled CFs were split into training (n = 852), validation (n = 98), and test sets (n = 388). Performance was evaluated using specificity, precision, recall, and F1 scores. RESULTS: The fine-tuned encoder-based BERT model achieved the highest performance, with a micro-averaged F1 score of 63.6%, outperforming all decoder-based models. Among the decoder models, Phi-2 demonstrated the strongest performance (F1 = 54.9%), exceeding both LLaMA2 and GPT2. BERT performed consistently across the 6 system-levels but often misclassified "organization" as "person". DISCUSSION: LLMs hold promise for automating the extraction of CFs from complex safety narratives, particularly for frequently reported system-levels such as "person" and "tasks". Such automation may substantially reduce the manual effort required to analyse reports of patient safety investigations while supporting more consistent analysis across large incident datasets. CONCLUSION: Applying LLMs to analyse the underlying causes of patient safety incidents depends on developing high-quality, domain-specific datasets that enhance the representation of patient safety knowledge and improve model understanding of incident causation. Improving data coverage for rare system-levels is essential to address the current limitations of LLMs in capturing nuanced patient safety concepts and domain-specific reasoning.
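The evaluation metrics named in the abstract (specificity, precision, recall, and micro-averaged F1) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `micro_metrics`, the single-label setting (one system-level per CF), and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter


def micro_metrics(y_true, y_pred, labels):
    """Micro-averaged metrics for single-label classification (assumed setting)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this label
        else:
            fp[pred] += 1    # predicted label was wrong
            fn[truth] += 1   # true label was missed

    n = len(y_true)
    # Pool the counts over all labels before computing ratios (micro-averaging).
    TP = sum(tp[label] for label in labels)
    FP = sum(fp[label] for label in labels)
    FN = sum(fn[label] for label in labels)
    # One-vs-rest true negatives for a label: everything that is not TP, FP, or FN.
    TN = sum(n - tp[label] - fp[label] - fn[label] for label in labels)

    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = TN / (TN + FP) if TN + FP else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}


# Hypothetical toy labels, not data from the study.
labels = ["person", "task", "organization"]
y_true = ["person", "task", "organization", "person"]
y_pred = ["person", "task", "person", "person"]
print(micro_metrics(y_true, y_pred, labels))
```

Under the single-label assumption, micro-averaged precision, recall, and F1 all coincide with accuracy, so specificity is the metric that adds separate information here. If the study's task were multi-label, the pooled counts would need to be computed per label decision instead.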
Wang et al. (Wed,) studied this question.