What question did this study set out to answer?

This research aims to evaluate the performance of different large language models in supporting nursing care for maternal venous thromboembolism.

April 28, 2026 · Open Access

Evaluation of large language models for nursing support in maternal venous thromboembolism care

Key Points

Evaluated five large language models across six clinical domains relevant to maternal VTE care.
Used a Delphi framework comprising 41 items for evaluation, rated by three nursing experts.
Assessed inter-rater reliability using Fleiss’s Kappa.
GPT-4.1, Claude 3.7, and DeepSeek showed superior performance in patient education and individualized care planning.
Huatuo and Kimi had limitations, particularly in treatment and prognostic reasoning.
Inter-rater reliability was excellent, with a Kappa score of 0.892.

Abstract

Purpose: Venous thromboembolism (VTE) is a major cause of maternal morbidity and mortality, and nursing plays a central role in prevention, patient education, and follow-up. Large language models (LLMs) have attracted increasing attention in healthcare; however, their comparative performance in maternal VTE nursing contexts remains insufficiently explored.

Methods: Five representative LLMs (DeepSeek, GPT-4.1, Claude 3.7, Huatuo, and Kimi) were evaluated across six clinical domains (etiology, diagnosis, treatment, prognostic assessment, home care, prevention) and five performance dimensions (accuracy, comprehensibility, logical coherence, reliability, safety). An expert-informed Delphi framework comprising 41 items guided the evaluation. Three nursing experts independently rated each model’s responses, and inter-rater reliability was assessed using Fleiss’s Kappa.

Results: GPT-4.1, Claude 3.7, and DeepSeek demonstrated superior overall performance, particularly in patient education, individualized care planning, and preventive guidance. Huatuo and Kimi showed limitations in treatment and prognostic reasoning. Inter-rater reliability was excellent (Kappa = 0.892).

Conclusion: The findings highlight relative strengths and limitations of different LLMs across nursing-relevant domains in maternal VTE care. While certain models performed better in educational and supportive contexts, the current study does not assess clinical adequacy or readiness for real-world nursing deployment. Future research incorporating patient perspectives and real-world validation is needed to inform the safe and appropriate integration of LLMs into nursing practice.
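For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of how Fleiss's Kappa can be computed for a fixed number of raters assigning categorical scores. It is not the authors' analysis code. The function name `fleiss_kappa`, the 1-to-5 rating scale, and the toy ratings are all hypothetical; the abstract does not report the scale or the raw ratings.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss's Kappa for items each rated by the same number of raters.

    ratings: list of per-item lists of category labels (one label per rater).
    categories: list of all possible category labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")

    # counts[i][j] = number of raters who assigned item i to category j
    counts = []
    for item in ratings:
        if len(item) != n_raters:
            raise ValueError("every item needs the same number of ratings")
        tally = Counter(item)
        counts.append([tally.get(cat, 0) for cat in categories])

    # Observed agreement: mean over items of the share of agreeing rater pairs
    per_item = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 4 items, 3 raters, assumed 1-5 scale (not study data)
toy_ratings = [[5, 5, 5], [4, 4, 5], [3, 3, 3], [5, 4, 4]]
kappa = fleiss_kappa(toy_ratings, categories=[1, 2, 3, 4, 5])
print(f"Fleiss's Kappa on toy data: {kappa:.3f}")
```

The statistic compares observed pairwise agreement with the agreement expected by chance from the overall category proportions, so a value near 1 (such as the reported 0.892) indicates that the three experts rated responses almost identically.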
