Purpose: Venous thromboembolism (VTE) is a major cause of maternal morbidity and mortality, and nursing plays a central role in prevention, patient education, and follow-up. Large language models (LLMs) have attracted increasing attention in healthcare; however, their comparative performance in maternal VTE nursing contexts remains insufficiently explored.

Methods: Five representative LLMs (DeepSeek, GPT-4.1, Claude 3.7, Huatuo, and Kimi) were evaluated across six clinical domains (etiology, diagnosis, treatment, prognostic assessment, home care, prevention) and five performance dimensions (accuracy, comprehensibility, logical coherence, reliability, safety). An expert-informed Delphi framework comprising 41 items guided the evaluation. Three nursing experts independently rated each model's responses, and inter-rater reliability was assessed using Fleiss' kappa.

Results: GPT-4.1, Claude 3.7, and DeepSeek demonstrated superior overall performance, particularly in patient education, individualized care planning, and preventive guidance. Huatuo and Kimi showed limitations in treatment and prognostic reasoning. Inter-rater reliability was excellent (kappa = 0.892).

Conclusion: The findings highlight relative strengths and limitations of different LLMs across nursing-relevant domains in maternal VTE care. While certain models performed better in educational and supportive contexts, the current study does not assess clinical adequacy or readiness for real-world nursing deployment. Future research incorporating patient perspectives and real-world validation is needed to inform the safe and appropriate integration of LLMs into nursing practice.
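For readers unfamiliar with the agreement statistic named in the Methods, the following is a minimal sketch of how Fleiss' kappa is computed from a table of rating counts. The abstract does not state the rating scale or how the ratings were tabulated, so the function name `fleiss_kappa`, the count-matrix layout, and the toy data are illustrative assumptions, not the authors' procedure or data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a matrix where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes every item was rated by the same number of raters.
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data: 4 responses, 3 raters, 3 rating categories.
toy_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 2, 1],
    [0, 0, 3],
]
print(round(fleiss_kappa(toy_counts), 3))
```

Values near 1 indicate agreement well beyond chance, which is the sense in which the reported kappa of 0.892 is described as excellent.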
Li et al. studied this question.