Large language models (LLMs), such as ChatGPT and DeepSeek, are increasingly used in clinical decision support. This study compared the performance of ChatGPT-4o and DeepSeek-V3.1 in answering questions about premature ovarian insufficiency (POI) based on the international POI guideline consensus, in order to assess the accuracy, reliability, and effectiveness of LLMs in disease management. This cross-sectional study evaluated the responses of ChatGPT-4o and DeepSeek-V3.1 to 26 POI-related questions. Gynecological experts formulated the questions from the latest guidelines and categorized them into 4 themes: diagnosis, long-term health risks, treatment options, and common concerns. Experts and nonmedical volunteers assessed the responses for accuracy, completeness, professionalism, and satisfaction using Likert scales, while readability was analyzed using standardized formulas. The median scores of DeepSeek-V3.1 for accuracy, completeness, and professionalism were slightly higher than those of ChatGPT-4o (P < .001, P < .001, and P = .002, respectively), with both models performing at an upper-intermediate level. Nonmedical volunteers gave DeepSeek-V3.1 a slightly higher satisfaction score (median 6, IQR 6-7) than ChatGPT-4o (median 6, IQR 6-6; P = .013). ChatGPT-4o scored higher on readability (P = .04), making its responses more acceptable to readers. LLMs can provide faster recommendations for patients seeking medical assistance and may alleviate clinical workload to some extent, contributing to healthcare development. In summary, this exploratory study found that DeepSeek-V3.1 yielded more satisfactory results than ChatGPT-4o, though both applications require further optimization before they can be safely integrated into clinical decision support.
Chen et al. (Fri,) studied this question.