What question did this study set out to answer?

The study aims to compare ChatGPT-4o and DeepSeek-V3.1 in providing accurate responses about premature ovarian insufficiency based on international guidelines.

May 14, 2026

Comparative evaluation of DeepSeek-V3.1 and ChatGPT-4o on POI assessment and management: An exploratory cross-sectional study with an international consensus guideline.

Key Points

Cross-sectional evaluation of both LLMs' responses to 26 POI-related questions.
Assessment by gynecological experts and nonmedical volunteers using Likert scales and standardized readability formulas.
Comparison of scores for accuracy, completeness, professionalism, and satisfaction.
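DeepSeek-V3.1 had slightly higher median scores than ChatGPT-4o for accuracy, completeness, and professionalism (P < .001, P < .001, and P = .002, respectively).
DeepSeek-V3.1 achieved a slightly higher satisfaction score than ChatGPT-4o (median 6, IQR 6-7 vs. median 6, IQR 6-6; P = .013).
ChatGPT-4o scored higher on readability (P = .04), which may make its responses more acceptable to readers despite the slightly lower satisfaction ratings.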

Abstract

Large language models (LLMs), such as ChatGPT and DeepSeek, are increasingly used in clinical decision support. This study compared the performance of ChatGPT-4o and DeepSeek-V3.1 in answering questions about premature ovarian insufficiency (POI), using an international POI consensus guideline as the reference, in order to assess the accuracy, reliability, and effectiveness of LLMs in disease management. In this cross-sectional study, both models responded to 26 POI-related questions. The questions were formulated by gynecological experts based on the latest guidelines and grouped into 4 themes: diagnosis, long-term health risks, treatment options, and common concerns. Experts and nonmedical volunteers rated the responses for accuracy, completeness, professionalism, or satisfaction using Likert scales, and readability was analyzed using standardized formulas. The median scores of DeepSeek-V3.1 for accuracy, completeness, and professionalism were slightly higher than those of ChatGPT-4o (P < .001, P < .001, and P = .002, respectively), with both models performing at an upper-intermediate level. Nonmedical volunteers gave DeepSeek-V3.1 a slightly higher satisfaction score (median 6, IQR 6-7) than ChatGPT-4o (median 6, IQR 6-6; P = .013). ChatGPT-4o scored higher on readability (P = .04), which may make its responses more acceptable to readers. LLMs can provide faster recommendations for patients seeking medical assistance and alleviate clinical workload to some extent, contributing to healthcare development. In summary, this exploratory study found that DeepSeek-V3.1 yielded more satisfactory results than ChatGPT-4o, though both applications require further optimization for safe integration into clinical decision support.
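
The satisfaction result reports the same median (6) for both models but different interquartile ranges (6-7 vs. 6-6). The sketch below shows how two sets of Likert ratings can share a median while differing in spread. It is an illustration only: the ratings are hypothetical values chosen to reproduce those summary figures, not data from the study, and the function name `summarize` and the use of the "inclusive" quantile method are assumptions, since the abstract does not describe how the quartiles were computed.

```python
import statistics


def summarize(ratings):
    """Return (median, Q1, Q3) for a list of Likert ratings."""
    median = statistics.median(ratings)
    # quantiles(n=4) returns the three quartile cut points; the
    # "inclusive" method treats the ratings as the full population.
    q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return median, q1, q3


# Hypothetical ratings, NOT taken from the study.
model_a = [5, 6, 6, 6, 6, 7, 7, 7]
model_b = [5, 6, 6, 6, 6, 6, 6, 7]

for name, ratings in (("model A", model_a), ("model B", model_b)):
    median, q1, q3 = summarize(ratings)
    print(f"{name}: median {median:g}, IQR {q1:g}-{q3:g}")
```

With ordinal ratings like these, a difference between groups can show up in the distribution rather than in the median, which is why a summary of "6 vs. 6" alone does not convey the reported difference.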
