= 58.26, p < 0.001): Gemini achieved 87% (95% CI 84-91%), ChatGPT 85% (81-89%), and MedGemma 67% (62-71%). All models reported high confidence, but calibration was modest. Mean confidence differed only slightly between correct and incorrect responses (absolute differences <3%). Brier scores indicated imperfect calibration (Gemini 0.115, ChatGPT 0.137, MedGemma 0.262). ChatGPT demonstrated the strongest confidence-accuracy correlation (r = 0.80, p = 0.005), while Gemini and MedGemma showed weak or nonsignificant alignment. MedGemma exhibited higher uncertainty and lower fidelity across categories. Performance varied by subspecialty, with generalist models outperforming in integrative domains. General-purpose LLMs outperformed a medical-specific model on text-based cardiology assessment, suggesting that large-scale general training may confer advantages in complex clinical reasoning. However, all models showed clinically limited confidence calibration, indicating that self-reported certainty is an unreliable indicator of correctness. Until uncertainty estimation improves, LLM use in cardiology should remain supportive and clinician-supervised.
Zidan et al. (Fri,) studied this question.
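
The calibration measures named in the abstract (Brier score, the gap in mean confidence between correct and incorrect answers, and a confidence-accuracy correlation) can be illustrated with a minimal Python sketch. It assumes each response carries a self-reported confidence in [0, 1] and a binary correctness label. The function names and the sample values are hypothetical and are not taken from the study, and the excerpt does not state how the authors computed their correlation, so the Pearson coefficient here is only one plausible choice.

```python
from math import sqrt


def brier_score(confidences, outcomes):
    """Mean squared difference between confidence (0-1) and correctness (0 or 1).

    Lower is better; 0.0 means perfectly calibrated and perfectly accurate.
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def confidence_gap(confidences, outcomes):
    """Mean confidence on correct answers minus mean confidence on incorrect ones."""
    correct = [c for c, o in zip(confidences, outcomes) if o == 1]
    incorrect = [c for c, o in zip(confidences, outcomes) if o == 0]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect response")
    return sum(correct) / len(correct) - sum(incorrect) / len(incorrect)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical responses: stated confidence and whether the answer was correct.
sample_confidences = [0.9, 0.8, 0.95, 0.6]
sample_outcomes = [1, 1, 0, 1]

print(brier_score(sample_confidences, sample_outcomes))
print(confidence_gap(sample_confidences, sample_outcomes))
```

A model that states high confidence regardless of correctness produces a confidence gap near zero, which matches the pattern the abstract describes (absolute differences below 3%). The correlation helper could be applied, for example, to per-category mean confidence and per-category accuracy, though that grouping is an assumption here.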