BACKGROUND The rapid integration of large language models (LLMs) into healthcare is sparking global discussion about their potential to revolutionize healthcare quality and accessibility. At a time when improving healthcare quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical exams has been used to argue in favour of their use in medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading healthcare misinformation has not been evaluated.

OBJECTIVE This study aims to assess the effectiveness of LLMs from the perspective of a general user attempting self-diagnosis, in order to better understand the clarity, accuracy, and robustness of the models.

METHODS We develop a comprehensive testing methodology, based on a medical licensing exam, that evaluates LLM responses to open-ended questions in order to mimic real-world use cases.

RESULTS We find that (a) ChatGPT-4.0 is marked as correct 36% of the time by non-experts and experts, with only 34% agreement between the two groups, and (b) when prompted with sentence dropout on the correct responses from (a), non-experts tend to rate an additional 27% of responses as correct, which indicates an increased risk of spreading medical misinformation.

CONCLUSIONS These results highlight the modest capabilities of LLMs, whose responses are often unclear and inaccurate. We call on the community to develop trustworthy solutions that reduce medical misinformation in LLMs.
Zada et al. studied this question.