Background: Large language models (LLMs) are increasingly used in healthcare, but standardized benchmarks fail to capture their validity and safety in real-world scenarios. Evaluating their quality and reliability is critical for safe integration into practice.
Methods: Four fictitious clinical vignettes (orthopedics, pediatrics, gynecology, psychiatry) were developed by independent specialists and tested in four conversational agents: ArkangelAI, OpenEvidence, ChatGPT, and Medisearch. Each vignette included four questions (diagnosis, management, research, and general knowledge). Four external clinicians evaluated the responses on eight criteria using a Likert scale: 1-2 = dissatisfaction, 3 = neutral, 4-5 = satisfaction, 6 = not applicable. The criteria were correctness, consensus, bias, standard of care, updated information, patient safety, real sources in references, and context-awareness. Response times were summarized as medians and interquartile ranges (IQR). Results were reported as frequencies, and hypothesis tests were applied (alpha = 0.05).
Results: We assessed 128 question-answer (Q&A) pairs (1024 evaluations). ArkangelAI-Deep had the highest satisfaction (92.9%), followed by OpenEvidence (83.6%), ChatGPT-Deep (80.5%), and Medisearch (71.1%). Dissatisfaction was highest on the real-sources-in-references criterion: GPT-Personalized 75%, GPT-Regular 97%. Conversely, ArkangelAI-Deep, ChatGPT-Deep, and OpenEvidence achieved 100% satisfaction on this criterion. All agents performed well in correctness and agreement with the consensus. ChatGPT scored lowest on non-biased answers. GPT-Personalized was rated the safest for patients, followed by ArkangelAI-Deep. By specialty, gynecology scored the highest and pediatrics the lowest. Response times varied widely: Medisearch was fastest (18 s), while GPT-Deep (13 min) and ArkangelAI-Deep (7.4 min) were slowest, showing a trade-off between depth and usability.
Conclusions: Conversational agents differed markedly in performance, safety, and stability. ArkangelAI-Deep and OpenEvidence consistently outperformed the others, while Medisearch and GPT-Regular had significant limitations. These results underscore the need for standardized frameworks to ensure safe use of LLMs in healthcare.
Castano-Villegas et al. studied this question.
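The scoring scheme described in the Methods can be illustrated with a minimal Python sketch. It maps Likert ratings to the three satisfaction categories, reports them as percentages, and summarizes response times as a median with IQR. The function names (`categorize`, `summarize`, `median_iqr`) and the sample data are hypothetical, and excluding "not applicable" ratings from the denominator is an assumption; the abstract does not say how such ratings were handled.

```python
from collections import Counter
from statistics import median, quantiles

# Counts stated in the abstract: 128 Q&A pairs rated on 8 criteria.
N_PAIRS, N_CRITERIA = 128, 8
assert N_PAIRS * N_CRITERIA == 1024  # matches the reported 1024 evaluations


def categorize(score: int) -> str:
    """Map a Likert rating to the category defined in the Methods."""
    if score in (1, 2):
        return "dissatisfaction"
    if score == 3:
        return "neutral"
    if score in (4, 5):
        return "satisfaction"
    if score == 6:
        return "not_applicable"
    raise ValueError(f"rating out of range: {score}")


def summarize(scores: list[int]) -> dict[str, float]:
    """Return the percentage of ratings in each category.

    Assumption: 'not applicable' ratings are dropped from the denominator.
    """
    counts = Counter(categorize(s) for s in scores)
    applicable = len(scores) - counts["not_applicable"]
    if applicable == 0:
        return {}
    categories = ("satisfaction", "neutral", "dissatisfaction")
    return {c: 100 * counts[c] / applicable for c in categories}


def median_iqr(times_s: list[float]) -> tuple[float, tuple[float, float]]:
    """Return the median and (Q1, Q3) of response times in seconds."""
    q1, _, q3 = quantiles(times_s, n=4)  # three quartile cut points
    return median(times_s), (q1, q3)


if __name__ == "__main__":
    # Illustrative ratings only, not data from the study.
    print(summarize([5, 4, 3, 1, 6]))
    print(median_iqr([10, 20, 30, 40, 50]))
```

Note that `statistics.quantiles` uses the "exclusive" method by default, so the quartiles may differ slightly from those produced by other statistical packages.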