Abstract
No large language models (LLMs) have yet been evaluated for their ability to interpret image-based reports. Pure-tone audiograms, the gold standard for hearing loss assessment, are technical and often incomprehensible to patients without specialist interpretation. We conducted a blinded, multicenter evaluation of eight LLMs across diagnostic, interpretive, and recommendation tasks using 140 audiogram reports, with outputs assessed by clinicians and lay reviewers. DeepSeek-V3 achieved the highest diagnostic accuracy (severity: 67.00%; type: 54.00%), while R1 produced the text most suitable for a general readership (Flesch-Kincaid Grade Level, FKGL: 6.41). Lay reviewers perceived significant benefits from all models in comprehension and emotional support, with Gemini 2.0 Flash/Thinking scoring higher. Challenges remain in understanding pathological mechanisms and controlling hallucinations. Although current general-purpose LLMs cannot replace the diagnostic capabilities of physicians, they may serve as effective auxiliary tools for translating specialized audiogram data into structured, patient-accessible interpretations, with particular relevance for populations with limited access to hearing-care services.
Li et al. (Sun,) studied this question.