Introduction: Medical image captioning bridges visual perception and clinical language, but its development is limited by the high cost of detailed anatomical annotation and by the risk of hallucinations or overconfidence in ambiguous endoscopic images.

Methods: We propose ACE-Net, an Anatomy Collaborative Evidence Network for semi-supervised medical image captioning. ACE-Net integrates evidential deep learning into the visual encoding stage through an evidence-driven soft-gating mechanism that quantifies epistemic uncertainty and suppresses unreliable visual noise. A triple-guided Mixture-of-Experts decoder further organizes clinical reasoning into semantic anchoring, visual evidencing, and spatial calibration. Spatial consistency alignment is imposed within a teacher-student co-training framework to promote stable anatomical attention patterns without pixel-level supervision.

Results: On a high-resolution otolaryngology endoscopy dataset, ACE-Net achieved a BLEU-4 score of 0.7511 and a ROUGE-L score of 0.8728, demonstrating strong text-generation performance and improved anatomical grounding under limited annotation.

Discussion: These results suggest that effective anatomical localization can be induced through evidence-constrained global supervision rather than expensive pixel-level annotations, providing a data-efficient and reliable paradigm for medical image captioning.
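
The abstract does not give the formula behind the evidence-driven soft gate, so the following is only a minimal sketch of how such a gate could work. It assumes the common Dirichlet-based formulation of evidential deep learning: non-negative evidence from a softplus, alpha = evidence + 1, and uncertainty u = K / sum(alpha). The gate value 1 - u and the function names (`evidential_uncertainty`, `soft_gate`) are hypothetical choices for illustration, not ACE-Net's actual implementation.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable softplus, log(1 + exp(x)), used to keep evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def evidential_uncertainty(logits):
    """Map raw scores to belief masses and an uncertainty mass.

    Assumed formulation (standard in evidential deep learning, not confirmed
    for ACE-Net): alpha_k = evidence_k + 1, S = sum(alpha), u = K / S.
    The belief masses and u always sum to 1.
    """
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)              # Dirichlet strength S
    belief = [e / strength for e in evidence]
    u = len(logits) / strength         # high when total evidence is low
    return belief, u


def soft_gate(features, logits):
    """Scale a visual feature vector by (1 - u).

    Hypothetical gate: features backed by little evidence are attenuated
    toward zero instead of being hard-masked.
    """
    _, u = evidential_uncertainty(logits)
    gate = 1.0 - u
    return [gate * f for f in features], gate
```

Under these assumptions, a region whose scores carry almost no evidence receives a gate close to 0, while a region with strong evidence for one class keeps most of its feature magnitude. A soft gate of this kind keeps the operation differentiable, which is one plausible reason to prefer it over a hard threshold.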
Zhou et al. studied this question.