Abstract Histopathological examination of whole slide images (WSIs) serves as the gold standard for cancer diagnosis, and clinical reports play a critical role in decision-making. However, the time-consuming nature of conventional pathological examination has driven an increasingly urgent demand for automated report generation. Deep learning methods have the potential to meet this demand through Histopathology Report Generation (HRG). Nevertheless, existing HRG approaches produce low-quality reports because of ineffective exploration of multi-scale visual context in gigapixel WSIs and the inherent semantic gap between the heterogeneous vision and language modalities. To address these challenges, we propose HC-Gen, a novel framework that combines hierarchical context modeling with prototype-mediated cross-modal alignment for HRG. Inspired by pathologists’ anatomically grounded diagnostic logic, we design a hierarchical context fusion module to integrate multi-scale visual-semantic context and the implicit hierarchy prior in WSIs. Furthermore, we propose a cross-modal prototypical memory module that establishes learnable semantic prototypes as intermediate bridges to achieve unified and efficient vision-language alignment. Model performance was assessed with natural language generation metrics and human evaluation; extensive experiments on two benchmark datasets demonstrate that HC-Gen outperforms state-of-the-art methods. Additional visualizations support the interpretability of the decision process. Our code is available at https://github.com/Modaoshuangming/HC-Gen.
Ye et al. studied this question.