Structured radiology reporting can improve clinical decision support by standardizing clinical findings into hierarchical formats. However, answering the thousands of questions about clinical findings in structured report templates is prohibitively time-consuming, which can limit clinical adoption. Early medical VQA datasets focused primarily on free-text and independent question–answer pairs. A recent dataset, Rad-ReStruct, introduced hierarchical VQA, but its accompanying model still relies heavily on flattened embedding representations and single-path text–image fusion, which inadequately handle the complex hierarchical dependencies among responses. In this paper, we propose DPA-HiVQA (Dual-Path Cross-Attention for Hierarchical VQA), which addresses these limitations through two key contributions: (1) a multi-scale image embedding that combines global semantic embeddings with patch-level spatial features from the domain-specific BioViL encoder; and (2) a dual-path cross-attention mechanism that enables simultaneous holistic semantic understanding and fine-grained spatial reasoning. Evaluated on the Rad-ReStruct benchmark, the model substantially outperforms the established baseline, improving the overall F1-score and the Level 3 F1-score by 21.2% and 31.9%, respectively. These results demonstrate that dual-path cross-attention architectures can effectively connect holistic semantic understanding with fine-grained spatial detail, paving the way for practical AI-assisted structured reporting systems that reduce radiologist burden while maintaining diagnostic accuracy.
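To make the dual-path idea concrete, the following is a minimal NumPy sketch of how question tokens might attend separately to a global image embedding and to patch-level features before the two results are fused. It is an illustration under stated assumptions, not the DPA-HiVQA implementation: the function names (`cross_attention`, `dual_path_fusion`, `init_params`), the single-head attention, the concatenate-then-project fusion, and the random weights are all hypothetical, and the BioViL encoder is replaced by random arrays of matching shape.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries:     (n_q, dim) text-side tokens
    keys_values: (n_kv, dim) image-side tokens
    Returns the attended context (n_q, dim) and the attention weights (n_q, n_kv).
    """
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def init_params(dim, rng):
    """Random projection weights (hypothetical; a real model would learn these)."""
    def proj():
        return rng.standard_normal((dim, dim)) / np.sqrt(dim)

    return {
        "global": (proj(), proj(), proj()),
        "patch": (proj(), proj(), proj()),
        "w_out": rng.standard_normal((2 * dim, dim)) / np.sqrt(2 * dim),
    }


def dual_path_fusion(text_tokens, global_embedding, patch_embeddings, params):
    """Fuse text tokens with image features through two attention paths.

    Path 1 attends to the single global embedding (holistic semantics).
    With only one key, its attention weight is trivially 1, so this path
    injects the projected global embedding into every text token.
    Path 2 attends over the patch embeddings (fine-grained spatial detail).
    The two contexts are concatenated and linearly projected (assumed fusion).
    """
    global_ctx, _ = cross_attention(
        text_tokens, global_embedding[None, :], *params["global"]
    )
    patch_ctx, patch_weights = cross_attention(
        text_tokens, patch_embeddings, *params["patch"]
    )
    fused = np.concatenate([global_ctx, patch_ctx], axis=-1) @ params["w_out"]
    return fused, patch_weights


# Usage with stand-in features: 5 question tokens, 16 image patches, dim 8.
rng = np.random.default_rng(0)
dim = 8
params = init_params(dim, rng)
text_tokens = rng.standard_normal((5, dim))
global_embedding = rng.standard_normal(dim)
patch_embeddings = rng.standard_normal((16, dim))
fused, patch_weights = dual_path_fusion(
    text_tokens, global_embedding, patch_embeddings, params
)
```

In this sketch `fused` has one row per question token, and each row of `patch_weights` is a distribution over image patches that could be inspected to see which regions a token attends to. Keeping the two paths separate until the final projection is one simple way to let global and local evidence contribute independently; other fusion schemes (gating, summation, stacked layers) would be equally plausible.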
Do et al. (Fri,) studied this question.