Large-scale medical visual question answering (MedVQA) datasets are critical for training and deploying vision–language models (VLMs) in radiology. Ideally, such datasets should be automatically constructed from routine radiology reports and their corresponding images. However, no existing method directly links free-text findings to the most relevant 2D slices in volumetric computed tomography (CT) scans. To address this gap, a contrastive language–image pre-training (CLIP)-based key slice selection framework is proposed, which matches each sentence to its most informative CT slice via text–image similarity. This experiment demonstrates that models pre-trained in the medical domain already achieve competitive slice retrieval accuracy and that fine-tuning them on a small dual-supervised dataset that imparts both lesion- and organ-level awareness yields further gains. In particular, the best-performing model (fine-tuned BiomedCLIP) achieved a Top-1 accuracy of 51.7% for lesion-aware slice retrieval, representing a 20-point improvement over baseline CLIP, and was accepted by radiologists in 56.3% of cases. By automating the report-to-slice alignment, the proposed method facilitates scalable, clinically realistic construction of MedVQA resources.
Yamamoto et al. (Fri,) studied this question.