December 16, 2020

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports


Abstract

Joint image-text embeddings extracted from medical images and their associated reports are the bedrock of most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models (LXMERT, VisualBERT, UNITER, and PixelBERT) to learn multimodal representations from MIMIC-CXR images and associated reports. External evaluation on the OpenI dataset shows that the joint embedding learned by the pre-trained V+L models improves performance on thoracic finding classification tasks by 1.4% compared to a pioneering CNN+RNN model. Ablation studies further analyze the contribution of certain model components and validate the advantage of joint embedding over text-only embedding. Attention maps are also visualized to illustrate the attention mechanism of the V+L models.


Cite This Study

Li et al. (2020) studied this question.

synapsesocial.com/papers/6a192533c05413006f57eda9
https://doi.org/10.1109/bibm49941.2020.9313289
