December 16, 2020

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Abstract

Joint image-text embeddings extracted from medical images and their associated contextual reports are the bedrock of most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models (LXMERT, VisualBERT, UNITER, and PixelBERT) to learn multimodal representations from MIMIC-CXR images and associated reports. External evaluation on the OpenI dataset shows that the joint embeddings learned by the pre-trained V+L models improve performance on thoracic finding classification tasks by 1.4% compared to a pioneering CNN+RNN model. Ablation studies further analyze the contribution of certain model components and validate the advantage of joint embeddings over text-only embeddings. Attention maps are also visualized to illustrate the attention mechanism of the V+L models.
