What question did this study set out to answer?

This research aims to enhance medical image classification and report generation through a novel multitask learning framework that integrates contrastive learning and natural language generation.

March 13, 2026 · Open Access

SMILES challenge 2025: Multitask learning with contrastive and natural language generation for enhanced medical image classification

Key Points

Developed a multitask learning framework using contrastive learning and natural language generation.
Utilized a Vision Transformer as a visual encoder and a transformer-based text encoder.
Trained the model jointly with image-text contrastive loss and language generation loss.
Evaluated on the MIMIC-CXR and CheXpert datasets to assess disease classification accuracy.
On MIMIC-CXR, accuracy for Atelectasis improved from 17.44% (Plain) to 41.5% (Plain + NLG).
On MIMIC-CXR, accuracy for Cardiomegaly increased from 19.25% to 47.4%.
On CheXpert, accuracy for Atelectasis rose from 12.5% to 58.5%.
On CheXpert, accuracy for Pleural Effusion improved from 61.10% to 64.0%.
Improvements were also noted in F1 scores for complex diseases like Cardiomegaly and Consolidation.
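
To put the reported accuracies side by side, the short sketch below computes the absolute gain in percentage points for each finding. The numbers are the ones quoted in the key points; the dictionary layout and the `absolute_gain` helper are illustrative assumptions, not part of the study.

```python
# (dataset, finding) -> (Plain accuracy %, Plain + NLG accuracy %),
# copied from the key points above. The structure itself is hypothetical.
reported = {
    ("MIMIC-CXR", "Atelectasis"): (17.44, 41.5),
    ("MIMIC-CXR", "Cardiomegaly"): (19.25, 47.4),
    ("CheXpert", "Atelectasis"): (12.5, 58.5),
    ("CheXpert", "Pleural Effusion"): (61.10, 64.0),
}


def absolute_gain(before, after):
    """Return the accuracy gain in percentage points, rounded to 2 decimals."""
    return round(after - before, 2)


gains = {key: absolute_gain(*values) for key, values in reported.items()}

for (dataset, finding), gain in gains.items():
    print(f"{dataset} {finding}: +{gain} percentage points")
```

The gains are uneven: the largest is for Atelectasis on CheXpert and the smallest for Pleural Effusion, where the baseline was already comparatively strong.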

Abstract

This article proposes a novel multitask learning framework that integrates contrastive learning and natural language generation (NLG) to enhance medical image classification and report generation. The goal is to improve disease classification accuracy and interpretability in medical diagnostics. The model architecture consists of a Vision Transformer (ViT) as a visual encoder, a transformer-based text encoder, and a multimodal decoder. The visual encoder processes medical images, while the text encoder handles disease-related text prompts. These components are trained jointly using an image-text contrastive loss and a language generation loss. Evaluations on the MIMIC-CXR and CheXpert datasets show that the model with NLG (Plain + NLG) outperforms the baseline contrastive learning model (Plain) in disease classification. For example, on MIMIC-CXR, accuracy for Atelectasis increased from 17.44% (Plain) to 41.5% (Plain + NLG), and accuracy for Cardiomegaly improved from 19.25% to 47.4%. On CheXpert, accuracy for Atelectasis increased from 12.5% to 58.5%, and accuracy for Pleural Effusion from 61.10% to 64.0%. The model also showed improved F1 scores, particularly for complex diseases such as Cardiomegaly and Consolidation. The proposed multitask framework effectively combines contrastive learning with NLG, leading to improved disease classification and medical report generation. This approach has potential clinical applications by enhancing AI’s interpretability and accuracy in medical decision-making.
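
The joint objective described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes a symmetric InfoNCE-style contrastive loss over paired image and text embeddings, a token-level cross-entropy for generation, and a weighted sum of the two. The abstract does not state how the losses are combined, so the weights `w_con` and `w_gen`, the `temperature` value, and all function names here are assumptions for illustration, not the authors' implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss; pair k of images matches pair k of texts."""
    imgs = [normalize(v) for v in img_emb]
    txts = [normalize(v) for v in txt_emb]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    n = len(logits)
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same idea over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def generation_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the target tokens."""
    nll = [-log_softmax(scores)[t] for scores, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)


def joint_loss(img_emb, txt_emb, token_logits, target_ids, w_con=1.0, w_gen=1.0):
    """Weighted sum of both losses (the weighting scheme is an assumption)."""
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_gen * generation_loss(token_logits, target_ids))
```

In a real training loop the embeddings would come from the ViT and the text encoder, and the token logits from the multimodal decoder. Here they are plain lists so the arithmetic is easy to follow.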

Cite This Study

Vavekanand et al. studied this question.

synapsesocial.com/papers/69b3ab3c02a1e69014ccbf73 https://doi.org/10.1007/s11760-026-05214-8
