What question did this study set out to answer?

The aim is to develop a framework that accurately assesses the malignancy risk of gastrointestinal lesions using a transformer-based model.

April 4, 2026 | Open Access

GastroMalign: Vision Transformer-Based Framework for Early Detection and Malignancy-Risk Stratification for High-Risk Gastrointestinal Lesions

Key Points

The aim is to develop a framework that accurately assesses the malignancy risk of gastrointestinal lesions using a transformer-based model.
The study was a retrospective development and validation effort using the publicly available GastroVision dataset of 8000 endoscopic images.
The framework integrates a Vision Transformer encoder and a Sequential Feature Learner for modeling disease severity.
Images were divided into training (80%), validation (10%), and test (10%) sets for performance evaluation.
GastroMalign's performance was compared to convolutional neural networks and a Swin Transformer.
Interpretability was assessed with Score-CAM visualizations analyzed by expert endoscopists.
GastroMalign achieved 80.06% accuracy, 79.65% precision, 80.06% recall, and an F1-score of 79.17%.
The micro-averaged AUC was 0.98; by comparison, ResNet-50 and DenseNet-121 reached accuracies of only 32.42% and 36.77%, respectively.
The Swin Transformer had an accuracy of 60.56% (AUC = 0.93).
Removing the progression-aware module led to a 17% absolute reduction in High-Risk lesion recall.
Malignancy risk scores increased monotonically across ordinal classes, with mean scores <0.18 for benign lesions and >0.72 for high-risk lesions.
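
The 80%/10%/10% stratified partition mentioned in the key points can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_split`, the random seed, the four evenly balanced classes, and the rounding rule are all assumptions made for illustration. The source states only the split ratios and the total of 8000 images.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/val/test while preserving class proportions.

    Hypothetical helper: the study does not describe its splitting procedure
    beyond the 80/10/10 ratios.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        # Whatever remains in this class goes to the test set.
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Illustrative labels only: four evenly balanced classes, NOT the real
# GastroVision class distribution.
labels = [i % 4 for i in range(8000)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 8000 images, an 80/10/10 split of this kind yields a test partition of 800 images, which matches the held-out test set size reported in the abstract.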

Abstract

Background: Current artificial intelligence (AI) systems in gastrointestinal (GI) endoscopy primarily emphasize binary detection or static classification, providing limited support for the graded assessment of malignant potential that underpins clinical decision-making. To address this unmet need in endoscopic risk assessment, we developed GastroMalign, a transformer-based framework designed to stratify GI lesions according to ordinal disease severity while maintaining clinical interpretability.

Methods: This retrospective development and validation study used the publicly available GastroVision dataset, comprising 8000 de-identified endoscopic still images from the upper and lower gastrointestinal tract, including the esophagus, stomach, duodenum, colon, rectum, and terminal ileum. GastroMalign integrates a Vision Transformer (ViT) encoder with a Sequential Feature Learner that explicitly models ordinal disease severity along a benign-to-malignant spectrum. The framework produces both a categorical risk classification and a continuous malignancy risk score. Images were stratified into training (80%), validation (10%), and test (10%) sets. Performance was compared with convolutional neural network (CNN) baselines and a Swin Transformer. Interpretability was assessed using Score-CAM visualizations reviewed by blinded expert endoscopists.

Results: On the held-out test set (n = 800 images), GastroMalign achieved an overall accuracy of 80.06%, precision of 79.65%, recall of 80.06%, and F1-score of 79.17%, with a micro-averaged AUC of 0.98. In comparison, ResNet-50 and DenseNet-121 achieved accuracies of 32.42% and 36.77%, respectively, while the Swin Transformer achieved 60.56% accuracy (AUC = 0.93). Ablation analyses demonstrated a 17% absolute reduction in High-Risk lesion recall when the progression-aware module was removed. Continuous malignancy risk scores increased monotonically across ordinal classes, with mean values <0.18 for benign lesions and >0.72 for High-Risk/Malignant lesions. Score-CAM visualizations demonstrated 92% overlap with clinician-annotated lesion regions.

Conclusions: GastroMalign delivers an interpretable, progression-aware AI framework for GI lesion risk stratification that outperforms the CNN- and transformer-based models it was compared against. Clinically, GastroMalign is intended as an adjunct decision-support tool during endoscopic review to standardize lesion risk stratification (benign to malignant spectrum), support management decisions (biopsy vs. resection vs. surveillance), and reduce operator-dependent variability by pairing ordinal risk outputs with interpretable visual explanations.
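
The reported accuracy and recall are identical (80.06%). The abstract does not state which averaging scheme was used for the multi-class metrics, but one reading consistent with that equality is support-weighted averaging, under which recall always equals overall accuracy. The sketch below checks this identity on a made-up confusion matrix. The function name and the numbers are hypothetical and are not taken from the study.

```python
def accuracy_and_weighted_recall(confusion):
    """Return (accuracy, support-weighted recall) for a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total

    weighted_recall = 0.0
    for i, row in enumerate(confusion):
        support = sum(row)  # number of true samples in class i
        if support == 0:
            continue  # a class with no samples contributes nothing
        recall_i = row[i] / support
        weighted_recall += (support / total) * recall_i
    return accuracy, weighted_recall


# Made-up 3-class confusion matrix for illustration; NOT data from the study.
confusion = [
    [50, 5, 5],
    [10, 20, 10],
    [2, 8, 90],
]
acc, rec = accuracy_and_weighted_recall(confusion)
print(f"accuracy={acc:.4f}, weighted recall={rec:.4f}")
```

The two values coincide because each class's recall is multiplied by its share of the samples, so the per-class supports cancel and the sum reduces to correct predictions divided by total predictions. Weighted precision and F1 carry no such identity, which fits their slightly different reported values (79.65% and 79.17%).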
