Abstract

Background: Understanding the tumor microenvironment requires models that resolve cellular heterogeneity across molecular and spatial modalities. With the expansion of spatial transcriptomics, single-cell RNA-seq, and high-resolution histopathology imaging, there is a need for a unified foundation model that jointly interprets gene expression, spatial context, and visual tissue features. We developed a multimodal large language model (LLM) that integrates these modalities into a single adaptive framework handling heterogeneous inputs (gene expression profiles, spatial transcriptomics spots, single-cell measurements, and histology patches) while generating harmonized outputs such as genes, cell types, and image-derived descriptors.

Method: We built a multimodal LLM within a Vision-Gene-Language (VGL) framework that integrates gene expression, histology images, and biological language representations. The model is based on MedGemma-4b-it and was fine-tuned using QLoRA for parameter-efficient training. Training used 5.2 million multimodal samples of H

Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2754.
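The Method section names QLoRA as the parameter-efficient fine-tuning approach. As a rough illustration of the general idea only (not the authors' implementation), the sketch below freezes a 4-bit-quantized weight matrix and trains a small low-rank update on top of it. Everything here is an assumption for illustration: the names `LoRALinear`, `quantize_4bit`, `dequantize`, and `matvec` are hypothetical, the rank and alpha values are arbitrary, and a simple uniform 4-bit quantizer stands in for the NF4 data type that QLoRA actually uses.

```python
import random


def quantize_4bit(weights):
    """Uniform symmetric 4-bit quantization (illustrative stand-in for NF4)."""
    flat = [w for row in weights for w in row]
    # Map the largest magnitude to integer level 7; fall back to 1.0 if all zero.
    scale = max(abs(w) for w in flat) / 7 or 1.0
    codes = [[max(-7, min(7, round(w / scale))) for w in row] for row in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from integer codes."""
    return [[c * scale for c in row] for row in codes]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen quantized base weight plus a trainable low-rank update B @ A."""

    def __init__(self, weights, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.out_dim, self.in_dim = len(weights), len(weights[0])
        self.rank = rank
        self.scaling = alpha / rank
        # Base weights are stored as 4-bit codes and never updated.
        self.codes, self.scale = quantize_4bit(weights)
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the base model's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.out_dim)]

    def forward(self, x):
        base = matvec(dequantize(self.codes, self.scale), x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * u for b, u in zip(base, update)]

    def trainable_params(self):
        return self.rank * (self.in_dim + self.out_dim)

    def frozen_params(self):
        return self.in_dim * self.out_dim


# Hypothetical 64x64 layer with a rank-4 adapter.
rng = random.Random(1)
dense = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(64)]
demo = LoRALinear(dense, rank=4, alpha=8.0)
print(demo.trainable_params(), "trainable vs", demo.frozen_params(), "frozen")
```

In this toy setting only the entries of `A` and `B` would receive gradient updates, which is why the trainable parameter count grows with the rank rather than with the full size of the weight matrix.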
Shin et al. studied this question.