Abstract

Current vision large language models (V-LLMs) for spinal oncology imaging take a black-box approach to generating a caption from an MRI or CT image, which conflicts with how a radiologist interprets the same image. These models take in the pixels, traverse a latent space of more than a billion parameters, and generate a caption in one shot. This “all-at-once” approach ignores the multi-pass workflow that expert radiologists follow: checking image quality and vertebral levels, mapping baseline anatomy, surveying for disease, measuring epidural tumor extension and spinal-canal compromise, and then integrating these findings into a structured report that drives surgical or radiotherapy decisions. Because the model’s intermediate reasoning remains hidden, clinicians cannot confirm that it has examined every clinically critical cue, nor can they trace the origin of an error when it misses a small sacral lesion or overstates the degree of canal stenosis. This opacity limits trust, complicates regulatory review, and ultimately slows the adoption of AI in oncologic imaging at scale. A total of 1978 expert-captioned studies (radiographs, CT, and MRI) were collected from the public ROCO-v2 corpus, which includes spine images. The proposed solution is an inference-time, five-step Chain-of-Thought (CoT) reasoning procedure that guides a fine-tuned 7-billion-parameter vision–language model to generate the following: Image-quality

2025 Sep 18-21; Baltimore, MD. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2025;34(9 Suppl):Abstract nr C036.
Malik et al. studied this question.
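As a rough illustration of what staged, inference-time reasoning could look like, the sketch below runs five prompts in sequence and passes earlier findings forward so that each intermediate output can be inspected. It rests on several assumptions: the abstract's own step list is cut off after "Image-quality", so the stage names here are borrowed from the radiologist workflow described in the abstract; `generate` is a hypothetical callable standing in for whatever model interface is used; and issuing one call per stage is only one possible reading (a single prompt listing all five steps would be another).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str


# ASSUMPTION: these stages mirror the radiologist workflow described in the
# abstract; they are not the authors' confirmed five steps.
STAGES: List[Stage] = [
    Stage("image_quality", "Assess image quality and identify the vertebral levels shown."),
    Stage("baseline_anatomy", "Describe the baseline anatomy."),
    Stage("disease_survey", "Survey for lesions or other signs of disease."),
    Stage("extension_and_canal", "Assess epidural tumor extension and spinal-canal compromise."),
    Stage("structured_report", "Integrate the findings above into a structured report."),
]


def build_prompt(stage: Stage, prior: Dict[str, str]) -> str:
    """Compose a prompt that shows earlier findings before the next instruction."""
    lines = [f"[{name}] {text}" for name, text in prior.items()]
    context = "\n".join(lines) if lines else "(none yet)"
    return f"Findings so far:\n{context}\n\nNext step ({stage.name}): {stage.instruction}"


def run_staged_inference(
    image: object,
    generate: Callable[[object, str], str],  # hypothetical model interface
) -> Dict[str, str]:
    """Run every stage in order and keep each intermediate output for review."""
    findings: Dict[str, str] = {}
    for stage in STAGES:
        prompt = build_prompt(stage, findings)
        findings[stage.name] = generate(image, prompt)
    return findings
```

Because the function returns every stage's output rather than only the final report, a reviewer could check, for example, whether the disease survey mentioned a sacral lesion before the report was written.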