e18000

Background: Current risk stratification and adjuvant treatment decisions for head and neck cancer following surgical resection rely primarily on pathological risk factors. The growing adoption of digital pathology presents an opportunity to leverage histopathological image features to enhance risk stratification accuracy. This study investigated the performance of machine learning models for treatment outcome prediction using image features and traditional clinical variables, and identified optimal strategies for combining these modalities.

Methods: We analyzed data from 645 patients in the publicly available HANCOCK dataset for model development and testing. All patients had head and neck cancers treated with primary surgery with or without adjuvant therapy. For each patient, H&E-stained whole slide images of the primary tumor and nine clinical variables served as model inputs. Slide-level image embeddings were extracted using a pretrained vision-language pathology foundation model (TITAN). Clinical variables included tumor site, pT classification, pN classification, histologic grade, perineural invasion, lymphovascular invasion, extranodal extension, margin status, and smoking history. Models were trained to predict survival using a machine learning-based Cox proportional hazards model (DeepSurv), with performance measured by Harrell's concordance index (C-index). We compared unimodal models (using either clinical variables or image features alone) with multimodal models utilizing different fusion strategies (concatenation, late fusion, and cross-attention). Five-fold cross-validation with early stopping was implemented during training.

Results: Unimodal models achieved C-indices of 0.62 +/- 0.04 (clinical variables) and 0.65 +/- 0.07 (image features).
Multimodal models achieved C-indices of 0.63 +/- 0.05 (late fusion), 0.67 +/- 0.07 (concatenation), and 0.69 +/- 0.07 (cross-attention), with cross-attention achieving the highest performance.

Conclusions: H&E-stained whole slide images from resected head and neck cancers contain significant prognostic information. Multimodal AI models integrating histopathological images with clinical variables, particularly using cross-attention fusion, enhance prognostic prediction and may improve risk stratification for adjuvant therapy decisions.

Harrell's concordance index for each type of model input and feature fusion method.

Model input                     C-index (mean +/- std)
Clinical variables alone        0.62 +/- 0.04
Image features alone            0.65 +/- 0.07
Multimodal (late fusion)        0.63 +/- 0.05
Multimodal (concatenation)      0.67 +/- 0.07
Multimodal (cross-attention)    0.69 +/- 0.07
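For readers unfamiliar with the metric reported above, the following is a minimal sketch of Harrell's C-index, not the authors' evaluation code. It counts, among comparable patient pairs, how often the patient with the earlier observed event was assigned the higher predicted risk. The function name `harrell_c_index` and the toy data are illustrative assumptions, and tied event times are simply treated as non-comparable.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Fraction of comparable pairs whose predicted risks are correctly ordered.

    time  : follow-up times
    event : 1 if the event was observed, 0 if censored
    risk  : predicted risk scores (higher = worse expected outcome)
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(len(time)):
            if time[j] > time[i]:  # patient j outlived patient i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # correctly ordered
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example (hypothetical data): the third patient is censored.
toy_time = [2, 4, 6, 8]
toy_event = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]
toy_c = harrell_c_index(toy_time, toy_event, toy_risk)  # 4 of 5 pairs concordant
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. In a five-fold setup such as the one described, the per-fold values would be summarized by their mean and standard deviation.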
Sun et al. (Thu,) studied this question.
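The Methods describe a Cox proportional hazards model parameterized by a neural network (DeepSurv). The abstract does not give the training objective, so the following is only a sketch of the standard Cox negative partial log-likelihood that such models typically minimize. The function name is hypothetical, `log_risk` stands in for the network output, and the sketch assumes no tied event times.

```python
import numpy as np

def cox_neg_partial_log_likelihood(time, event, log_risk):
    """Average negative Cox partial log-likelihood (assumes no tied times).

    time     : follow-up times
    event    : 1 if the event was observed, 0 if censored
    log_risk : model output per patient (log relative hazard)
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    log_risk = np.asarray(log_risk, dtype=float)
    if event.sum() == 0:
        raise ValueError("at least one observed event is required")

    # Sort by descending time so that the risk set of patient i
    # (everyone still under observation at time_i) is a prefix of the array.
    order = np.argsort(-time)
    log_risk = log_risk[order]
    event = event[order]

    # Running log-sum-exp gives log(sum of exp(log_risk)) over each risk set.
    log_risk_set = np.logaddexp.accumulate(log_risk)

    # Only patients with an observed event contribute a term.
    log_lik = (log_risk - log_risk_set)[event].sum()
    return -log_lik / event.sum()

# Hypothetical check: with equal risks and three events the loss is log(6) / 3.
uniform_loss = cox_neg_partial_log_likelihood([1, 2, 3], [1, 1, 1], [0.0, 0.0, 0.0])
```

Assigning higher risk to patients with earlier events lowers this loss, which is the same ordering property the C-index rewards at evaluation time.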