What question did this study set out to answer?

To assess the ability of pathology foundation models to predict molecular biomarkers from H&E staining in rare cancer cohorts.

May 30, 2026 · Open Access

What pathology foundation models can and cannot recover from H&E in rare cancers: a pre-registered molecular-biomarker benchmark across seven TCGA cohorts

Key Points

Benchmarking of frozen pathology foundation model embeddings for molecular biomarker prediction across seven rare TCGA cohorts (ACC, UVM, MESO, CHOL, THYM, KICH, DLBC; N=524 patients).
Evaluation through patient-level stratified 5-fold cross-validation using L2-regularised logistic regression and gated-attention multiple-instance learning models.
Analysis included pooled out-of-fold AUROC with bootstrap confidence intervals, permutation-test p-values with BH-FDR correction, nine pre-registered sensitivity analyses, and a comparison with Prov-GigaPath.
Mean AUROC across 28 powered primary cells was 0.643, with 8 cells exceeding 0.70, including DLBC MSI-H (AUROC 0.939, p=0.0002) and THYM GTF2I (0.933, p=0.0005).
Five biomarkers were recoverable after BH-FDR correction, while others, such as UVM GNAQ and MESO NF2, were not recoverable above chance.
Zero-shot transfer from common cohorts to rare cancers was largely unsuccessful, with a TP53 classifier performing at or below chance in all rare cohorts.
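
The AUROC and permutation-test p-values reported in the key points can be illustrated with a minimal sketch. This is not the study's pipeline: the function names, the toy labels and scores, the one-sided test, and the choice of 1,000 permutations are all assumptions made for illustration, and only the Python standard library is used.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_permutations=1000, seed=0):
    """One-sided p-value: share of label shuffles whose AUROC is at least
    as large as the observed AUROC (n_permutations is an assumed default)."""
    rng = random.Random(seed)
    observed = auroc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auroc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical pooled out-of-fold predictions for eight patients.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
toy_auc, toy_p = permutation_p_value(toy_labels, toy_scores)
print(f"AUROC={toy_auc:.4f}, permutation p={toy_p:.4f}")
```

In this toy example 15 of the 16 positive-negative pairs are ranked correctly, so the AUROC is 0.9375. The smallest p-value the sketch can return is 1/(n_permutations + 1), so resolving very small p-values requires more permutations.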

Abstract

Pre-registered benchmark of frozen pathology foundation-model embeddings (UNI2-h, with Prov-GigaPath as a comparison) for molecular-biomarker prediction across seven rare TCGA cohorts (ACC, UVM, MESO, CHOL, THYM, KICH, DLBC).

Background. Pathology foundation models (FMs) predict molecular biomarkers from H&E […] 95 tasks) contains no rare TCGA cohort.

Methods. We benchmarked frozen pathology foundation-model slide embeddings (UNI2-h, mean-pooled to 1536 dimensions, as the primary model; Prov-GigaPath as a comparison) for molecular biomarker prediction across seven rare TCGA cohorts (ACC, UVM, MESO, CHOL, THYM, KICH, DLBC; 524 patients with matched embeddings and labels). Per a frozen pre-registration, we evaluated 30 primary and 9 exploratory cohort × biomarker cells using patient-level stratified 5-fold cross-validation with an L2-regularised logistic-regression linear probe (primary model) and a gated-attention multiple-instance-learning model (ABMIL; secondary). We report pooled out-of-fold AUROC with 1,000× bootstrap confidence intervals, permutation-test p-values with Benjamini–Hochberg FDR correction, and pre/post-calibration expected calibration error (ECE). A baseline-replication gate on CPTAC COAD MSI prediction was passed first. Nine pre-registered sensitivity analyses, a GTF2I–WHO-subtype confound test, and a comparison foundation model (Prov-GigaPath) with two-model ensembling were also evaluated.

Results. Across 28 powered primary cells, mean AUROC was 0.643 (median 0.654; range 0.397–0.939) and 8/28 exceeded 0.70. Five primary cells were recoverable after BH-FDR correction (α=0.05): DLBC MSI-H (AUROC 0.939, p=0.0002), THYM GTF2I (0.933, p=0.0005; partially confounded by histological subtype, see below), THYM TMB-high (0.828, p=0.0005), UVM EIF1AX (0.817, p=0.0024) and UVM chromosome-3 loss (0.794, p=0.0010). Several biomarkers were not recoverable above chance, including the UVM MAPK drivers GNAQ (0.488) and GNA11 (0.525) and MESO NF2 (0.403). The pipeline was validated first by reproducing CPTAC COAD MSI prediction at AUROC 0.879 ± 0.094 (50 Patho-Bench splits), within the published literature range. Platt scaling reduced mean ECE from 0.230 to 0.064 (26/33 cells […] 0.05 in 9/31 cells), and Prov-GigaPath was near-equivalent to UNI2-h (mean AUROC 0.629 vs 0.631) with negligible ensemble gain, indicating that the findings are not artefacts of model choice. The GTF2I result was partially confounded by WHO subtype (Cramér's V 0.601; within-B-type AUROC 0.872). Zero-shot transfer from common cohorts was largely unsuccessful: a TP53 classifier transferred at or below chance to all rare cohorts, and a TMB-high classifier degraded for two of three, though for DLBC it slightly exceeded within-cohort performance.

Conclusions. Frozen pathology-FM embeddings recover a specific subset of molecular biomarkers in rare cancers (driver mutations and chromosomal events with established morphological correlates) while leaving many biomarkers inaccessible at these sample sizes. Four of the five BH-FDR-significant cells also hold under the more conservative Benjamini–Yekutieli correction. The contribution is a calibrated, FDR-controlled, multi-model map of what current foundation models can and cannot recover from rare-cancer histology, rather than a single headline performance figure.
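
The two multiple-testing corrections named in the abstract, Benjamini–Hochberg (BH) and the more conservative Benjamini–Yekutieli (BY), can be sketched as step-up adjusted p-values. This is a minimal standard-library illustration, not the study's code: the function name `adjusted_p_values` and the eight toy p-values are hypothetical and are not the benchmark's per-cell results.

```python
def adjusted_p_values(p_values, method="bh"):
    """Step-up adjusted p-values for Benjamini-Hochberg ("bh") or
    Benjamini-Yekutieli ("by"), returned in the input order."""
    if method not in ("bh", "by"):
        raise ValueError("method must be 'bh' or 'by'")
    m = len(p_values)
    if m == 0:
        return []
    # BY scales BH by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m.
    c = sum(1.0 / k for k in range(1, m + 1)) if method == "by" else 1.0
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * c / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for eight cohort x biomarker cells.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
alpha = 0.05
bh = adjusted_p_values(toy_p, "bh")
by = adjusted_p_values(toy_p, "by")
print("BH-significant cells:", sum(q < alpha for q in bh))
print("BY-significant cells:", sum(q < alpha for q in by))
```

With these toy inputs, two cells pass BH at α=0.05 but only one passes BY. This mirrors the qualitative pattern in the abstract, where a cell that is significant under BH can fall out under BY because every adjusted value is multiplied by the harmonic factor c(m).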
