Pre-registered benchmark of frozen pathology foundation-model embeddings (UNI2-h, with Prov-GigaPath as a comparison) for molecular-biomarker prediction across seven rare TCGA cohorts (ACC, UVM, MESO, CHOL, THYM, KICH, DLBC).
Background. Pathology foundation models (FMs) predict molecular biomarkers from H&E […] 95 tasks) contains no rare TCGA cohort.
Methods. We benchmarked frozen pathology foundation-model slide embeddings (UNI2-h, mean-pooled to 1536 dimensions, as the primary model; Prov-GigaPath as a comparison) for molecular biomarker prediction across seven rare TCGA cohorts (ACC, UVM, MESO, CHOL, THYM, KICH, DLBC; 524 patients with matched embeddings and labels). Per a frozen pre-registration, we evaluated 30 primary and 9 exploratory cohort × biomarker cells using patient-level stratified 5-fold cross-validation with an L2-regularised logistic-regression linear probe (primary model) and a gated-attention multiple-instance-learning model (ABMIL; secondary). We report pooled out-of-fold AUROC with 1,000× bootstrap confidence intervals, permutation-test p-values with Benjamini–Hochberg FDR correction, and pre/post-calibration expected calibration error (ECE). A baseline-replication gate on CPTAC COAD MSI prediction was passed first. Nine pre-registered sensitivity analyses, a GTF2I–WHO-subtype confound test, and a comparison foundation model (Prov-GigaPath) with two-model ensembling were also evaluated.
Results. Across 28 powered primary cells, mean AUROC was 0.643 (median 0.654; range 0.397–0.939) and 8/28 exceeded 0.70. Five primary cells were recoverable after BH-FDR correction (α=0.05): DLBC MSI-H (AUROC 0.939, p=0.0002), THYM GTF2I (0.933, p=0.0005; partially confounded by histological subtype, see below), THYM TMB-high (0.828, p=0.0005), UVM EIF1AX (0.817, p=0.0024) and UVM chromosome-3 loss (0.794, p=0.0010). Several biomarkers were not recoverable above chance, including the UVM MAPK drivers GNAQ (0.488) and GNA11 (0.525) and MESO NF2 (0.403).
The pipeline was validated first by reproducing CPTAC COAD MSI prediction at AUROC 0.879 ± 0.094 (50 Patho-Bench splits), within the published literature range. Platt scaling reduced mean ECE from 0.230 to 0.064 (26/33 cells […] 0.05 in 9/31 cells), and Prov-GigaPath was near-equivalent to UNI2-h (mean AUROC 0.629 vs 0.631) with negligible ensemble gain, indicating that the findings are not artefacts of model choice. The GTF2I result was partially confounded by WHO subtype (Cramér's V 0.601; within-B-type AUROC 0.872). Zero-shot transfer from common cohorts was largely unsuccessful: a TP53 classifier transferred at or below chance to all rare cohorts, and a TMB-high classifier degraded for two of three, though for DLBC it slightly exceeded within-cohort performance.
Conclusions. Frozen pathology-FM embeddings recover a specific subset of molecular biomarkers in rare cancers (driver mutations and chromosomal events with established morphological correlates) while leaving many biomarkers inaccessible at these sample sizes. Four of the five BH-significant primary cells also hold under the more conservative Benjamini–Yekutieli correction. The contribution is a calibrated, FDR-controlled, multi-model map of what current foundation models can and cannot recover from rare-cancer histology, rather than a single headline performance figure.
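The two multiple-testing corrections named in the abstract, Benjamini–Hochberg (BH) and Benjamini–Yekutieli (BY), can be illustrated with a minimal sketch. It is not the study's code: the function name `adjust_pvalues` is hypothetical, and the p-values are made-up illustrative numbers, not results from the benchmark. It shows only why BY is the more conservative of the two: BY inflates each adjusted p-value by the harmonic sum 1 + 1/2 + ... + 1/m.

```python
from math import fsum


def adjust_pvalues(pvals, method="bh"):
    """Return step-up FDR-adjusted p-values (BH or BY), in input order.

    Hypothetical helper for illustration; not taken from the study's pipeline.
    """
    m = len(pvals)
    if m == 0:
        return []
    if method == "bh":
        c = 1.0  # BH: no extra penalty
    elif method == "by":
        # BY: multiply by the harmonic sum c(m) = sum_{i=1..m} 1/i
        c = fsum(1.0 / i for i in range(1, m + 1))
    else:
        raise ValueError("method must be 'bh' or 'by'")

    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        value = min(1.0, pvals[idx] * m * c / rank)
        running_min = min(running_min, value)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values only (NOT from the benchmark).
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
alpha = 0.05

bh = adjust_pvalues(pvals, "bh")
by = adjust_pvalues(pvals, "by")
bh_hits = [p <= alpha for p in bh]
by_hits = [p <= alpha for p in by]

print("BH adjusted:", [round(p, 4) for p in bh], "significant:", sum(bh_hits))
print("BY adjusted:", [round(p, 4) for p in by], "significant:", sum(by_hits))
```

On these illustrative inputs, two tests pass BH at α=0.05 but only one passes BY. This is the same qualitative pattern as a result that holds under BH and drops out under BY.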
Hayden Farquhar (Thu,) studied this question.