What question did this study set out to answer?

The aim is to create a normalized pan-cancer protein dataset to facilitate cross-cohort protein expression analysis.

March 25, 2026 | Open Access

Enabling cross-indication protein expression analysis using a curated pan-cancer dataset and a tailored workflow

Key Points

The aim is to create a normalized pan-cancer protein dataset to facilitate cross-cohort protein expression analysis.
Developed a curated and normalized protein expression dataset from CPTAC data.
Implemented systematic filtering, including a novel algorithm that selects proteins robustly expressed in tumors within any CPTAC cohort.
Applied cohort hybrid imputation based on protein expression patterns.
Calculated intensity-based absolute quantification values from protein abundance and applied both global and smooth quantile normalization.
Global quantile normalization yielded higher rank correlation between CPTAC and TCGA across cancer cohorts than smooth quantile normalization or no normalization.
A combination of cohort hybrid imputation and global quantile normalization effectively normalized protein abundance values.
Demonstrated compatibility of the dataset for studying protein expression across multiple cancer types.
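
To make the global quantile normalization step concrete, the following is a minimal sketch in plain Python, not the study's implementation. It assumes a complete proteins-by-samples matrix (missing values already imputed) on a log scale; the function name `quantile_normalize` and the example values are illustrative only. Every sample is forced onto one shared reference distribution, the mean of the sorted values across samples, while each protein keeps its rank within its sample.

```python
def quantile_normalize(matrix):
    """Global quantile normalization of a proteins x samples matrix.

    Assumes a complete matrix (missing values already imputed).
    Ties within a sample are broken by row order rather than averaged.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Row indices of each sample, ordered from lowest to highest abundance.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]

    # Reference distribution: mean across samples of the r-th smallest value.
    reference = [
        sum(columns[j][orders[j][r]] for j in range(n_cols)) / n_cols
        for r in range(n_rows)
    ]

    # Give the r-th smallest value in every sample the r-th reference value.
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for r, i in enumerate(orders[j]):
            normalized[i][j] = reference[r]
    return normalized


# Illustrative values only: 4 proteins (rows) x 3 samples (columns).
log_abundance = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(log_abundance)
```

After normalization every sample contains the same set of values, so differences between samples reflect only the ordering of proteins. Smooth quantile normalization relaxes this by allowing the reference distribution to differ between groups of samples; it is not shown here.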

Abstract

The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) recently generated harmonized genomic, transcriptomic, proteomic, and clinical data for over 1,000 tumors across 10 cohorts to facilitate pan-cancer discovery research. However, comparing protein expression across CPTAC cohorts remains challenging because missing data are non-uniform and protein expression distributions vary across tumor types. To enable the cancer research community to conduct robust cross-cohort protein expression analysis, we present a curated and normalized pan-cancer protein expression dataset derived from the CPTAC pan-cancer study. Our workflow combines systematic filtering with multiple missing-data handling and normalization strategies. We developed a novel algorithm to select robustly expressed proteins in tumors within any CPTAC cohort; applied a cohort hybrid imputation approach to protein abundance values from FragPipe within each cohort, based on protein expression distribution patterns; and calculated intensity-based absolute quantification using protein abundance values and applied both global and smooth quantile normalization methods. Our analysis demonstrates that global quantile normalization outperforms both smooth quantile normalization and no normalization, as evidenced by its higher rank correlation across cancer cohorts between CPTAC and TCGA for selected proteins. The findings suggest that combining cohort hybrid imputation with global quantile normalization is an effective method for creating a normalized CPTAC pan-cancer protein dataset, which can facilitate the study of protein expression across different cancer types and accelerate cancer research.
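
The rank-correlation comparison described above can be illustrated with a short sketch. This is a hedged example, not the authors' evaluation code: it assumes that each cohort is summarized by one value per protein in each resource (for instance a per-cohort median), and the names `cptac_medians` and `tcga_medians` and their values are invented for illustration. It computes Spearman's rank correlation as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-cohort summaries for one protein, one entry per cohort.
cptac_medians = [21.4, 19.8, 23.1, 20.5, 22.0]
tcga_medians = [8.9, 7.2, 9.6, 8.1, 8.8]
rho = spearman(cptac_medians, tcga_medians)
```

Under this setup, a normalization strategy would be preferred when it produces a higher `rho` for the selected proteins, meaning the cohorts are ordered more consistently in the two resources.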
