August 22, 2025 · Open Access

Robust Distance Correlation for Variable Screening

Key Points

The proposed robust distance correlation method effectively identifies critical features from heavy-tailed data.
Simulations show that the method outperforms existing screening procedures under different scenarios of heavy-tailedness.
An application to RNA-seq data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort demonstrates the method in practice.
In that application, the method identifies biologically meaningful genes predictive of MAPK1 protein expression.

Abstract

In modern statistical applications, identifying critical features in high-dimensional data is essential for scientific discoveries. Traditional best subset selection methods face computational challenges, while regularization approaches such as Lasso, SCAD and their variants often exhibit poor performance with ultrahigh-dimensional data. Sure screening methods, widely used for dimensionality reduction, have been developed as popular alternatives, but few target heavy-tailed characteristics in modern big data. This paper introduces a new sure screening method, based on robust distance correlation ("RDC"), designed for heavy-tailed data. The proposed method inherits the benefits of the original model-free distance correlation-based screening while robustly estimating distance correlation in the presence of heavy-tailed data. We further develop an FDR control procedure by incorporating the Reflection via Data Splitting (REDS) method. Extensive simulations demonstrate the method's advantage over existing screening procedures under different scenarios of heavy-tailedness. Its application to high-dimensional heavy-tailed RNA-seq data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort showcases superior performance in identifying biologically meaningful genes predictive of MAPK1 protein expression critical to pancreatic cancer.
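As a rough illustration of the screening idea the abstract builds on, the sketch below ranks each feature by its classical sample distance correlation with the response and keeps the top-ranked ones. It is not the paper's robust estimator, and it omits the REDS-based FDR control step; the abstract does not specify either. The function names (`distance_correlation`, `screen_features`) and the `keep` parameter are hypothetical, chosen only for this example.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise Euclidean distance matrix, double-centered (rows, columns, grand mean)."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Classical (non-robust) sample distance correlation, a value in [0, 1]."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()      # squared sample distance covariance
    dvar_x = (a * a).mean()     # squared sample distance variance of x
    dvar_y = (b * b).mean()     # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0              # a constant variable carries no dependence signal
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def screen_features(X, y, keep):
    """Return indices of the `keep` features with the largest distance correlation to y."""
    X = np.asarray(X, dtype=float)
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-scores)  # descending by score
    return order[:keep], scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    # Assumed toy model: one active feature plus heavy-tailed (Student t, 3 df) noise.
    y = 2.0 * X[:, 3] + 0.1 * rng.standard_t(df=3, size=200)
    selected, scores = screen_features(X, y, keep=5)
    print(selected)
```

Because the score depends only on pairwise distances, the ranking is model-free: it assumes no linear or parametric link between a feature and the response. Heavy tails inflate those distances, which is the weakness a robust estimate of distance correlation is meant to address.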
