March 4, 2024 | Open Access

Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Ziqi Tang (Cold Spring Harbor Laboratory), Nikunj V. Somia (University of Minnesota), Yiyang Yu (Columbia University)

Abstract

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested only after fine-tuning their weights for each downstream task, whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
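
For readers unfamiliar with the baseline input mentioned in the abstract, one-hot encoding of a DNA sequence can be sketched as follows. This is a minimal illustration only: the helper name `one_hot_encode`, the A/C/G/T column order, and the all-zero treatment of ambiguous bases are assumptions made here, not details taken from the paper.

```python
# Illustrative sketch; the function name, column order, and handling of
# ambiguous bases are assumptions, not the authors' implementation.
ALPHABET = "ACGT"


def one_hot_encode(seq: str) -> list[list[int]]:
    """Return an L x 4 matrix with one row per base.

    Bases outside A/C/G/T (for example N) are encoded as all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A conventional supervised model takes a matrix like this directly as input, whereas a probing setup would instead pass the sequence through a frozen pre-trained gLM and train a small model on the resulting embeddings.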
