What question did this study set out to answer?

The aim is to reduce the requirement for labeled data in biological sequence classification while maintaining accuracy.

June 17, 2026

Language model-based self-training reduces labeled data requirements by 99% for biological sequence classification

Key Points

The aim is to reduce the requirement for labeled data in biological sequence classification while maintaining accuracy.
Implemented a framework combining pre-trained language models with semi-supervised learning.
Utilized pseudo-label selection to enhance label-level certainty with minimal labeled data.
Evaluated on DNA-binding protein and non-coding RNA detection tasks.
Achieved a 99% reduction in labeled data requirements compared to traditional methods.
Reached performance competitive with fully supervised approaches using only 1% of the labeled data.
Successfully identified novel biomolecules in low-resource settings.

Abstract

Predicting the function of a biological sequence is a rapidly growing field that could prove invaluable for deciphering the molecular mechanisms of disease and genetic variation. However, existing techniques are limited by the very small number of experimentally verified labels and by the complexity of modeling sequence context. Previous work has focused on either semi-supervised learning (SSL) or pre-trained language models (PLMs) individually and has not considered their complementarity: PLMs capture syntactic conservation and long-range dependencies from large amounts of unlabeled sequence data as features, whereas SSL refines label-level certainty through confidence-weighted pseudo-label selection. To combine these benefits, we use PLMs as strong feature extractors to capture sequence semantics, and use SSL to constrain the decision boundary by selecting reliable pseudo-labels. We show that our framework achieves performance competitive with full supervision while using far fewer labeled samples on two biological prediction tasks: DNA-binding protein (DBP) and non-coding RNA (ncRNA) detection. Overall, our approach not only offers an efficient way to discover new DBPs and ncRNAs in low-resource settings where experimental verification is infeasible, but also lays a solid methodological foundation for biological sequence classification in computational biology.

Key points

Unlabeled data significantly improves model accuracy in semi-supervised learning.
The language model-based self-training method outperforms traditional semi-supervised approaches such as TSVM.
It achieves performance competitive with state-of-the-art fully supervised methods, requiring only 1% labeled data for effective classification and accurately identifying novel biomolecules.
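The self-training idea described in the abstract (fixed features from a pre-trained model, plus pseudo-labels kept only when the classifier is confident) can be illustrated with a minimal sketch. This is not the authors' implementation. The nearest-centroid classifier is a simple stand-in for whatever classifier the study trains on top of the PLM features, the two-dimensional vectors are toy values standing in for PLM embeddings, and the names `self_train`, `classify`, and the `threshold=0.9` cutoff are hypothetical choices made for illustration.

```python
import math

def centroids(features, labels):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_proba(cents, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = {y: -sum((a - b) ** 2 for a, b in zip(x, c))
              for y, c in cents.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def classify(cents, x):
    """Return the most probable class for one feature vector."""
    probs = predict_proba(cents, x)
    return max(probs, key=probs.get)

def self_train(lab_x, lab_y, unlab_x, threshold=0.9, max_rounds=5):
    """Grow the training set with confident pseudo-labels (assumed cutoff)."""
    train_x, train_y = list(lab_x), list(lab_y)
    pool = list(unlab_x)
    for _ in range(max_rounds):
        cents = centroids(train_x, train_y)
        remaining, added = [], 0
        for x in pool:
            probs = predict_proba(cents, x)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                # Confident prediction: accept it as a pseudo-label.
                train_x.append(x)
                train_y.append(label)
                added += 1
            else:
                # Uncertain prediction: leave it unlabeled for a later round.
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return centroids(train_x, train_y), len(train_y) - len(lab_y)

# Toy "embeddings": one labeled example per class, five unlabeled ones.
lab_x = [[0.0, 0.0], [4.0, 4.0]]
lab_y = ["non-DBP", "DBP"]
unlab_x = [[0.2, -0.1], [3.8, 4.1], [0.5, 0.4], [3.5, 3.6], [2.0, 2.0]]

model, n_pseudo = self_train(lab_x, lab_y, unlab_x)
print(n_pseudo, classify(model, [0.1, 0.1]), classify(model, [3.9, 3.9]))
```

In this toy run the point `[2.0, 2.0]` sits midway between the two classes, so its confidence stays near 0.5 and it is never pseudo-labeled. The threshold trades coverage for label reliability, which is the role the abstract assigns to pseudo-label selection.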
