Abstract

Predicting the function of biological sequences is a rapidly growing field that could become an invaluable tool for deciphering the molecular mechanisms of disease and genetic variation. However, existing techniques are limited by the very small number of experimentally verified labels and by the difficulty of modeling sequence context. Previous work has focused on either semi-supervised learning (SSL) or pre-trained language models (PLMs) in isolation and has not considered their complementarity: PLMs learn conserved sequence patterns and long-range dependencies from large unlabeled sequence corpora and expose them as features, whereas SSL refines label-level certainty through confidence-weighted pseudo-label selection. To combine these benefits, we use PLMs as strong feature extractors that capture sequence semantics, and we apply SSL to constrain the decision boundary by selecting reliable pseudo-labels. On two biological prediction tasks, DNA-binding protein (DBP) and non-coding RNA (ncRNA) detection, our framework achieves performance competitive with full supervision while using far fewer labeled samples. Overall, our approach not only offers an efficient way to discover new DBPs and ncRNAs in low-resource settings where experimental verification is infeasible, but also provides a methodological foundation for biological sequence classification in computational biology.

Key points

Unlabeled data significantly improves model accuracy in semi-supervised learning.

The language model-based self-training method outperforms traditional semi-supervised approaches such as TSVM.

It achieves performance competitive with state-of-the-art fully supervised methods while requiring only 1% labeled data, and it accurately identifies novel biomolecules.
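
The pseudo-label selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the selection rule, so the fixed confidence threshold, the function name `select_pseudo_labels`, and the example probabilities below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> List[Tuple[int, int, float]]:
    """Return (sample_index, predicted_class, confidence) for confident predictions.

    probs[i] holds the class probabilities that a classifier (for example,
    one trained on PLM features) assigns to unlabeled sequence i.
    The default threshold of 0.9 is an illustrative assumption.
    """
    selected = []
    for i, row in enumerate(probs):
        confidence = max(row)
        if confidence >= threshold:
            # Keep the confidence so it can later serve as a sample weight.
            selected.append((i, list(row).index(confidence), confidence))
    return selected


# Hypothetical classifier outputs for four unlabeled sequences (two classes).
unlabeled_probs = [
    [0.97, 0.03],
    [0.55, 0.45],
    [0.08, 0.92],
    [0.40, 0.60],
]
pseudo_labeled = select_pseudo_labels(unlabeled_probs, threshold=0.9)
```

In a self-training loop, the selected samples would typically be added to the labeled set, optionally weighted by their confidence, before the classifier is retrained on the enlarged set.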