What question did this study set out to answer?

The study asks whether active learning on protein language model (PLM) embeddings can accelerate the discovery of Rubisco variants with desired traits.

April 17, 2026 | Open Access

Active Learning on Protein Language Model Embeddings Accelerates Rubisco Variant Discovery for Desired Traits

Key Points

The study aims to accelerate Rubisco variant discovery by using active learning guided by predicted traits of interest.
Rubisco sequences are represented as protein language model (PLM) embeddings.
Supervised regression on these embeddings predicts continuous phenotypic scores for Rubisco variants.
Models are evaluated on a deep mutational scanning dataset and a cyanobacterial screening dataset.
Simulated active-learning campaigns use model-guided selection, starting from 200 measured variants and acquiring batches of 48.
Model-guided selection recovers more top-performing mutants than random sampling at a fixed experimental budget.
A tabular foundation model (TabPFN-2.5) outperforms gradient-boosted trees on rank-based criteria (Spearman correlation, top 5% hit recovery).
Rubisco large-subunit embeddings also predict cyanobacterial doubling time and cross-species kinetic parameters.
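
The two rank-based criteria named above can be illustrated with a short sketch. This is not the authors' evaluation code: the function names are hypothetical, and the definition of top 5% hit recovery used here (the share of the truly best 5% of variants that also appear in the model's predicted best 5%) is an assumption, as is the convention that higher scores are better.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    n = len(rt)
    mean_t, mean_p = sum(rt) / n, sum(rp) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(rt, rp))
    var_t = sum((a - mean_t) ** 2 for a in rt)
    var_p = sum((b - mean_p) ** 2 for b in rp)
    return cov / (var_t * var_p) ** 0.5


def top_fraction_recovery(y_true, y_pred, fraction=0.05):
    """Assumed definition: overlap between the true and predicted top fraction."""
    k = max(1, round(len(y_true) * fraction))
    by_true = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=True)
    by_pred = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    return len(set(by_true[:k]) & set(by_pred[:k])) / k
```

Both measures depend only on the ordering of predictions, which is why they suit variant prioritization: a model can be poorly calibrated in absolute terms and still rank the best mutants first.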

Abstract

Ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) efficiency constrains carbon fixation, making it a high-value target in biotechnology. The core task of this work is a supervised regression and ranking problem on Rubisco: given a numerical representation of a protein sequence (a PLM embedding), we predict continuous phenotypic scores such as an enzyme kinetic proxy or fitness value. The predictions then guide which variants to test next. Engineering Rubisco is a point of focus but remains challenging due to selection forces in vivo and the combinatorial space of potential mutants for ex vivo uses. We combine protein language model (PLM) embeddings with tabular learning to model Rubisco variant landscapes in two regimes. First, we analyze deep mutational scanning data providing inferred kinetic proxies, including Km for CO2 and Vmax. Second, we model a cyanobacterial screening dataset measuring mutant fitness under differing oxygen and nitrogen regimes, enabling an oxygen tolerance objective. Across tasks, a tabular foundation model (TabPFN-2.5) outperforms gradient-boosted trees on rank-based criteria for variant prioritization, including Spearman correlation and top 5% hit recovery. We then simulate active-learning campaigns that are initialized with 200 measured variants and iteratively acquire batches of 48. Model-guided selection recovers more top-performing mutants than random sampling at fixed experimental budgets, even with a conservative XGBoost surrogate. We also demonstrate that Rubisco large-subunit embeddings predict cyanobacterial doubling time and cross-species kinetic parameters, suggesting that Rubisco representation remains meaningful across organisms even with multi-objective cellular constraints. Together, these results support a practical, data-efficient workflow for enzyme engineering and motivate objective-aware design strategies that complement directed evolution.
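
The simulated campaign described in the abstract (200 initial measurements, then batches of 48) can be sketched roughly as follows. Everything beyond those two numbers is an assumption made for illustration: the data are synthetic, the surrogate is a simple k-nearest-neighbor average standing in for XGBoost or TabPFN-2.5, the acquisition rule is plain greedy selection of the highest predictions, and all function names are hypothetical.

```python
import random


def knn_predict(x, labeled, k=5):
    """Stand-in surrogate (assumption): mean score of the k nearest measured variants."""
    nearest = sorted(
        labeled,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])),
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)


def simulate_campaign(embeddings, scores, n_init=200, batch_size=48,
                      n_rounds=3, guided=True, seed=0):
    """Return the indices measured after a simulated campaign."""
    rng = random.Random(seed)
    indices = list(range(len(embeddings)))
    rng.shuffle(indices)
    measured, pool = indices[:n_init], indices[n_init:]
    for _ in range(n_rounds):
        if guided:
            # Refit on everything measured so far, then rank the unmeasured pool.
            labeled = [(embeddings[i], scores[i]) for i in measured]
            pool.sort(key=lambda i: knn_predict(embeddings[i], labeled), reverse=True)
        else:
            rng.shuffle(pool)  # random-sampling baseline
        measured.extend(pool[:batch_size])
        pool = pool[batch_size:]
    return measured


def top_hits(measured, scores, fraction=0.05):
    """Count measured variants that belong to the true top fraction."""
    k = max(1, round(len(scores) * fraction))
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return len(top & set(measured))


# Synthetic stand-in for embeddings and phenotypic scores (not real Rubisco data).
data_rng = random.Random(1)
embeddings = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(600)]
scores = [sum(x) + data_rng.gauss(0, 0.1) for x in embeddings]

guided = simulate_campaign(embeddings, scores, guided=True)
baseline = simulate_campaign(embeddings, scores, guided=False)
print("top hits, guided:", top_hits(guided, scores))
print("top hits, random:", top_hits(baseline, scores))
```

Because both runs share a seed, they start from the same 200 measured variants and spend the same experimental budget, so any difference in top hits comes from the selection rule alone.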

Cite This Study

Young et al. studied this question.

synapsesocial.com/papers/69e1cfcb5cdc762e9d858d24 https://doi.org/10.3390/aichem1020007
