The efficiency of ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) constrains carbon fixation, making the enzyme a high-value target in biotechnology. Engineering Rubisco nevertheless remains challenging because of selection forces in vivo and the combinatorial space of potential mutants for ex vivo uses. We frame the core task as a supervised regression and ranking problem: given a numerical representation of a protein sequence, here a protein language model (PLM) embedding, we predict continuous phenotypic scores such as an enzyme kinetic proxy or a fitness value, and the predictions then guide which variants to test next. We combine PLM embeddings with tabular learning to model Rubisco variant landscapes in two regimes. First, we analyze deep mutational scanning data that provide inferred kinetic proxies, including Km for CO2 and Vmax. Second, we model a cyanobacterial screening dataset that measures mutant fitness under differing oxygen and nitrogen regimes, enabling an oxygen-tolerance objective. Across tasks, a tabular foundation model (TabPFN-2.5) outperforms gradient-boosted trees on the rank-based criteria used for variant prioritization, including Spearman correlation and top 5% hit recovery. We then simulate active-learning campaigns that are initialized with 200 measured variants and iteratively acquire batches of 48. Model-guided selection recovers more top-performing mutants than random sampling at fixed experimental budgets, even with a conservative XGBoost surrogate. We also demonstrate that Rubisco large-subunit embeddings predict cyanobacterial doubling time and cross-species kinetic parameters, suggesting that Rubisco representations remain meaningful across organisms even under multi-objective cellular constraints. Together, these results support a practical, data-efficient workflow for enzyme engineering and motivate objective-aware design strategies that complement directed evolution.
Young et al. studied this question.
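The rank-based criteria and the simulated campaign described above can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration rather than the authors' pipeline: the synthetic two-dimensional "embeddings", the k-nearest-neighbour surrogate (a stand-in for XGBoost or TabPFN), the greedy acquisition rule, the choice of five rounds, and all function names are hypothetical. Only the initial set of 200, the batch size of 48, and the top 5% threshold come from the text.

```python
import heapq
import random

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def top_hits(scores, frac=0.05):
    """Indices of the highest-scoring fraction of variants."""
    k = max(1, int(len(scores) * frac))
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

def hit_recovery(y_true, y_pred, frac=0.05):
    """Share of the true top fraction that also lies in the predicted top fraction."""
    truth = top_hits(y_true, frac)
    return len(truth & top_hits(y_pred, frac)) / len(truth)

def knn_fit_predict(X_train, y_train, X_query, k=5):
    """Toy surrogate (assumption): mean label of the k nearest measured variants."""
    preds = []
    for q in X_query:
        near = heapq.nsmallest(
            k, zip(X_train, y_train),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], q)))
        preds.append(sum(label for _, label in near) / len(near))
    return preds

def simulate_campaign(X, y, guided, n_init=200, batch=48, rounds=5, seed=0):
    """Fraction of the true top 5% measured once the budget is spent."""
    pool = list(range(len(y)))
    random.Random(seed).shuffle(pool)
    measured, pool = pool[:n_init], pool[n_init:]
    for _ in range(rounds):
        if guided:
            # Greedy acquisition (assumption): take the highest predictions.
            preds = knn_fit_predict([X[i] for i in measured],
                                    [y[i] for i in measured],
                                    [X[i] for i in pool])
            order = sorted(range(len(pool)), key=lambda j: preds[j], reverse=True)
            picked = [pool[j] for j in order[:batch]]
        else:
            picked = pool[:batch]  # pool is already shuffled, so this is random
        chosen = set(picked)
        measured = measured + picked
        pool = [i for i in pool if i not in chosen]
    target = top_hits(y)
    return len(target & set(measured)) / len(target)

# Synthetic landscape (assumption): score peaks at the centre of the unit square.
rng = random.Random(1)
X = [(rng.random(), rng.random()) for _ in range(1000)]
y = [-((a - 0.5) ** 2 + (b - 0.5) ** 2) for a, b in X]

guided_recovery = simulate_campaign(X, y, guided=True)
random_recovery = simulate_campaign(X, y, guided=False)
print(f"guided: {guided_recovery:.2f}  random: {random_recovery:.2f}")
```

Both arms share the same initial 200 variants and the same total budget, so any difference in recovery reflects the acquisition rule alone. In practice the toy surrogate would be replaced by the regressor under study, and the synthetic landscape by held-out measurements.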