Abstract Protein–protein interactions (PPIs) play a key role in virtually all cellular processes. However, experimental identification of PPIs remains costly, time-consuming, and often incomplete. To address these challenges, this study presents a hybrid adaptive framework for PPI prediction that integrates modern protein language models with evolutionary optimization and ensemble learning. The framework uses the language model Prot-T5-XL-Uniref-50 to embed protein sequences, capturing rich contextual, structural, and physicochemical information. The resulting high-dimensional representations are then compressed using uniform manifold approximation and projection (UMAP) to reduce computational complexity. A hybrid approach coupling the multi-objective non-dominated sorting genetic algorithm II (NSGA-II) with random forest is then proposed to enhance classifier robustness. This evolutionary strategy simultaneously maximizes prediction accuracy and classifier diversity while estimating the optimal number of trees for the ensemble from the Pareto-optimal fronts. Comparative results against state-of-the-art methods validate the superior performance of the proposed method across four benchmark datasets: Human, E. coli, Drosophila, and C. elegans. Finally, SHapley Additive exPlanations (SHAP) were used to quantify and visualize each feature's contribution to the model's predictions, facilitating the ranking and examination of influential embedding dimensions. Overall, the proposed framework offers a reliable and robust solution for large-scale PPI prediction based solely on protein sequence data.
Chatterjee et al. studied this question.
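The abstract describes choosing the ensemble size from Pareto-optimal fronts over two objectives, prediction accuracy and classifier diversity. The following is a minimal sketch of that idea only, not the authors' implementation: it applies the non-dominated filter that underlies NSGA-II to a handful of candidate ensembles. The `Candidate` tuple layout, the toy scores, and the `pick_n_trees` selection rule are all hypothetical and chosen purely for illustration; a full NSGA-II would also include crowding distance, crossover, and mutation.

```python
from typing import List, Tuple

# Hypothetical layout: (number of trees, accuracy, diversity).
# Both objectives are maximized, as in the abstract.
Candidate = Tuple[int, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    and strictly better on at least one."""
    no_worse = a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


def pick_n_trees(front: List[Candidate]) -> int:
    """Assumed selection rule: take the most accurate point on the
    front, breaking ties in favor of fewer trees."""
    best = max(front, key=lambda c: (c[1], -c[0]))
    return best[0]


# Toy scores, invented for illustration (not results from the paper).
candidates: List[Candidate] = [
    (50, 0.90, 0.40),   # dominated by (200, 0.91, 0.45)
    (100, 0.93, 0.35),
    (150, 0.92, 0.30),  # dominated by (100, 0.93, 0.35)
    (200, 0.91, 0.45),
    (250, 0.89, 0.44),  # dominated by (200, 0.91, 0.45)
]

front = pareto_front(candidates)
n_trees = pick_n_trees(front)
print(front)
print(n_trees)
```

With these toy values, two candidates remain on the front, one favoring accuracy and one favoring diversity. How to pick a single point from the front is a design choice; the accuracy-first rule here is only one plausible option.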