A protein embedding is a representation that carries information distilled from the large volumes of sequences stored in sequence archives. Typically, the protein is represented by a matrix in which each residue is a context-specific vector whose dimensionality reflects the size of the neural network architecture (a transformer) trained with deep learning algorithms on those sequences. A recently introduced method, Embedding-Based Alignment (EBA), is particularly suited to pairwise embedding comparison and, as we report here, allows remote homolog detection under specific constraints, including similarity in protein sequence length. Multifunctional proteins occur in many species, but their structural and functional annotation is especially urgent in humans, where, according to recent statistics, they comprise up to 50% of the reference proteome. In this paper we show that when EBA is applied to a set of randomly selected multifunctional human proteins, it retrieves, after a clustering procedure and rigorous validation against the reference Swiss-Prot database, proteins that are remote homologs of each other and that share structural and functional features with the query protein.
Vazzana et al. studied this question.
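To make the idea of comparing two per-residue embedding matrices concrete, the following is a minimal sketch and not the EBA implementation itself. It assumes random vectors as stand-ins for language-model embeddings, an exponential-of-Euclidean-distance residue similarity, and a gap-free-penalty global alignment; the function names (`random_embedding`, `similarity_matrix`, `global_alignment_score`, `normalized_score`) are hypothetical and chosen only for illustration.

```python
import math
import random


def random_embedding(length, dim, seed):
    """Stand-in for a real embedding: one dim-sized vector per residue (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


def similarity_matrix(emb_a, emb_b):
    """Residue-by-residue similarity, here exp(-Euclidean distance) (assumed choice)."""
    return [[math.exp(-math.dist(u, v)) for v in emb_b] for u in emb_a]


def global_alignment_score(sim, gap=0.0):
    """Needleman-Wunsch style dynamic programming over the similarity matrix."""
    n, m = len(sim), len(sim[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with residue j
                dp[i - 1][j] - gap,                    # gap in the second protein
                dp[i][j - 1] - gap,                    # gap in the first protein
            )
    return dp[n][m]


def normalized_score(emb_a, emb_b):
    """Alignment score divided by the shorter length; lies in (0, 1] when gap=0."""
    sim = similarity_matrix(emb_a, emb_b)
    return global_alignment_score(sim) / min(len(emb_a), len(emb_b))


if __name__ == "__main__":
    query = random_embedding(length=12, dim=8, seed=0)
    other = random_embedding(length=10, dim=8, seed=1)
    print("self score:", normalized_score(query, query))
    print("pair score:", normalized_score(query, other))
```

Dividing by the shorter length is one simple way to make scores comparable across protein pairs; it also hints at why length similarity matters, since a large length mismatch leaves many residues of the longer protein unaligned.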