The protein universe remains only partially explored, with many protein families and functions yet to be discovered. Leveraging large-scale protein sequence and structure datasets, we recently developed an unsupervised representation of this landscape that reveals functionally relevant clusters across millions of proteins. This data-driven approach enables the prioritization and characterization of unknown protein families, providing candidates for experimental validation. Using this framework, we uncovered novel biology at an unprecedented scale, including previously unknown prokaryotic defense systems and a new protein fold. This work underscores the power of AI-guided discovery and lays the foundation for a dynamic, continually evolving atlas of the protein universe, accelerating our understanding of molecular function and evolution across life.
Joana Pereira (Thu,) studied this question.