Key points are not available for this paper at this time.
Degeneracy in the genetic code allows many possible DNA sequences to encode the same protein. Optimizing codon usage within a sequence to meet organism-specific preferences faces combinatorial explosion. Nevertheless, natural sequences optimized through evolution provide a rich source of data for machine learning algorithms to explore the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all domains of life. The model demonstrates context-awareness thanks to its Transformers architecture and to our sequence representation strategy that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimum negative cis-regulatory elements. This work introduces the strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a codon optimization framework with a customizable open-access model and a user-friendly Google Colab interface.
Building similarity graph...
Analyzing shared references across papers
Loading...
Adibvafa Fallahpour
University Health Network
Vincent Gureghian
Centre National de la Recherche Scientifique
Guillaume J. Filion
The Scarborough Hospital
Nature Communications
Centre National de la Recherche Scientifique
Inserm
Sorbonne Université
Building similarity graph...
Analyzing shared references across papers
Loading...
Fallahpour et al. (Thu,) studied this question.
synapsesocial.com/papers/6a02630c27fccfd929bd1ffd — DOI: https://doi.org/10.1038/s41467-025-58588-7