Abstract
Background: Accurate prediction of drug-target affinity (DTA) is a key step in drug discovery, but experimental measurement remains costly and time-consuming. Deep learning approaches have recently shown strong performance using sequence- or graph-level representations of drugs and proteins, and attention mechanisms add interpretability by highlighting key molecular subsequences. However, conventional attention computes dependencies between single token pairs and therefore cannot fully capture the cooperative, multi-residue binding behavior that characterizes real protein-ligand interactions.
Methods: We propose a Multi-Token Attention (MTA)-based DTA model that captures both global and local interaction contexts between drugs and target proteins. The model was trained and evaluated on the widely used Davis and KIBA benchmark datasets for binding affinity prediction. Protein sequences and drug SMILES strings were embedded with pretrained models: ESM2 (esm2_t33_650M_UR50D) for proteins and ChemBERTa (ChemBERTa-77M-MTR) for compounds. The MTA module uses a key-wise convolution to aggregate local token neighborhoods before the attention calculation and a head-mixing convolution to fuse information across attention heads, so that attention can depend on multiple tokens at once.
Results: On both benchmark datasets, MTA-DTA consistently outperformed prior methods. On Davis, it achieved a mean squared error (MSE) of 0.225, an R² of 0.7155, and a concordance index (CI) of 0.8796. On KIBA, it reached an MSE of 0.1612, an R² of 0.7661, and a CI of 0.8781, exceeding the performance of existing DTA models. In addition, qualitative analysis of the attention maps showed that MTA captured biologically meaningful binding regions, highlighting continuous residue segments and substructural motifs rather than isolated token pairs. This indicates that the model integrates both local chemical context and global interaction patterns, improving accuracy and interpretability in drug-target affinity prediction.
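For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how MSE, R², and CI are commonly computed for affinity regression. It is not the authors' evaluation code. It assumes that R² denotes the standard coefficient of determination (the abstract does not specify the variant) and that CI is the usual pairwise concordance index with tied predictions counted as 0.5. The function names and the toy values are hypothetical.

```python
from itertools import combinations


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true affinity are skipped; tied predictions score 0.5.
    """
    score = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # not comparable: no true ordering to recover
        comparable += 1
        agreement = (t_i - t_j) * (p_i - p_j)
        if agreement > 0:
            score += 1.0  # predicted order matches true order
        elif agreement == 0:
            score += 0.5  # tied predictions
    return score / comparable


if __name__ == "__main__":
    # Toy values for illustration only; not data from Davis or KIBA.
    measured = [5.0, 6.0, 7.0, 8.0]
    predicted = [5.5, 5.8, 7.4, 7.1]
    print(mse(measured, predicted))
    print(r2(measured, predicted))
    print(concordance_index(measured, predicted))
```

Lower MSE and higher R² and CI indicate better predictions. CI depends only on the ranking of predictions, so it complements the two error-based metrics.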
Conclusion: MTA-DTA extends conventional attention from single-token to multi-token interaction modeling, narrowing the gap between sequence-based learning and the cooperative nature of molecular binding. By incorporating key-wise and head-mixing convolutions, the model represents local interaction motifs while maintaining awareness of global dependencies, resulting in more biologically realistic affinity predictions.
Citation Format: Il-san Jeong, Seung-Woo Baek, Jee-Woo Seo, Yeo-Gyeong Yoon, Jae-Yoon Kim, Seon-Young Kim, Seon-Kyu Kim. MTA-DTA: A multi-token attention framework for drug-target binding affinity prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl): Abstract nr 981.