Chromatin looping, which shapes the three-dimensional (3D) organization of the genome, is essential for regulating gene expression. This process relies on the interaction of numerous transcription factors (TFs), particularly CCCTC-binding factor (CTCF) and cohesin, whose dynamic binding patterns orchestrate loop formation. Current computational methods for predicting CTCF-mediated chromatin loops struggle to make genome-wide predictions, primarily because of the extreme imbalance between positive and negative samples in training datasets. Existing DNA-sequence-based models often fail to capture the complex dynamics of TF binding and the regulatory code behind chromatin looping. To address these challenges, we present TF-loop, a novel TF regulatory language framework designed to predict chromatin loops. This framework treats TF sequences, defined by the binding positions and orientations of five key TFs, as a structured "TF language." Using the BERT model, TF-loop decodes the latent linguistic patterns embedded in these sequences to predict chromatin loops accurately. Comparative analysis against the state of the art demonstrates that TF-loop significantly improves prediction accuracy across diverse cell types, even on highly imbalanced datasets. The results highlight the potential of TF-loop to offer a new perspective on decoding the 3D structure of chromatin using natural language processing techniques.
Qi et al. (Sun,) studied this question.
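To make the "TF language" idea concrete, the sketch below shows one way that binding positions and orientations could be turned into token sequences for a BERT-style model. It is an illustration under stated assumptions, not the TF-loop implementation. The TF names other than CTCF (here RAD21 and SMC3, two cohesin subunits), the token format, the special tokens, and all function names are hypothetical choices made for this example. Pairing two anchor windows in a single input is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingSite:
    tf: str        # TF name; everything except CTCF is an illustrative assumption
    position: int  # genomic coordinate of the binding site
    strand: str    # "+" or "-" for oriented motifs, "." if unoriented


def to_tf_sentence(sites, start, end):
    """Return the TF 'words' inside [start, end), ordered by position.

    Each word joins the TF name and its orientation, e.g. "CTCF+".
    """
    in_window = [s for s in sites if start <= s.position < end]
    in_window.sort(key=lambda s: s.position)
    return [f"{s.tf}{s.strand}" for s in in_window]


def build_vocab(sentences):
    """Map every token to an integer id, reserving BERT-style special tokens."""
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_pair(left, right, vocab):
    """Encode two anchor sentences as [CLS] left [SEP] right [SEP]."""
    unk = vocab["[UNK]"]
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(token, unk) for token in left]
    ids.append(vocab["[SEP]"])
    ids += [vocab.get(token, unk) for token in right]
    ids.append(vocab["[SEP]"])
    return ids


# Made-up binding sites around two candidate loop anchors (not real data).
sites = [
    BindingSite("CTCF", 1200, "+"),
    BindingSite("RAD21", 1250, "."),
    BindingSite("SMC3", 97500, "."),
    BindingSite("CTCF", 98000, "-"),
]

left_anchor = to_tf_sentence(sites, 1000, 2000)
right_anchor = to_tf_sentence(sites, 97000, 99000)
vocab = build_vocab([left_anchor, right_anchor])
input_ids = encode_pair(left_anchor, right_anchor, vocab)
```

The resulting `input_ids` could be passed to any BERT-style sequence classifier that labels the anchor pair as looping or not. Because the orientation is part of each token, the model can in principle learn orientation-dependent patterns such as pairs of convergent CTCF motifs.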