Abstract

With the emergence of large language models and their impressive performance across diverse natural language processing tasks, the question of whether connectionist models can exhibit compositionality without relying on symbolic processing has regained attention in both cognitive science and artificial intelligence. However, interpretability challenges faced by neural networks make it difficult to determine whether they genuinely generalize compositional structures. In this paper, we introduce a targeted evaluation framework designed to directly assess the ability of transformer-based language models to translate natural language sentences into first-order logic expressions, a task that requires both nuanced linguistic understanding and compositional generalization. To demonstrate our framework, we fine-tune two different sizes of the T5 language model using our dataset, evaluating their performance through three experiments that employ four task-specific evaluation metrics. Our findings reveal that while these models achieve high scores on test data sharing the logical and structural complexity of the training set, their performance drops markedly as sentence length, the number of truth-functional connectives and predicates, and the depth of hierarchical composition increase. More strikingly, the models fail to generalize even when complexity increases solely through repeated applications of a single truth-functional connective.
Ibrahim Ethem Deveci studied this question.