What question did this study set out to answer?

This research aims to evaluate how well transformer-based language models can translate natural language into first-order logic, focusing on compositional generalization.

June 15, 2026Open Access

A Diagnostic Framework for Compositional Generalization via NL-to-FOL Translation

Key Points

This research aims to evaluate how well transformer-based language models can translate natural language into first-order logic, focusing on compositional generalization.
Introduced a targeted evaluation framework for assessing translation abilities.
Fine-tuned two sizes of the T5 language model using a custom dataset.
Conducted three experiments with four task-specific evaluation metrics.
Models achieved high scores on test data similar in complexity to training data.
Performance significantly declined as sentence length and structural complexity increased.
Models struggled to generalize with increased complexity from repeated truth-functional connectives.

Abstract

Abstract With the emergence of large language models and their impressive performance across diverse natural language processing tasks, the question of whether connectionist models can exhibit compositionality without relying on symbolic processing has regained attention in both cognitive science and artificial intelligence. However, interpretability challenges faced by neural networks make it difficult to determine whether they genuinely generalize compositional structures. In this paper, we introduce a targeted evaluation framework designed to directly assess the ability of transformer-based language models to translate natural language sentences into first-order logic expressions, a task that requires both nuanced linguistic understanding and compositional generalization. To demonstrate our framework, we fine-tune two different sizes of the T5 language model using our dataset, evaluating their performance through three experiments that employ four task-specific evaluation metrics. Our findings reveal that while these models achieve high scores on test data sharing the logical and structural complexity of the training set, their performance drops markedly as sentence length, the number of truth-functional connectives and predicates, and the depth of hierarchical composition increase. More strikingly, the models fail to generalize even when complexity increases solely through repeated applications of a single truth-functional connective.

Mark Helpful

Bookmark

Relay

View Full Paper