What question did this study set out to answer?

The aim is to assess large language models' capabilities in formal logical reasoning through a new benchmark.

February 14, 2026Open Access

DL-ReasonSuite: A Benchmark for Evaluating Description Logic Reasoning in Large Language Models

Key Points

The aim is to assess large language models' capabilities in formal logical reasoning through a new benchmark.
Developed DL-ReasonSuite benchmark with 4740 reasoning tasks across three tracks: DLCore, DLQuery, DLBridge.
Employed automatic evaluation metrics for task validation using reliable OWL reasoners.
Conducted empirical evaluation of five leading reasoning language models on the benchmark.
Best model, Phi4 Reasoning Plus, achieved 85% accuracy, excelling in tool-augmented tasks.
Notable performance variability observed among models, particularly in complex query reasoning and OWL translation.
Insights into limitations such as handling SPARQL queries and maintaining logical consistency were identified.

Abstract

Large language models (LLMs) have shown remarkable progress in general reasoning and understanding, but their ability to perform formal logical reasoning remains under-explored. In this paper, we introduce DLReasonSuite, a novel benchmark designed to rigorously evaluate LLMs on reasoning tasks grounded in Description Logic (DL). DL-ReasonSuite comprises 4740 tasks spanning seven distinct task types and organized into three reasoning tracks: (1) DLCore, covering fundamental ontology reasoning tasks (consistency checking, subsumption, and instance checking); (2) DLQuery, focusing on answering entailment-aware SPARQL queries; and (3) DLBridge, bridging natural language and formal logic (bidirectional NL ↔ OWL translation and tool-augmented entailment resolution). We detail the methodology for designing and implementing this benchmark, including task construction, automatic evaluation metrics and validation using reliable OWL reasoners. Then, we present an empirical evaluation of five leading reasoning LLMs as stateofart models: Kimi k1.5, LlamaNemotron Ultra, DeepSeekR1, Phi4 Reasoning Plus, and Phi4 Reasoning on the full suite of tasks. Our results reveal significant variability in LLM performance on formal reasoning was observed. While the best model, Phi4 Reasoning Plus, achieves an overall accuracy of 85% and excels especially in tool-augmented tasks, other models struggle notably with complex query reasoning for DL and precise OWL translation. We analyze the strengths and weaknesses of each model across different DL metrics and task categories, providing insights into current limitations of LLM reasoning such as handling SPARQL queries and maintaining logical consistency and the benefits of neuro-symbolic techniques. DL-ReasonSuite is a comprehensive framework for assessing and advancing LLMs’ Description Logic reasoning capabilities aiming to bridge the gap between natural language understanding and formal knowledge representation.

Bookmark

View Full Paper

Bookmark

View Full Paper

DL-ReasonSuite: A Benchmark for Evaluating Description Logic Reasoning in Large Language Models

Key Points

Abstract

Cite This Study