Large language models (LLMs) have shown remarkable progress in general reasoning and understanding, but their ability to perform formal logical reasoning remains under-explored. In this paper, we introduce DLReasonSuite, a novel benchmark designed to rigorously evaluate LLMs on reasoning tasks grounded in Description Logic (DL). DL-ReasonSuite comprises 4740 tasks spanning seven distinct task types and organized into three reasoning tracks: (1) DLCore, covering fundamental ontology reasoning tasks (consistency checking, subsumption, and instance checking); (2) DLQuery, focusing on answering entailment-aware SPARQL queries; and (3) DLBridge, bridging natural language and formal logic (bidirectional NL ↔ OWL translation and tool-augmented entailment resolution). We detail the methodology for designing and implementing this benchmark, including task construction, automatic evaluation metrics and validation using reliable OWL reasoners. Then, we present an empirical evaluation of five leading reasoning LLMs as stateofart models: Kimi k1.5, LlamaNemotron Ultra, DeepSeekR1, Phi4 Reasoning Plus, and Phi4 Reasoning on the full suite of tasks. Our results reveal significant variability in LLM performance on formal reasoning was observed. While the best model, Phi4 Reasoning Plus, achieves an overall accuracy of 85% and excels especially in tool-augmented tasks, other models struggle notably with complex query reasoning for DL and precise OWL translation. We analyze the strengths and weaknesses of each model across different DL metrics and task categories, providing insights into current limitations of LLM reasoning such as handling SPARQL queries and maintaining logical consistency and the benefits of neuro-symbolic techniques. DL-ReasonSuite is a comprehensive framework for assessing and advancing LLMs’ Description Logic reasoning capabilities aiming to bridge the gap between natural language understanding and formal knowledge representation.
Olucoglu et al. (Thu,) studied this question.