This study evaluates the capabilities of Large Language Models (LLMs) in processing Shariah-related queries within Islamic finance. We introduce a three-part benchmark framework: first, a multiple-choice dataset that tests factual knowledge; second, a vulnerability dataset that assesses resistance to erroneous fatwas; and third, an applied reasoning dataset that evaluates usul al-fiqh methodology. Six models were tested, including ChatGPT, Claude, and a domain-aligned Islamic model. The results confirm that LLMs are unqualified to issue new Islamic legal rulings, as they are susceptible to theological drift under adversarial prompting. The models also struggled to apply established rulings reliably to familiar scenarios, with weaknesses in legal maxims and cross-school reasoning. However, several models proved useful for factual retrieval and for summarizing Islamic finance concepts. This framework provides the first structured benchmark for evaluating AI applications in Islamic finance.
Al-Syed et al. (Wed,) studied this question.