Translating scientific questions expressed in natural language into SPARQL queries that can be executed over knowledge graphs remains a significant challenge in the field of question answering. Recently, several prominent benchmarks, notably SciQA and DBLP-QuAD, have emerged to evaluate performance in this domain. In this paper, we provide a comprehensive analysis of the performance of language models on these benchmarks, assessing various optimization strategies. Our results indicate that the combined use of fine-tuning and prompting techniques, especially when incorporating strategic few-shot selection, produces excellent results on both benchmarks. These findings underscore an urgent need for more challenging benchmarks to better assess model capabilities. We identify key insights, common error patterns, and potential opportunities for transfer learning, and we discuss their implications for optimizing the performance of large language models in knowledge graph-based question answering tasks.
Meloni et al. studied this question.