This thesis explores the Natural Language to SQL (NL2SQL) task from three perspectives: few-shot selection, database optimization, and benchmark error detection with rule-based validation. A key focus is to investigate whether the performance of large language models (LLMs) has been underestimated in NL2SQL tasks. We argue that this underestimation stems mainly from three factors: (1) LLM hallucination, (2) errors in benchmark gold answers, and (3) limitations in evaluation systems. Experimental results show that under the original 3-shot DAIL-SQL setting, the model achieves an execution accuracy of 80.9%. After applying a series of corrections, including benchmark error fixing and rule-based validation, the accuracy rises to 87.5%, a gain of 6.6 percentage points. These findings suggest that LLMs are substantially more capable in NL2SQL tasks than previously perceived. Furthermore, we identify a gap between LLM-based models and real-world deployment scenarios. To bridge this gap, we extend DAIL-SQL by developing a practical VS Code plugin that integrates our optimization strategies, enabling more robust and user-friendly NL2SQL applications in real-world settings.
Haozhe Xu (Thu,) studied this question.