The classical Chinese language is characterized by high semantic density, extensive polysemy, and strong dependence on historical and cultural context, all of which pose challenges for existing large language models (LLMs). Retrieval-augmented generation (RAG) has become a prevailing way to address these issues without retraining the model, but most existing RAG systems treat structured tables as unstructured text and encode a whole table into a single vector. This scheme tends to obscure row-level semantic information and raises the reasoning cost for LLMs. In this study, we propose a table-aware, row-wise retrieval system in which each row of a table is treated as an individual semantic unit, so that row structure is explicit at retrieval time rather than left for the LLM to infer implicitly at generation time. We organize the table into row-level vector representations, which makes retrieval more deterministic and semantically interpretable, particularly for pedagogical or philological datasets. The system is built on LangChain and integrated with Qwen LLMs. In experiments on classical Chinese learning tasks, it improves retrieval performance, semantic consistency, and explainability over traditional RAG systems, with no model training or extra computation time required.
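The row-wise idea can be illustrated with a minimal sketch. It is not the system's implementation: the table contents, the column names, and the helpers `row_to_text`, `embed`, `build_row_index`, and `retrieve` are all hypothetical, and a toy bag-of-words vector stands in for a real embedding model so that the example runs without LangChain or Qwen. The point it shows is only that each row is serialized and indexed separately, so a query returns whole rows rather than a whole table.

```python
import math
from collections import Counter

# Hypothetical toy table; the column names and rows are illustrative only.
TABLE = [
    {"character": "之", "part_of_speech": "particle", "gloss": "possessive marker"},
    {"character": "曰", "part_of_speech": "verb", "gloss": "to say"},
    {"character": "吾", "part_of_speech": "pronoun", "gloss": "I, me"},
]


def row_to_text(row):
    """Serialize one row as 'column: value' pairs so the header travels with the cell."""
    return "; ".join(f"{key}: {value}" for key, value in row.items())


def embed(text):
    """Stand-in embedding (assumption): a bag of lowercase tokens, not a neural model."""
    for punct in (";", ":", ","):
        text = text.replace(punct, " ")
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_row_index(table):
    """Create one vector per row instead of one vector per table."""
    return [(row, embed(row_to_text(row))) for row in table]


def retrieve(index, query, k=1):
    """Return the k rows whose vectors are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [row for row, _ in ranked[:k]]


INDEX = build_row_index(TABLE)
top_row = retrieve(INDEX, "verb meaning to say", k=1)[0]
print(top_row["character"], "-", top_row["gloss"])
```

Because the retrieved unit is a single row with its column names attached, the evidence passed to the LLM can be traced back to one table entry, which is the interpretability property described above.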
Liu et al. (Thu,) studied this question.