Traditional Visual Question Answering (VQA) models have a significant limitation: they can neither store nor use up-to-date external world knowledge. Although Retrieval-Augmented Generation (RAG) can address hallucination, current research focuses on massive, computation-heavy models, which leaves it unclear whether RAG is effective in resource-constrained environments. This paper presents a comprehensive study of building a lightweight multimodal RAG (MM-RAG) pipeline on consumer-grade hardware. Specifically, it compares two small language models: TinyLlama (1.1B) and Qwen 2.5 (3B). The results show, first, that access to RAG significantly increases accuracy (by about 13%-16%) over zero-shot baselines, regardless of model scale. Second, and most importantly, the study identifies a verbosity failure in instruction-tuned small language models (SLMs): they are more likely to generate output noise, which leads to low evaluation scores. To address this, the study introduces a post-processing protocol that recovers the accuracy of Qwen from 8% to 52.6%. These findings illustrate that for edge-deployed VQA systems, rigorous output alignment is as critical as the retrieval mechanism itself.
Runkai Dong (Mon,) studied this question.
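The abstract does not spell out the post-processing protocol, so the following is only a minimal sketch of what such an output-alignment step might look like, assuming the evaluation is a normalized exact match against a short gold answer. The function names (`normalize`, `extract_short_answer`, `exact_match`) and the list of lead-in phrases are hypothetical and are not taken from the paper.

```python
import re
import string

# Hypothetical lead-in phrases a verbose instruction-tuned model might emit.
# This list is an assumption for illustration, not the paper's protocol.
_LEAD_INS = re.compile(
    r"^(?:the\s+answer\s+is|answer\s*:|it\s+is|this\s+is|there\s+(?:is|are))\s*",
    re.IGNORECASE,
)
_ARTICLES = re.compile(r"\b(?:a|an|the)\b")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def extract_short_answer(raw: str) -> str:
    """Reduce a verbose generation to a short, normalized answer string."""
    # Keep only the first non-empty line; later lines are often explanation.
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    first = lines[0] if lines else ""
    # Keep only the first sentence of that line.
    first = re.split(r"(?<=[.!?])\s+", first, maxsplit=1)[0]
    # Strip one leading boilerplate phrase, if present.
    first = _LEAD_INS.sub("", first)
    return normalize(first)


def exact_match(prediction: str, gold: str) -> bool:
    """Compare the cleaned prediction against the normalized gold answer."""
    return extract_short_answer(prediction) == normalize(gold)
```

Under these assumptions, a generation such as "The answer is a red bus. It is parked near the curb." would be scored against the gold answer "red bus" as a match, whereas a raw string comparison would mark it wrong. A rule-based cleaner like this is deliberately conservative: answers buried mid-sentence are still counted as misses.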