What question did this study set out to answer?

This research aims to evaluate the effectiveness of Retrieval-Augmented Generation (RAG) in enhancing small language models for visual question answering.

April 11, 2026

Optimizing Retrieval-Augmented Generation for Small Language Models via Output Alignment

Key Points

Developed a lightweight multimodal RAG (MM-RAG) pipeline on consumer-grade hardware.
Compared two small language models: TinyLlama (1.1B) and Qwen 2.5 (3B).
Implemented a post-processing protocol to improve output accuracy.
Achieved a significant accuracy increase of 13%-16% with RAG compared to zero-shot baselines.
Identified a verbosity failure in instruction-tuned small language models: they tend to generate output noise, which leads to low evaluation scores.
Restored Qwen's accuracy from 8% to 52.6% using the developed post-processing protocol.
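
The verbosity failure and the post-processing step in the points above can be illustrated with a minimal sketch. This summary does not describe the paper's actual protocol, so everything here is an assumption: the lead-in phrase list, the `align_output` and `exact_match` helpers, and the use of exact-match scoring are hypothetical and only show why a correct but wordy answer can score zero until it is trimmed.

```python
import re
import string

# Hypothetical lead-in phrases a verbose model might emit; not taken from the paper.
LEAD_INS = re.compile(
    r"^(the answer is|answer:|it is|this is|based on the (context|image),?)\s*",
    re.IGNORECASE,
)

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def align_output(raw: str) -> str:
    """Keep the first sentence of the first non-empty line and drop a lead-in phrase."""
    first_line = next((ln.strip() for ln in raw.splitlines() if ln.strip()), "")
    first_sentence = re.split(r"(?<=[.!?])\s+", first_line, maxsplit=1)[0]
    return LEAD_INS.sub("", first_sentence).strip()

def exact_match(prediction: str, reference: str) -> bool:
    """Strict comparison after normalization (an assumed metric)."""
    return normalize(prediction) == normalize(reference)

raw = "The answer is Paris. Paris is the capital of France and is known for..."
reference = "Paris"
print(exact_match(raw, reference))                # False: correct answer buried in noise
print(exact_match(align_output(raw), reference))  # True: same answer after trimming
```

A rule-based trim like this is cheap enough for edge deployment, but it is only one possible design; a real protocol would need rules tuned to the model's actual output patterns.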

Abstract

Traditional Visual Question Answering (VQA) models have a significant limitation: they can neither store nor use up-to-date external world knowledge. Although Retrieval-Augmented Generation (RAG) can address the problem of hallucination, current research focuses on massive, computation-heavy models, which leaves open whether RAG is effective in resource-constrained environments. This paper presents a comprehensive study of building a lightweight multimodal RAG (MM-RAG) pipeline on consumer-grade hardware. Specifically, it compares two small language models: TinyLlama (1.1B) and Qwen 2.5 (3B). The results show, first, that access to RAG raises accuracy significantly (by about 13%-16%) over zero-shot baselines, regardless of model scale. Second, and most importantly, the study identifies a verbosity failure in instruction-tuned small language models (SLMs): they are more likely to generate output noise, which leads to low evaluation scores. To address this, the research developed a post-processing protocol that restored Qwen's accuracy from 8% to 52.6%. These findings illustrate that for edge-deployed VQA systems, rigorous output alignment is as critical as the retrieval mechanism itself.
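
As a rough illustration of the retrieval-augmented setup the abstract describes, the sketch below retrieves a supporting passage and places it in the prompt ahead of the question. It is not the paper's pipeline: the bag-of-words retriever stands in for whatever embedding model was used, and the passages, the image caption, the prompt wording, and the `retrieve` and `build_prompt` helpers are all hypothetical.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = bow(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, caption: str, context: list[str]) -> str:
    """Assemble retrieved context, an image description, and the question."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        f"Context:\n{ctx}\n"
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer with a single word or short phrase."
    )

# Toy knowledge base and query, invented for this example.
passages = [
    "The Eiffel Tower is located in Paris, France.",
    "The Colosseum is an amphitheatre in Rome, Italy.",
    "Mount Fuji is the highest mountain in Japan.",
]
question = "Which city is the Eiffel Tower located in?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, "a tall iron lattice tower", top)
print(prompt)
```

The closing instruction in the prompt asks for a short answer, which is one way to reduce verbosity before any post-processing is applied; whether the study used such an instruction is not stated in this summary.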

Cite This Study

Runkai Dong studied this question.

synapsesocial.com/papers/69d9e64e78050d08c1b76a17 https://doi.org/10.1051/itmconf/20268403023/pdf
