To address the trade-off between parameter scale and generation quality in Vision-Language Models (VLMs), this study proposes a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning model based on LLaMA. The model fuses CLIP-based global features with SAM-derived local features to build a multi-level visual representation, and a Dynamic Prompt Adapter adaptively aligns visual and textual semantics across modalities. Combined with a Low-Rank Adaptation (LoRA) fine-tuning strategy, the method improves the model's ability to describe diverse images while training only 20 million parameters, merely 0.05% of the total parameter count. Experiments show that the model achieves a CIDEr score of 126.7 on the MSCOCO dataset, 3.0 points higher than traditional adapter-based approaches. On the MME Benchmark, it outperforms the mainstream LLaMA-Adapter V2 by 7.3% on OCR and 3.8% on object counting. Ablation studies further validate the synergistic effects of multi-feature fusion and dynamic instruction optimization. This research offers an efficient approach to parameter-efficient multimodal model training and to potential deployment in resource-constrained environments.
Yin et al. studied this question.
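To make the parameter-efficiency argument behind LoRA concrete, the following is a minimal sketch of how the number of trainable parameters can be counted when a frozen weight matrix W is adapted as W + BA with a low rank. The hidden size, layer count, rank, and choice of adapted matrices are hypothetical values chosen for illustration; they are not the configuration reported for MFDIT and are not meant to reproduce the 20 million figure.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one frozen (d_out x d_in) weight matrix.

    LoRA learns an update B @ A, where A has shape (rank, d_in)
    and B has shape (d_out, rank); the original matrix stays frozen.
    """
    return rank * d_in + d_out * rank


def trainable_fraction(trainable: int, total: int) -> float:
    """Share of parameters that receive gradients."""
    return trainable / total


# Hypothetical settings for illustration only (assumptions, not from the paper).
HIDDEN = 4096            # assumed hidden size of the language model
LAYERS = 32              # assumed number of transformer layers
ADAPTED_PER_LAYER = 2    # assumed: two projection matrices adapted per layer
RANK = 8                 # assumed LoRA rank

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
lora_total = per_matrix * ADAPTED_PER_LAYER * LAYERS
full_matrix = HIDDEN * HIDDEN  # parameters in one full projection matrix

print(f"LoRA parameters per adapted matrix: {per_matrix:,}")
print(f"LoRA parameters across all layers:  {lora_total:,}")
print(f"Fraction of one full matrix:        {trainable_fraction(per_matrix, full_matrix):.4%}")
```

Under these assumed settings, each adapted matrix contributes rank × (d_in + d_out) trainable parameters instead of d_in × d_out, which is why the trainable share of a large model can stay well below one percent.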