What question did this study set out to answer?

The aim is to improve image captioning by balancing model size with output quality using a LLaMA-based approach.

February 14, 2026 | Open Access

A LLaMA-Based Efficient Fine-Tuning Method for Image Captioning Using Multi-Feature Dynamic Prompts

Key Points

Developed a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning model based on LLaMA.
Combined CLIP-based global features with SAM-derived local features to build a multi-level visual representation.
Designed a Dynamic Prompt Adapter for adaptive cross-modal semantic alignment.
Applied Low-Rank Adaptation (LoRA) fine-tuning, training only 20 million parameters (0.05% of the total).
Achieved a CIDEr score of 126.7 on the MSCOCO dataset.
Outperformed traditional adapter-based approaches by 3.0 CIDEr points on MSCOCO.
Outperformed LLaMA-Adapter V2 on the MME Benchmark by 7.3% in OCR and 3.8% in object counting.

Abstract

To address the trade-off between parameter scale and generation quality in Vision-Language Models (VLMs), this study proposes a Multi-Feature Dynamic Instruction Tuning (MFDIT) image captioning model based on LLaMA. By integrating CLIP-based global features with SAM-derived local features, the model constructs a multi-level visual representation. In addition, a Dynamic Prompt Adapter is designed to provide adaptive, flexible cross-modal semantic alignment. Combined with a Low-Rank Adaptation (LoRA) fine-tuning strategy, the proposed method improves the model's ability to describe diverse images while training only 20 million parameters, or 0.05% of the total parameter count. Experimental results show that the model achieves a CIDEr score of 126.7 on the MSCOCO dataset, surpassing traditional adapter-based approaches by 3.0 points. On the MME Benchmark, the proposed model outperforms the mainstream LLaMA-Adapter V2 by 7.3% and 3.8% in OCR and object counting tasks, respectively. Ablation studies further validate the synergistic effects of multi-feature fusion and dynamic instruction optimization. This research provides an efficient approach to parameter-efficient multimodal model training, with potential for deployment in resource-constrained environments.
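
The LoRA component mentioned in the abstract can be illustrated with a minimal sketch of the standard low-rank update, in which a frozen weight matrix W is adjusted by a trainable product B·A scaled by alpha / r. This is not the paper's implementation: the function names, matrix shapes, rank, and alpha below are illustrative assumptions, and plain Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A.

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) are trained.
    Shapes and the alpha / r scaling follow the usual LoRA formulation, not
    anything stated in the source text.
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Count the trainable values added by one LoRA pair (A and B)."""
    return rank * (d_in + d_out)


# Hypothetical toy example: a 2 x 2 frozen weight with a rank-1 adapter.
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]        # shape (1, 2)
lora_b_init = [[0.0], [0.0]]  # B is conventionally zero-initialised
adapted = lora_effective_weight(frozen_w, lora_a, lora_b_init, alpha=2.0)
```

With B initialised to zero, the adapted weight equals the frozen weight, so training starts from the pretrained model's behaviour. For a sense of scale under these assumptions, a 4096 x 4096 projection adapted at rank 8 adds 65,536 trainable values against 16,777,216 frozen ones, which is why this family of methods can keep the trainable share very small.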
