What does this research mean for the field?

The proposed interpolation method generates substitute images that maintain or improve the zero-shot performance of Vision-Language Models (VLMs) despite missing images in training datasets. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The study set out to develop a method for generating substitute images so that vision-language models keep their performance when training images are missing.

March 7, 2026 | Open Access

Maintaining VLM Performance with Latent Optimization-Based Image Synthesis

Key points

Aimed to develop a method for generating substitute images to maintain the performance of vision-language models in the face of missing images.
Proposed an interpolation method to generate images using latent variable optimization.
Utilized a Latent Diffusion Model to reflect characteristics of missing images based on associated text.
Applied the method to pretrain the vision-language model CLIP with generated substitute images.
Achieved comparable or improved zero-shot performance using generated images versus original datasets.
Demonstrated effectiveness in addressing issues of missing image links in large-scale datasets.
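
The idea of optimizing a latent variable against text information can be illustrated with a toy sketch. This is not the authors' implementation: the linear `decode` function stands in for a Latent Diffusion Model decoder plus image encoder, `text_emb` stands in for a text embedding, and the matrix `W`, the step count, and the learning rate are all hypothetical values chosen for illustration. The sketch only shows the general pattern of gradient ascent on a latent vector to raise the cosine similarity between a decoded feature and a text feature.

```python
import math

def decode(W, z):
    """Toy linear stand-in for 'decode latent, then embed the image'."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def optimize_latent(W, text_emb, z, steps=300, lr=0.1):
    """Gradient ascent on cosine(decode(W, z), text_emb) with respect to z."""
    nt = math.sqrt(sum(t * t for t in text_emb))
    for _ in range(steps):
        f = decode(W, z)
        nf = math.sqrt(sum(x * x for x in f))
        c = cosine(f, text_emb)
        # Gradient of the cosine with respect to the decoded feature f.
        grad_f = [t / (nf * nt) - c * x / (nf * nf) for x, t in zip(f, text_emb)]
        # Chain rule through the linear decoder: grad_z = W^T grad_f.
        grad_z = [sum(W[i][j] * grad_f[i] for i in range(len(W)))
                  for j in range(len(z))]
        z = [zj + lr * g for zj, g in zip(z, grad_z)]
    return z

# Hypothetical toy values (assumptions, not taken from the paper).
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 1.0]]
text_emb = [0.0, 1.0, 0.0]   # stand-in for the caption's embedding
z0 = [1.0, 0.0, 0.0]         # initial latent

before = cosine(decode(W, z0), text_emb)
z_opt = optimize_latent(W, text_emb, z0)
after = cosine(decode(W, z_opt), text_emb)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In a real pipeline the decoder would be nonlinear and the gradient would come from automatic differentiation rather than a hand-derived formula; the loop structure is the part this sketch is meant to convey.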

Abstract

This paper proposes a method for interpolating missing images in large-scale datasets used for training Vision-Language Models (VLMs). Recent large-scale datasets are often distributed not by hosting the image files directly on servers, but by providing CSV files that contain download links and the corresponding text for each image. As a result, many images become unavailable due to broken links, making it difficult to reproduce the VLM performance reported in previous studies. To address this issue, we propose an interpolation method that generates images reflecting the characteristics of the missing ones by optimizing the latent variables of a Latent Diffusion Model based on the associated text information. We applied this method to generate substitute images for pretraining a VLM, specifically CLIP, and confirmed that the resulting zero-shot performance was comparable to or even better than that obtained using the original dataset before image loss. These results demonstrate that the proposed method can serve as a practical approach for supplementing datasets with missing images.
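
As a rough illustration of the distribution format described above, the following sketch lists the rows of a link-based CSV whose images are no longer retrievable, so that their captions could be passed to a substitute-image generator. The column names `url` and `caption`, the sample rows, and the `find_missing` helper are hypothetical; actual datasets may use different headers, and the set of successfully downloaded URLs is assumed to come from a separate download step.

```python
import csv
import io

def find_missing(csv_text, available_urls):
    """Return (url, caption) pairs whose image could not be retrieved."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["url"], row["caption"])
            for row in reader
            if row["url"] not in available_urls]

# Hypothetical sample data (assumed column names: url, caption).
sample = """url,caption
https://example.com/a.jpg,a dog on a beach
https://example.com/b.jpg,a red bicycle
https://example.com/c.jpg,two cups of coffee
"""

# URLs that a prior download step managed to fetch (assumed input).
downloaded = {"https://example.com/a.jpg", "https://example.com/c.jpg"}

missing = find_missing(sample, downloaded)
for url, caption in missing:
    print(f"needs substitute: {caption!r} ({url})")
```

Only the captions of the missing rows are needed to drive text-conditioned generation, which is why the helper keeps the caption alongside each dead link.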
