This paper proposes a method for interpolating missing images in the large-scale datasets used to train Vision-Language Models (VLMs). Recent large-scale datasets are often distributed not as image files hosted on servers, but as CSV files that list a download link and the corresponding text for each image. As links break over time, many images become unavailable, which makes it difficult to reproduce the VLM performance reported in previous studies. To address this issue, we propose an interpolation method that generates images reflecting the characteristics of the missing ones by optimizing the latent variables of a Latent Diffusion Model based on the associated text. We applied this method to generate substitute images for pretraining a VLM, specifically CLIP, and confirmed that the resulting zero-shot performance was comparable to, or even better than, that obtained with the original dataset before image loss. These results demonstrate that the proposed method is a practical way to supplement datasets with missing images.
OHKUBO et al. (Wed,) studied this question.
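
The idea of optimizing a latent variable so that the generated image agrees with its associated text can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not state the objective or the optimizer, so the cosine-similarity objective, plain gradient ascent, and every name below (`W`, `text_emb`, `optimize_latent`) are assumptions made for illustration. The fixed matrix `W` is a hypothetical linear stand-in for the whole "decode the latent with a Latent Diffusion Model, then embed the image" pipeline, and `text_emb` stands in for the embedding of the caption.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, z):
    # Toy "latent -> image embedding" map; a real pipeline would decode
    # the latent with a diffusion model and embed the resulting image.
    return [dot(row, z) for row in W]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def cosine_grad(e, t):
    """Gradient of cosine(e, t) with respect to e."""
    ne, nt = math.sqrt(dot(e, e)), math.sqrt(dot(t, t))
    c = dot(e, t) / (ne * nt)
    return [ti / (ne * nt) - c * ei / (ne * ne) for ei, ti in zip(e, t)]


def optimize_latent(W, text_emb, z0, steps=200, lr=0.5):
    """Gradient ascent on the latent z to raise cosine(W z, text_emb).

    Assumed objective and optimizer, chosen only to keep the sketch small.
    """
    z = list(z0)
    for _ in range(steps):
        e = matvec(W, z)
        g_e = cosine_grad(e, text_emb)
        # Chain rule through the linear map: grad_z = W^T g_e.
        g_z = [sum(W[i][j] * g_e[i] for i in range(len(W)))
               for j in range(len(z))]
        z = [zj + lr * gj for zj, gj in zip(z, g_z)]
    return z


# Hypothetical toy values: a 3-dimensional latent, a 2-dimensional embedding.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.5]]
text_emb = [1.0, 1.0]
z0 = [1.0, -1.0, 0.2]

before = cosine(matvec(W, z0), text_emb)
z_opt = optimize_latent(W, text_emb, z0)
after = cosine(matvec(W, z_opt), text_emb)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the similarity between the mapped latent and the text embedding rises as the latent is updated. In a setting closer to the one described above, the linear map would be replaced by a differentiable generator and image encoder, and the gradient would be obtained by automatic differentiation rather than by hand.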