Key points are not available for this paper at this time.
This paper presents VisualSiteDiary, a Vision Transformer-based image captioning model which creates human-readable captions for daily progress and work activity log, and enhances image retrieval tasks. As a model for deciphering construction photologs, VisualSiteDiary incorporates pseudo-region features, utilizes high-level knowledge in pretraining, and fine-tunes for diverse captioning styles. To validate VisualSiteDiary, a new image captioning dataset, VSD, is presented. This dataset includes many realistic yet challenging cases commonly observed in commercial building projects. Experimental results using five different metrics demonstrate that VisualSiteDiary provides superior-quality captions compared to the state-of-the-art image captioning models. Excluding the task of object recognition, the presented model also outperformed mPLUG –the state-of-the-art visual-language model– in the image retrieval task by 0.6% in precision and 0.9% in recall, respectively. Detailed discussions illustrate practical examples on how VisualSiteDiary improves the process of creating daily construction reports, paving the way for future developments in the field.
Jung et al. (Thu,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: