Abstract Artificial intelligence (AI) opens new possibilities for processing and analysing large, heterogeneous historical data corpora in a semi-automated way. The Ottoman Nature in Travelogues (ONiT) project applies a fine-tuned Contrastive Language–Image Pre-Training (CLIP) model for retrieving images with nature representations in digitized early book prints based on embeddings of visual features rather than on textual metadata. In this article, we present results of our work, including a curated and annotated dataset of 8,042 images of nature representations, and the CLIP-based text–image exploration tool ONiT Explorer. An evaluation of the fine-tuned model comparing it to the zero-shot model confirms the potential of vision-language models for retrieving specific contents from large image collections in the cultural heritage and digital humanities domains. While in general our fine-tuned model can retrieve more correct examples per class compared to the zero-shot model, our analysis also reveals some limitations that need to be addressed in future explorations.
Vignoli et al. (Sun,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: