What question did this study set out to answer?

This research aims to develop MetaDent to enhance vision-language model applications in dentistry through annotated clinical images and benchmark datasets.

March 18, 2026 | Open Access

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Key Points

Curated a large-scale dataset of 60,669 dental images from clinical, public, and web sources.
Developed a semistructured annotation framework that pairs a high-level image summary with point-by-point, free-text descriptions of abnormalities, and applied it to a representative subset of 2,588 images.
Used large language models to derive approximately 15,000 visual question answering (VQA) pairs and an 18-class multilabel classification dataset, validated by human review and error analysis.
Evaluated state-of-the-art vision-language models on VQA, classification, and image captioning.
Even the most advanced models scored below 70% accuracy on VQA.
Models produced inconsistent or incomplete descriptions in image captioning.
Findings highlight the gap between general-purpose vision-language models and the demands of this specialized domain, pointing to the need for domain-adapted training and more sophisticated evaluation protocols.

Abstract

Vision-language models (VLMs) have demonstrated significant potential in medical image analysis, yet their application to intraoral photography remains largely underexplored because fine-grained annotated datasets and comprehensive benchmarks are lacking. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel, large-scale dental image dataset collected from clinical, public, and web sources; (2) a semistructured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities, which enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging large language models (LLMs), we derive standardized benchmarks from these annotations: approximately 15,000 visual question answering (VQA) pairs and an 18-class multilabel classification dataset. We validated both with human review and error analysis to confirm that the LLM-driven conversion preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with fine-grained understanding of intraoral scenes, achieving only moderate accuracy (e.g., less than 70% in VQA) and producing inconsistent or incomplete descriptions in image captioning. These findings underscore the gap between general-purpose VLMs and the demands of specialized models, highlighting the need for domain-adapted training and more sophisticated evaluation protocols to assist professional dental practice and community oral health efforts. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
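
To make the labeling scheme and the derived benchmarks more concrete, the following is a minimal sketch of how one meta-label record (a summary plus itemized findings), a multilabel class vector, and a VQA accuracy score might be represented. It is an illustration under stated assumptions, not the released MetaDent format or tooling: the names `MetaLabel`, `to_multilabel`, and `vqa_accuracy`, the three class names, the multiple-choice answer letters, and the exact-match scoring rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetaLabel:
    """One semistructured annotation: an overall summary plus itemized findings."""
    image_id: str
    summary: str  # high-level description of the whole image
    findings: list[str] = field(default_factory=list)  # one free-text entry per abnormality


def to_multilabel(assigned: set[str], classes: list[str]) -> list[int]:
    """Encode a set of class names as a 0/1 vector over a fixed class order."""
    unknown = assigned - set(classes)
    if unknown:
        raise ValueError(f"labels outside the class list: {sorted(unknown)}")
    return [1 if c in assigned else 0 for c in classes]


def vqa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions whose prediction matches the reference (case-insensitive)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(p.strip().lower() == a.strip().lower()
               for p, a in zip(predictions, answers))
    return hits / len(answers)


# Hypothetical example values, not drawn from the released dataset.
# The real benchmark has 18 classes; three placeholder names are used here.
CLASSES = ["caries", "calculus", "gingivitis"]

label = MetaLabel(
    image_id="img_0001",
    summary="Frontal intraoral view of the anterior teeth.",
    findings=[
        "Dark lesion on the upper left central incisor.",
        "Swollen gingival margin near the lower incisors.",
    ],
)

# An image can carry several classes at once, hence a vector rather than one label.
vector = to_multilabel({"caries", "gingivitis"}, CLASSES)

# Assumes multiple-choice questions answered with a single option letter.
accuracy = vqa_accuracy(["B", "a", "C", "D"], ["B", "A", "D", "D"])
```

Keeping the findings as a free-text list, rather than a fixed set of fields, mirrors the task-agnostic intent described in the abstract: the same record can later be converted into question-answer pairs, class vectors, or caption references.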
