What question did this study set out to answer?

The study aims to assess the reliability and diagnostic performance of multimodal AI models for dermatology triage.

May 9, 2026 | Open Access

57 Translational evaluation of multimodal artificial intelligence for dermatology triage

Key Points

The study aims to assess the reliability and diagnostic performance of multimodal AI models for dermatology triage.
Six AI models analyzed 200 biopsy-confirmed lesions from the PAD-UFES-20 dataset.
Evaluated performance metrics including sensitivity, specificity, and AUROC.
Tested reproducibility through subset re-prompting for translational robustness.
GPT-5 showed the most balanced performance, with sensitivity of 0.92 and specificity of 0.65.
Gemini models achieved perfect sensitivity (1.00) but had low specificity (0.21–0.25).
Urgent triage calls aligned with dermatologist biopsy patterns at rates of 87% to 97% across models.

Abstract

Objectives/Goals: To evaluate the translational reliability, reproducibility, diagnostic performance, and subgroup equity of multimodal artificial intelligence (AI) models for dermatology triage across multiple model platforms.
Methods/Study Population: Limited access to dermatology expertise delays diagnosis and care, motivating development of multimodal AI systems that integrate clinical images with patient data for triage. We assembled 200 biopsy-confirmed PAD-UFES-20 lesions (melanoma, keratinocyte carcinoma, benign) with paired images and metadata, prioritizing demographic balance. Six multimodal AI models (GPT-5, GPT-5-mini; Gemini 2.5 Pro, Gemini 2.5 Flash; Claude Sonnet-4, Claude Opus-4) analyzed these lesions with identical prompts, predicting diagnostic probabilities, triage (urgent vs routine), and rationale. Outcomes included sensitivity, specificity, AUROC, F1, and subgroup equity. Model rationales were reviewed for interpretability, and subset re-prompting tested reproducibility for translational robustness.
Results/Anticipated Results: Across the six models, sensitivity ranged from 0.89 to 1.00, specificity from 0.21 to 0.65, AUROC from 0.77 to 0.87, and F1 from 0.72 to 0.81. GPT-5 achieved the most balanced performance (0.92 sensitivity, 0.65 specificity, AUROC 0.87, F1 0.81), while Gemini 2.5 Pro and Flash reached perfect sensitivity but low specificity (0.21–0.25). Claude Sonnet-4 showed near-perfect sensitivity (0.99) but over-called benign cases (0.24 specificity), while Opus-4 had the lowest sensitivity (0.89). Urgent triage aligned with dermatologist biopsy patterns (87–97%), and sensitivity was consistent across sex and skin type (p ≥ 0.29). Subset re-prompting produced similar results, supporting reproducibility. Model rationales reflected dermatologic reasoning, supporting interpretability and translational readiness.
Discussion/Significance of Impact: Multimodal AI models showed balanced diagnostic performance for dermatology triage, with platform-specific trade-offs between sensitivity and specificity. Subgroup equity, interpretable rationales, and subset reproducibility define key elements for reliable translation into dermatology workflows and prospective validation.
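For readers unfamiliar with the reported outcomes, the following is a minimal sketch of how sensitivity, specificity, F1, and AUROC can be computed for a binary urgent-vs-routine triage task. It is not the study's code. It assumes that malignant lesions (melanoma and keratinocyte carcinoma) form the positive class and that each model emits a malignancy probability; the abstract does not state either convention. The function names `triage_metrics` and `auroc` and the toy arrays are hypothetical and are not data from the study.

```python
from typing import Dict, Sequence


def triage_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity, specificity, and F1 for binary triage.

    Assumed encoding (not stated in the abstract):
    y_true: 1 = biopsy-confirmed malignant, 0 = benign
    y_pred: 1 = model says urgent, 0 = model says routine
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only, not results from the study.
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_triage = [1, 1, 1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(triage_metrics(toy_truth, toy_triage))
print(auroc(toy_truth, toy_scores))
```

Under these definitions, a model that labels nearly every lesion urgent reaches sensitivity close to 1.00 while specificity falls, which is the trade-off the abstract describes across platforms.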
