Objectives/Goals: To evaluate the translational reliability, reproducibility, diagnostic performance, and subgroup equity of multimodal artificial intelligence (AI) models for dermatology triage across multiple model platforms.
Methods/Study Population: Limited access to dermatology expertise delays diagnosis and care, motivating the development of multimodal AI systems that integrate clinical images with patient data for triage. We assembled 200 biopsy-confirmed lesions from PAD-UFES-20 (melanoma, keratinocyte carcinoma, and benign) with paired images and metadata, prioritizing demographic balance. Six multimodal AI models (GPT-5 and GPT-5-mini; Gemini 2.5 Pro and Gemini 2.5 Flash; Claude Sonnet-4 and Claude Opus-4) analyzed these lesions using identical prompts, returning diagnostic probabilities, a triage decision (urgent vs routine), and a rationale. Outcomes included sensitivity, specificity, AUROC, F1 score, and subgroup equity. Model rationales were reviewed for interpretability, and a subset of cases was re-prompted to test reproducibility as a measure of translational robustness.
Results/Anticipated Results: Across the six models, sensitivity ranged from 0.89 to 1.00, specificity from 0.21 to 0.65, AUROC from 0.77 to 0.87, and F1 score from 0.72 to 0.81. GPT-5 achieved the most balanced performance (sensitivity 0.92, specificity 0.65, AUROC 0.87, F1 0.81), while Gemini 2.5 Pro and Gemini 2.5 Flash reached perfect sensitivity but low specificity (0.21–0.25). Claude Sonnet-4 showed near-perfect sensitivity (0.99) but over-called benign cases (specificity 0.24), while Claude Opus-4 had the lowest sensitivity (0.89). Urgent triage aligned with dermatologist biopsy patterns (87–97%), and sensitivity was consistent across sex and skin type (p ≥ 0.29). Subset re-prompting produced similar results, supporting reproducibility. Model rationales reflected dermatologic reasoning, supporting interpretability and translational readiness.
Discussion/Significance of Impact: Multimodal AI models showed balanced diagnostic performance for dermatology triage, with platform-specific trade-offs between sensitivity and specificity. Subgroup equity, interpretable rationales, and subset reproducibility are key elements for reliable translation into dermatology workflows and for prospective validation.
Golbasi et al. (Wed,) studied this question.
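The outcome measures named in the Methods (sensitivity, specificity, F1 score, and AUROC) can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names `triage_metrics` and `auroc`, the 0.5 decision threshold, and the toy labels and scores below are all hypothetical and are not drawn from the study data. Label 1 stands for a malignant lesion (urgent triage warranted) and 0 for a benign lesion.

```python
from typing import Dict, Sequence


def triage_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and F1 from binary labels (1 = malignant/urgent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    nan = float("nan")
    return {
        # Fraction of malignant lesions correctly triaged as urgent.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Fraction of benign lesions correctly triaged as routine.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Harmonic mean of precision and recall, written in count form.
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan,
    }


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random malignant case outscores a
    random benign case (ties count one half). O(n_pos * n_neg), which is
    fine for a few hundred lesions."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 malignant and 4 benign lesions with model-assigned
# malignancy probabilities (illustrative only, not study results).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 urgent threshold

metrics = triage_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
print(metrics, auc)
```

On this toy input, one malignant lesion is missed and one benign lesion is over-called, so sensitivity, specificity, and F1 are all 0.75, while the threshold-free AUROC is 0.9375. Lowering the threshold would raise sensitivity at the cost of specificity, which is the kind of trade-off the abstract reports across platforms.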