March 3, 2026 | Open Access

Diagnostic Accuracy of Artificial Intelligence Applications on a Diverse Skin Image Set

Key Points

All applications demonstrated low diagnostic accuracy, with an overall accuracy rate of 22%.
The average sensitivity for classifying lesions as benign or malignant was 46.57%.
The apps were tested on skin images of diverse skin tones from the Stanford Diverse Dermatology Images (DDI) database.
These AI apps are not reliable as standalone diagnostic tools for skin lesions.

Abstract

Background: Several new mobile applications (apps) have been developed that utilize artificial intelligence (AI) to diagnose skin lesions.

Objective: The goal of this study was to evaluate the diagnostic accuracy of the most popular smartphone apps using a database of skin lesion images with diverse skin tones. An additional goal was to measure the apps' sensitivity and specificity in detecting skin cancer.

Methods: A thorough search was performed in the Google Play Store and Apple App Store to find the most popular skin apps that diagnose skin lesions. We used the Stanford Diverse Dermatology Images database (DDI) to test the accuracy of the following apps: ChatGPT (OpenAI, San Francisco, CA, USA), AI skin scanner Rash Detector (I Lov Guitars Inc., Scarborough, ON), Rash ID (Appsmiths LLC, Canton, MS, USA), and Skin Scanner Dermatology & Acne (ACINA, UAB, located at Krokuvos, Vilnius, Lithuania). One hundred and two images with a range of diagnoses were selected for upload to each app. Fifty-one images were malignant, and 51 were benign. We also trained a new model of ChatGPT using a separate set of 554 images from the same database.

Results: All the apps had low diagnostic accuracy. The overall accuracy was 22%. When classifying benign versus malignant diagnoses, the apps had an average sensitivity of 46.57% and an average specificity of 72.06%. The average positive predictive value was 67.44%, and the average negative predictive value was 58.06%. In our study, training ChatGPT did not improve its diagnostic accuracy.

Conclusions: ChatGPT, Rash Detector, Rash ID, and Skin Scanner Dermatology & Acne performed poorly at diagnosing skin lesions from a database with diverse skin tones. These apps should not be used as stand-alone diagnostic tools.
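For readers less familiar with the four benign-versus-malignant metrics reported in the abstract, the following is a minimal sketch of how they are conventionally computed from one app's confusion matrix. The function name `diagnostic_metrics` and the counts in the example are hypothetical and chosen only for illustration; they are not data from the study. Because the study reports averages across apps, the published figures need not correspond to any single confusion matrix.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Return sensitivity, specificity, PPV, and NPV from confusion-matrix counts.

    "Positive" here means the app labeled the lesion as malignant:
      tp = malignant lesions labeled malignant
      fn = malignant lesions labeled benign
      tn = benign lesions labeled benign
      fp = benign lesions labeled malignant
    """
    if min(tp, fn, tn, fp) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0 or tn + fn == 0:
        raise ValueError("each metric needs a non-zero denominator")
    return {
        "sensitivity": tp / (tp + fn),  # share of malignant lesions caught
        "specificity": tn / (tn + fp),  # share of benign lesions cleared
        "ppv": tp / (tp + fp),          # share of "malignant" calls that were right
        "npv": tn / (tn + fn),          # share of "benign" calls that were right
    }


# Hypothetical counts for a single app on 51 malignant and 51 benign images.
# Illustrative only; NOT results reported by the study.
example = diagnostic_metrics(tp=24, fn=27, tn=37, fp=14)
for name, value in example.items():
    print(f"{name}: {value:.2%}")
```

Sensitivity and specificity depend only on how each class is handled, while the predictive values also depend on the class balance, which in this test set was fixed at 51 malignant to 51 benign images.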

Cite This Study

Shah et al. studied this question.

synapsesocial.com/papers/69a75a90c6e9836116a208c5 https://doi.org/10.7759/cureus.102354