Deep Learning-Based, Computer-Aided Classifier Developed with Dermoscopic Images Shows Comparable Performance to 164 Dermatologists in Cutaneous Disease Diagnosis in the Chinese Population
In China, the diagnosis of skin diseases is frequently delayed due to a severe shortage of dermatologists. The dermatologist-to-patient ratio is as low as 1:60,000, with most well-trained and experienced dermatologists concentrated in large cities. This scarcity is particularly acute in rural areas, where limited clinical experience and learning opportunities among general physicians often lead to misdiagnoses or delayed treatments. To address this issue, a deep learning-based diagnosis support system has been developed to facilitate pre-screening of patients, thereby prioritizing dermatologists’ efforts and improving diagnostic accuracy. This study evaluates the classification sensitivity and specificity of deep learning models in diagnosing skin tumors and psoriasis in the Chinese population using a relatively modest number of dermoscopic images.
The study developed a convolutional neural network (CNN) using two datasets from patients who underwent dermoscopy at the Department of Dermatology, Peking Union Medical College Hospital, between 2016 and 2018. Dataset I consisted of 7,192 dermoscopic images for a multi-class model designed to differentiate the three most common skin tumors—basal cell carcinoma (BCC), melanocytic nevus (MN), and seborrheic keratosis (SK)—from other diseases. Dataset II included 3,115 dermoscopic images for a two-class model aimed at classifying psoriasis versus other inflammatory diseases. The performance of the CNN was compared to that of 164 dermatologists in a reader study involving 130 dermoscopic images. The reference standard for diagnosis was expert consensus, except for BCC cases, which were confirmed by histopathology.
The results demonstrated that the multi-class model achieved an accuracy of 81.49% ± 0.88%, while the two-class model achieved an accuracy of 77.02% ± 1.81%. In the reader study, the multi-class model showed comparable sensitivity and specificity to the dermatologists. For BCC, the dermatologists achieved a sensitivity of 0.770 and specificity of 0.962, while the CNN achieved 0.800 and 1.000, respectively. For MN, the dermatologists’ sensitivity and specificity were 0.807 and 0.897, compared to the CNN’s 0.800 and 0.840. For SK, the dermatologists’ sensitivity and specificity were 0.624 and 0.976, while the CNN achieved 0.850 and 0.940. For the “others” group, the dermatologists’ sensitivity and specificity were 0.939 and 0.875, compared to the CNN’s 0.750 and 0.940. In the two-class task, the dermatologists’ sensitivity and specificity for classifying psoriasis were 0.872 and 0.838, while the CNN achieved 1.000 and 0.605. Both the dermatologists and the CNN achieved at least moderate consistency with the reference standard, with no significant difference in Kappa coefficients.
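The per-class figures above are one-vs-rest metrics: each disease is treated in turn as the positive class and all other labels as negative. The following is a minimal sketch of how such sensitivity, specificity, and agreement values can be computed. It assumes the Kappa coefficient is Cohen's kappa, which the summary does not state explicitly, and the label lists are invented toy data, not the study's reader-study images.

```python
from collections import Counter


def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement with the reference, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels (reference standard vs. one reader or model).
reference = ["BCC", "BCC", "MN", "MN", "SK", "SK", "other", "other"]
predicted = ["BCC", "MN", "MN", "MN", "SK", "other", "other", "other"]

sens, spec = sensitivity_specificity(reference, predicted, positive="BCC")
kappa = cohen_kappa(reference, predicted)
print(f"BCC sensitivity={sens:.3f} specificity={spec:.3f} kappa={kappa:.3f}")
```

A kappa between roughly 0.41 and 0.60 is conventionally read as moderate agreement, which is the sense in which "at least moderate consistency" is usually meant.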
The study highlights the potential of deep learning-based models to assist in the diagnosis of skin diseases, particularly in regions with limited access to dermatologists. The CNN models developed in this study, despite being trained on a relatively modest number of images, performed comparably to a large group of board-certified dermatologists. This suggests that such models could be used as pre-screening tools in primary care hospitals to prioritize cases for dermatologists, thereby improving the efficiency and accuracy of skin disease diagnosis.
The datasets used in the study were collected from the dermatology department of Peking Union Medical College Hospital, with all images obtained using a MoleMax HD 1.0 dermoscope. The images were annotated by experts with more than five years of experience, and any disagreements were resolved by a third expert. Images with poor focus, multiple lesions, or interference factors such as clothing fibers, written notes, or hair were excluded. The datasets were divided into training, validation, and testing sets in an 8:1:1 ratio, with ten-fold cross-validation performed to ensure robustness. The CNN was built on the pre-trained GoogLeNet Inception v3 architecture, with only the final layer retrained on the study's images. The model used the ReLU activation function and was trained with a gradient descent optimizer at a learning rate of 0.01, minimizing the mean cross-entropy loss.
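Retraining only the final layer amounts to fitting a softmax classifier on fixed feature vectors produced by the frozen pre-trained network. The sketch below illustrates that idea in plain Python with gradient descent at a learning rate of 0.01 and a mean cross-entropy loss. It is not the authors' code: the three-dimensional feature vectors, the four class indices, and all function names are hypothetical stand-ins chosen to keep the example small.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(weights, biases, x):
    """Linear final layer: one weight row and one bias per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def mean_cross_entropy(weights, biases, features, labels):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for x, y in zip(features, labels):
        total -= math.log(softmax(class_logits(weights, biases, x))[y])
    return total / len(features)


def gradient_descent_step(weights, biases, features, labels, learning_rate=0.01):
    """One full-batch gradient descent update of the final layer."""
    n = len(features)
    grad_w = [[0.0] * len(row) for row in weights]
    grad_b = [0.0] * len(biases)
    for x, y in zip(features, labels):
        probs = softmax(class_logits(weights, biases, x))
        for c, p in enumerate(probs):
            err = p - (1.0 if c == y else 0.0)  # d(loss)/d(logit_c)
            grad_b[c] += err / n
            for j, xj in enumerate(x):
                grad_w[c][j] += err * xj / n
    new_w = [[w - learning_rate * g for w, g in zip(wr, gr)]
             for wr, gr in zip(weights, grad_w)]
    new_b = [b - learning_rate * g for b, g in zip(biases, grad_b)]
    return new_w, new_b


# Hypothetical frozen-network features; labels 0..3 stand for BCC, MN, SK, others.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
labels = [0, 1, 2, 3]

weights = [[0.0] * 3 for _ in range(4)]
biases = [0.0] * 4
initial_loss = mean_cross_entropy(weights, biases, features, labels)
for _ in range(200):
    weights, biases = gradient_descent_step(weights, biases, features, labels)
final_loss = mean_cross_entropy(weights, biases, features, labels)
print(f"loss before={initial_loss:.4f} after={final_loss:.4f}")
```

Because the pre-trained layers stay fixed, only the small final-layer weight matrix is learned, which is one reason this kind of transfer learning can work with a modest number of images.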
The study also employed t-distributed Stochastic Neighbor Embedding (t-SNE) plots to visualize the internal features learned by the CNN. These plots showed that similar images were clustered together, demonstrating the model’s ability to distinguish between different skin diseases based on dermoscopic features. The multi-class model’s confusion matrix revealed that all categories achieved at least 80% classification accuracy, with the probability of misdiagnosis as one of the other categories being less than 12%.
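Statements such as "at least 80% classification accuracy" and "less than 12%" misdiagnosis are read off a row-normalized confusion matrix, where each row is a true class and each column a predicted class. Here is a small sketch of that calculation. The counts are invented for illustration and are not the study's confusion matrix, and the helper names are hypothetical.

```python
def row_normalize(confusion):
    """Convert raw counts to per-true-class proportions (each row sums to 1)."""
    return [[count / sum(row) for count in row] for row in confusion]


def per_class_accuracy(normalized):
    """Diagonal entries: fraction of each true class predicted correctly."""
    return [normalized[i][i] for i in range(len(normalized))]


def worst_confusion(normalized):
    """Largest off-diagonal entry: the most frequent single misclassification."""
    return max(value
               for i, row in enumerate(normalized)
               for j, value in enumerate(row)
               if i != j)


# Hypothetical counts; rows and columns are ordered BCC, MN, SK, others.
confusion = [
    [85, 5, 4, 6],
    [3, 88, 5, 4],
    [6, 4, 82, 8],
    [5, 3, 10, 82],
]

normalized = row_normalize(confusion)
accuracies = per_class_accuracy(normalized)
worst = worst_confusion(normalized)
print("per-class accuracy:", [round(a, 2) for a in accuracies])
print("largest misclassification rate:", round(worst, 2))
```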
The study’s findings are consistent with previous research demonstrating the effectiveness of deep learning in skin disease classification. For example, Esteva et al. (2017) showed that a CNN could achieve dermatologist-level classification of skin cancer using a large dataset of 129,450 images. Similarly, Fujisawa et al. (2018) demonstrated that a CNN trained on a smaller dataset of 4,867 images achieved higher accuracy than board-certified dermatologists in classifying 14 skin tumors. The current study builds on these findings by focusing on the Chinese population and including inflammatory skin conditions such as psoriasis in the classification model.
Despite its promising results, the study has several limitations. First, the dataset was derived from a single hospital, which may limit the generalizability of the findings. Second, the classification was based solely on dermoscopic images, whereas dermatologists also consider clinical images and additional information such as patient history, lesion location, and how the lesion feels on palpation. Future studies could explore the integration of multiple data sources to improve diagnostic accuracy. Third, the dataset covered only 11 diseases, which is a small subset of the full spectrum of skin lesions encountered in clinical practice. Expanding the dataset to include more diseases could enhance the model's utility.
In conclusion, this study demonstrates that deep learning-based models trained on relatively modest datasets of dermoscopic images can achieve diagnostic performance comparable to that of board-certified dermatologists. The multi-class and two-class models developed in this study could serve as valuable tools for pre-screening patients in primary care settings, particularly in regions with limited access to dermatologists. Future research should focus on expanding the dataset, integrating additional data sources, and validating the models in diverse clinical settings to further improve their accuracy and applicability.
doi.org/10.1097/CM9.0000000000001023