Detection of Malignant Melanoma Using Artificial Intelligence: An Observational Study of Diagnostic Accuracy

Methods: DERM was trained and tested using 7,102 dermoscopic images of both histologically confirmed melanoma (24%) and benign pigmented lesions (76%). A meta-analysis was conducted of studies examining the accuracy of naked-eye examination, with or without dermoscopy, by specialist and general physicians whose clinical diagnosis was compared to histopathology. The meta-analysis was based on evaluation of 32,226 pigmented lesions including 3,277 histopathology-confirmed malignant melanoma cases. The receiver operating characteristic (ROC) curve was used to examine and compare the diagnostic accuracy.

such as the recent Cochrane reviews of skin cancer. The image dataset was collated from several different sources including the PH2 dataset [29], Interactive Atlas of Dermoscopy [30], and ISIC archive [31].

Methods
An additional 672 dermoscopic lesion images were collected from a variety of other sources. The ISIC archive contains a large number of images obtained from children, which are easy to classify as benign. Their inclusion in the dataset was found to optimistically bias results, so they were excluded from the development work. The ISIC archive also contains a large number of identical and near-identical images, which were removed from the dataset.
The involved smartphone photography and 4 provided an estimate of the probability of malignancy. None of these apps had been assessed for diagnostic accuracy [17]. Understandably, there is concern about the possible harm to patients that poorly designed, inaccurate, and/or misleading consumer apps may cause [18][19][20].

Introduction
Malignant melanoma (MM) is less common than basal and squamous cell skin cancer; however, the incidence of MM is increasing faster than that of other forms of cancer and it is responsible for the majority of skin cancer deaths [1]. MM diagnosed early (stage 1) has a five-year relative survival rate of more than 95%, compared with 8% to 25% for MM diagnosed at later stages [2].

Current practice guidelines in the United Kingdom recommend appropriately trained health care professionals assess all suspect pigmented lesions using dermoscopy [1,3]. Diagnosis is confirmed with biopsy, histological examination, and specialist pathological interpretation. Pressure to diagnose MM early leads to a high proportion of benign pigmented lesions being referred from primary care to specialist care, and a large proportion of biopsied lesions are found to be benign [4,5]. This creates increased demands on overburdened secondary care and pathology service resources [6]. Improved accuracy of pigmented lesion review in primary care would help reduce this pressure.
However, the diagnostic accuracy is still dependent on the degree of experience of the examiners and the equipment required is costly [16].
Conclusions: DERM has the potential to be used as a decision support tool in primary care, by providing dermatologist-grade recommendation on the likelihood of malignant melanoma.

ABSTRACT
number of MM diagnoses confirmed by histology, from which the counts could be derived. The reports were also examined for information concerning physician experience (general vs specialist physician) and context of use (primary care, secondary care). A meta-analysis of these data was conducted. The Stata user-written packages METANDI [42] and MIDAS [43] were used, and a meta-regression was used to examine associations between diagnostic accuracy and year of study report,
[32]. The gold standard for MM was histopathology. We examined different cut-points used by DERM to categorize lesions as positive or negative, ie, illustrating alternative diagnostic rules from the diagnostic model [33]. The methods of Youden [34] and Liu [35] were used, as well as the values that maximized the ROC area, resulted in a sensitivity and a specificity of 95%, and generated fewer than 1% false negatives. The area under the curve (AUC) of the ROC curve, specificity/sensitivity, and diagnostic odds ratios were calculated for each of these cut-points.
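As a rough illustration of the cut-point selection described above, the following sketch picks the cut-point that maximizes Youden's index (sensitivity + specificity - 1) and the one that maximizes the Liu criterion (sensitivity x specificity), and computes a diagnostic odds ratio. The scores, labels, and function names are hypothetical and chosen only for illustration; this is not DERM's code or data.

```python
def sens_spec(scores, labels, cut):
    """Sensitivity and specificity when scores >= cut are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cut)
    fn = sum(1 for s, y in pairs if y == 1 and s < cut)
    tn = sum(1 for s, y in pairs if y == 0 and s < cut)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cut)
    return tp / (tp + fn), tn / (tn + fp)


def youden(se, sp):
    """Youden's index J = sensitivity + specificity - 1."""
    return se + sp - 1.0


def liu(se, sp):
    """Liu criterion: product of sensitivity and specificity."""
    return se * sp


def best_cut(scores, labels, criterion):
    """Observed score that maximizes the criterion (first one wins on ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: criterion(*sens_spec(scores, labels, c)))


def diagnostic_odds_ratio(se, sp):
    """DOR = odds of a positive test in disease / odds in non-disease.

    Undefined when se or sp equals 1 (division by zero).
    """
    return (se / (1.0 - se)) / ((1.0 - sp) / sp)


# Hypothetical malignancy probabilities (1 = melanoma, 0 = benign).
scores = [0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]

youden_cut = best_cut(scores, labels, youden)
liu_cut = best_cut(scores, labels, liu)
print(youden_cut, liu_cut, sens_spec(scores, labels, youden_cut))
```

On real data the two criteria need not agree; they coincide here only because the toy dataset is small.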
The ROC AUC is not a perfect assessment measure for diagnostic methods when the standard error of the estimator is quite different for the diagnostic alternatives (benign pigmented lesions vs MM), as is the case for DERM (see Figure 1) [36]. This issue was addressed by constructing the Lorenz curve (a mirror image of the ROC curve) with the associated Gini index [37].
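The link between the ROC area and the Gini index can be sketched as follows, assuming the commonly used relation Gini = 2 x AUC - 1; the exact Lorenz-curve construction in [37] may differ in detail. The ROC points and function names are hypothetical.

```python
def auc_trapezoid(roc_points):
    """Area under a ROC curve given (false positive rate, sensitivity) points."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return area


def gini_from_auc(auc):
    """Gini index under the assumed relation Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


# Hypothetical three-point ROC curve, including the (0, 0) and (1, 1) corners.
roc = [(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]
auc = auc_trapezoid(roc)
gini = gini_from_auc(auc)
print(round(auc, 3), round(gini, 3))
```

Under this relation, an AUC of 0.5 (chance) maps to a Gini index of 0 and an AUC of 1 maps to 1.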
To compare the accuracy of DERM with that of current diagnostic practices, we decided to conduct a meta-analysis of studies of diagnostic accuracy for MM rather than have a limited panel of dermatologists conduct parallel assessments, as has been done in other studies [21,38]. We chose this approach because biopsy-based histopathology provides the gold standard for MM diagnosis,
Table 3, where experts have both higher sensitivity and specificity than nonexperts, and is most marked for specificity for both methods and for sensitivity only for dermoscopy (Figure 6).
results confirm that clinician experience and use of dermoscopy improve accuracy. DERM achieves an AUC of 0.93, sensitivity and specificity of 85% and 85%, respectively, when using the estimated optimum value of 0.28. This is higher than naked-eye visual assessment (0.88, 80% and 71%), and similar to findings for dermatologists with dermoscopy (0.91, 85% and 82%). This is illustrated by plotting a ROC curve of the data from studies in the meta-analysis, and superimposing the DERM data from 4 cut-points (Figures 6 and 7).
A recent comprehensive series of Cochrane reviews concluded that visual inspection alone had a specificity of 42% at a fixed sensitivity of 80% and a sensitivity of 76% at a fixed specificity of 80%, whereas dermoscopy plus visual inspection had a specificity of 92% at a fixed sensitivity of 80% and a sensitivity of 82% at a fixed specificity of 80% (0.83 vs 0.91) (Figure 7). There was no association between the AUC and year of study publication, suggesting that diagnostic accuracy is not improving over time (P = 0.63).
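Figures such as "specificity at a fixed sensitivity of 80%" are read off a ROC curve. The sketch below does this by linear interpolation between hypothetical ROC points; this is a simplification, since a meta-analysis would normally read these values from a fitted summary ROC model. All names and numbers here are illustrative assumptions.

```python
def interp_y_at_x(points, target_x):
    """Linearly interpolate y at target_x along points sorted by x."""
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= target_x <= x1:
            if x1 == x0:
                return max(y0, y1)  # vertical segment: take the upper value
            weight = (target_x - x0) / (x1 - x0)
            return y0 + weight * (y1 - y0)
    raise ValueError("target_x lies outside the range of the curve")


def sensitivity_at_specificity(roc_points, specificity):
    """Sensitivity at a fixed specificity; points are (FPR, sensitivity)."""
    return interp_y_at_x(roc_points, 1.0 - specificity)


def specificity_at_sensitivity(roc_points, sensitivity):
    """Specificity at a fixed sensitivity, found by swapping the ROC axes."""
    swapped = [(tpr, fpr) for fpr, tpr in roc_points]
    return 1.0 - interp_y_at_x(swapped, sensitivity)


# Hypothetical ROC curve as (false positive rate, sensitivity) pairs.
roc_curve_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]
print(sensitivity_at_specificity(roc_curve_points, 0.80))
print(specificity_at_sensitivity(roc_curve_points, 0.80))
```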

Discussion
Summary
Herewith we present an extensive evaluation of the ability of DERM to identify MM from dermoscopic images of skin lesions. This preliminary analysis demonstrates the ability of an AI-based system to learn features of a skin lesion that are associated with MM, which can then be applied to the identification of MM. We conducted a meta-analysis of MM diagnostic accuracy to generate comparative values from current primary care and specialist dermatologist practices. These
The number of estimates exceeds the number of studies because multiple estimates are made using dermoscopy with alternative diagnostic algorithms. CI = confidence interval; sROC = summary receiver operating characteristic.

Strengths and Limitations
We trained our algorithm using archived images that have been published to train clinicians. It is likely that biases exist in the datasets (eg, patient demographics, MM subtypes, image capture methods), but it is very difficult to determine whether such biases exist and thus have been introduced into DERM during its development. In addition, it must be emphasized that the algorithm was trained predominantly using images of images rather than images created in a clinical setting. We are currently collecting such images during a clinical trial and plan to report the results in the near future.
[45]. Our meta-analysis showed for visual inspection alone specificity of 83% when sensitivity was 80%; sensitivity of 78% when specificity was 80%; specificity of 86% when sensitivity was 80%; and sensitivity of 87% when specificity was 80%. DERM gave comparable indices of specificity of 89% at sensitivity of 80% and a sensitivity of 90% at specificity of 80%.

Conclusions
Our study demonstrates the ability of an AI-based system to learn features of a skin lesion photograph that are associated with MM. DERM has the potential to be used in primary care to provide dermatologist-grade decision support. It is too early to say whether deployment of DERM would reduce onward referral, but such clinical validation is ongoing.
By using postbiopsy histology as the gold standard for both DERM and the inclusion criteria for our meta-analysis, images of nonsuspicious lesions have not been included when training or evaluating DERM. We have therefore not shown the ability of DERM (or clinicians) to accurately classify nonsuspicious lesions, which could lead to verification bias as was observed by a study of cancer registry data during a prospective follow-up [46]. However, this bias will apply to both the evaluation of DERM and the meta-analysis results, so it seems unlikely that the comparison of the 2 would be affected, but it remains a possibility.
A strength of our study is that the meta-analysis of naked-eye examination and dermoscopy, the most common current diagnostic methods for MM used in primary care, is based on evaluation of 32,226 pigmented lesions, including 3,277 histopathology-confirmed MM cases.

Comparison With Existing Literature
Recently, 2 other groups who retooled versions of Google's Inception network for the identification of melanoma showed accuracy equivalent to or better than that of a panel of dermatologists [22,23]. However, this approach is likely to generate issues such as overfitting (because of the small size of the review panel) and a lack of generalization (because of the selected nature of the voluntary reviewers).
A recent addition to the literature was the publication of an extensive systematic review by the Cochrane Collaboration skin group [45]. Four studies were conducted on melanoma diagnosis in adults by visual inspection, dermoscopy with and without visual inspection, reflectance confocal microscopy, and smartphone applications for triaging suspicious lesions. The dates of publication were slightly different from our study dates (up to August 2016 compared with September 2017), they searched more databases, and they did not limit themselves to histology-confirmed pathology as the diagnostic outcome but also included clinical follow-up of benign-appearing lesions, cancer registry follow-up, and "expert opinion with no histology or follow-up." Despite these differences, the number of studies is very similar. We identified 108 studies (29 visual and 79 dermoscopy) and they identified 104 (24 visual and 86 dermoscopy).

Implications for Research and Practice
Using different cut-points at which DERM defines a lesion as MM, the sensitivity ranged from 85.0% to 98.6% and the specificity from 85.3% to 62.9%. The cut-points calculated by the Youden and Liu methods assume that