Task 3: Lesion Diagnosis
Submit automated predictions of disease classification within dermoscopic images.
Possible disease categories are:
The input data are dermoscopic lesion images in JPEG format.
All lesion images are named using the scheme
is a 7-digit unique identifier. EXIF tags in the images have been removed; any remaining EXIF tags should not be relied upon to provide accurate metadata.
The lesion images come from the HAM10000 Dataset, and were acquired with a variety of dermatoscope types, from all anatomic sites (excluding mucosa and nails), from a historical sample of patients presented for skin cancer screening, from several different institutions. Images were collected with approval of the Ethics Review Committee of University of Queensland (Protocol-No. 2017001223) and Medical University of Vienna (Protocol-No. 1804/2017).
The distribution of disease states represent a modified “real world” setting whereby there are more benign lesions than malignant lesions, but an over-representation of malignancies.
The response data are sets of binary classifications for each of the 7 disease states, indicating the diagnosis of each input lesion image.
Response data are all encoded within a single CSV file (comma-separated value) file, with each classification response in a row. File columns must be:
image: an input image identifier of the form
MEL: “Melanoma” diagnosis confidence
NV: “Melanocytic nevus” diagnosis confidence
BCC: “Basal cell carcinoma” diagnosis confidence
AKIEC: “Actinic keratosis / Bowen’s disease (intraepithelial carcinoma)” diagnosis confidence
BKL: “Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)” diagnosis confidence
DF: “Dermatofibroma” diagnosis confidence
VASC: “Vascular lesion” diagnosis confidence
Diagnosis confidences are expressed as floating-point values in the closed interval
[0.0, 1.0], where
0.5 is used as the binary classification threshold. Note that arbitrary score ranges and thresholds can be converted to the range of
1.0, with a threshold of
0.5, trivially using the following sigmoid conversion:
1 / (1 + e^(-(a(x - b))))
x is the original score,
b is the binary threshold, and
a is a scaling parameter (i.e. the inverse measured standard deviation on a held-out dataset). Predicted responses should set the binary threshold
b to a value where the classification system is expected to achieve 89% sensitivity, although this is not required.
Predicted diagnosis confidence values may vary independently, though exactly one disease state is actually present in each input lesion image.
Ground Truth Provenance
As detailed in the HAM10000 Dataset description, diagnosis ground truth (provided for training and used internally for scoring validation and test phases) were established by one of the following methods:
- Reflectance confocal microscopy
- Lesion did not change during digital dermatoscopic follow up over two years with at least three images
- Consensus of at least three expert dermatologists from a single image
In all cases of malignancy, disease diagnoses were histopathologically confirmed.
Predicted responses are scored using a normalized multi-class accuracy metric (balanced across categories). Tied positions will be broken using the area under the receiver operating characteristic curve (AUC) metric.
Clinical application on skin lesion classification has two goals eventually: Giving specific information and treatment options for a lesion, and detecting skin cancer with a reasonable sensitivity and specificity. The first task needs a correct specific diagnosis out of multiple classes, whereas the second demands a binary decision “biopsy” versus “don’t biopsy”. In the former ISIC challenges, focus was on the second task, therefore this year we want to rank for the more ambitious metric of normalized multiclass accuracy, as it is also closer to real evaluation of a dermatologist. This is also important for the extending reader study, where the winning algorithm(s) will be compared to physicians performance in classification of digital images.
Participants will be ranked and awards granted based only on the multiclass accuracy metric. However, for scientific completeness, predicted responses will also have the following metrics computed (comparing prediction vs. ground truth) for each image:
Individual Category Metrics
- area under the receiver operating characteristic curve (AUC)
- mean average precision
- F1 score
- AUC integrated between 80% to 100% sensitivity (AUC80) for Melanoma diagnosis only
- positive predictive value (PPV)
- negative predictive value (NPV)
- average AUC across all diagnoses
- malignant vs. benign diagnoses category AUC
To participate in this task:
- Download the training input data and training ground truth response data.
- Develop an algorithm for generating lesion diagnosis classifications in general.
- Validate (optional)
- Download the validation input data.
- Run your algorithm on the validation Input data to produce validation predicted responses.
- Submit these validation predicted responses to receive an immediate score. This will provide feedback that your predicted responses have the correct data format and have reasonable performance. You may make unlimited submissions.
- Download the test input data.
- Run your algorithm on the test input data to produce test predicted responses.
- Submit these test predicted responses. You may submit a maximum of 3 separate approaches/algorithms to be evaluated independently. You may make unlimited submissions, but only the most recent submission for each approach will be used for official judging. Use the “brief description of your algorithm’s approach” field on the submission form to distinguish different approaches. Previously submitted approaches are available in the dropdown menu.
- Submit a manuscript describing your algorithm’s approach.