"<ipython-input-5-4c0ddd092066>:5: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n",
"C:\\Users\\Bennett Nolan\\AppData\\Local\\Temp\\ipykernel_10700\\2944415004.py:5: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n",
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
%% Cell type:markdown id:caa99aac tags:
# Introduction
The project we are presenting is a multi-label image classification task based on the 2018 Human vs Machine skin lesion analysis toward melanoma detection hosted by the International Skin Imaging Collaboration (ISIC).
This notebook will contain the following sections:
1. Problem Definition & Data Description
2. Data Preparation
3. Exploratory Analysis
4. Data Processing for Model Ingestion
5. Model Creation
6. Model Scoring & Evaluation
7. Interpretation of Results
%% Cell type:markdown id:830dee53 tags:
# 1. Problem Definition & Data description
#### Problem Definition:
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available dataset of dermatoscopic images. With a sufficiently large and diverse collection of skin lesions we will develop a method to automate the prediction of disease classification within dermoscopic images. The project is meant to human computer computer collaboration and not intended to replace traditional forms of diagnosis.
Possible disease categories (and abbreviation) for classification are:
1. Melanoma (mel)
2. Melanocytic Nevus (nv)
3. Basal Cell Carcinoma (bcc)
4. Actinic Keratosis / Bowen's Disease (akiec)
5. Benign Keratosis (bkl)
6. Dermatofibroma (df)
7. Vascular Lesion (vasc)
#### Data Description
- Data images are in JPEG format using the naming scheme `ISIC_.jpg` where _ is a 7 digit unique identifier of the image.
- There are a total of 10,015 600x450 pixel color images contained in the training data folder.
- There are a total of 193 600x450 pixel color images contained in the validation data folder.
- The training metadata is a 10,015x8 .csv file containing the following variables*:
- lesion_id: Unique identifier of a legion with multiple images.
- image_id: Unique identifier of the associated image file.
- dx: Prediction label containing the 7 abbreviated disease categories.
- dx_type: Method of how the diagnosis was confirmed.
- age: Numeric year age of the patient.
- sex: String binary value 'male' or 'female'.
- localization: Categorical location on the body the image was taken.
- dataset: Image source.
*Further details of the data will be provided in the data preparation section.
%% Cell type:markdown id:96ff082e tags:
# 2. Data Preparation
#### Step 1. Load and Sort
First we will load the data using the function `load_sort_data()`.
The `load_sort_data()` function sorts the images into folders based on the diagnosis label. This will help reduce the overall size of the dataset and make preprocessing the images much faster. The function will return the metadata as a pandas DataFrame and the path of the sorted image folders.
%% Cell type:code id:b8c4f292 tags:
``` python
# function takes 3 parameters: metadata filename, the folder of the raw images, and the desired name of the destination directory.
# We will need to change the Dtypes of the columns into the expected types
metadata.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10015 entries, 0 to 10014
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 image_id 10015 non-null object
1 dx 10015 non-null object
2 age 9958 non-null float64
3 sex 10015 non-null object
4 localization 10015 non-null object
dtypes: float64(1), object(4)
memory usage: 391.3+ KB
%% Cell type:markdown id:3f4a3578 tags:
As we can see below, we have a total of 57 NA values in age. When looking at the distribution of NA values, only our largest quantity of labels have NA's. The age variable is only useful in providing context to our problem and will not be used as a feature in our model. Therefore it is not necessary to do anything further to the NA values. During exploratory analysis we can deal with the NA values as needed.
%% Cell type:code id:e6d378d5 tags:
``` python
# Sum the na values contained within each label
print("Total number of unique labels\n",
metadata['dx'].value_counts(),
"\nNumber of NaN values within each label\n",
metadata.drop('dx',1).isna().groupby(
metadata.dx,
dropna=False,
observed=True
).sum().reset_index()
)
```
%% Output
Total number of unique labels
nv 6705
mel 1113
bkl 1099
bcc 514
akiec 327
vasc 142
df 115
Name: dx, dtype: int64
Number of NaN values within each label
dx image_id age sex localization
0 akiec 0 0 0 0
1 bcc 0 0 0 0
2 bkl 0 10 0 0
3 df 0 0 0 0
4 mel 0 2 0 0
5 nv 0 45 0 0
6 vasc 0 0 0 0
<ipython-input-5-4c0ddd092066>:5: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
C:\Users\Bennett Nolan\AppData\Local\Temp\ipykernel_10700\2944415004.py:5: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
metadata.drop('dx',1).isna().groupby(
%% Cell type:code id:91aa284b tags:
``` python
#Changing datatypes
dtypes={'image_id':'string',
'dx':'category',
'sex':'category',
'localization':'category'
}
metadata=metadata.astype(dtypes)
metadata.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10015 entries, 0 to 10014
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 image_id 10015 non-null string
1 dx 10015 non-null category
2 age 9958 non-null float64
3 sex 10015 non-null category
4 localization 10015 non-null category
dtypes: category(3), float64(1), string(1)
memory usage: 187.1 KB
%% Cell type:markdown id:41f467b5 tags:
#### Step 3. Image Processing
In this step we will construct an NxM matrix where N is an image and M is the number of pixels in the image.
%% Cell type:code id:05398a91 tags:
``` python
#Assign vectorized images to variables
akiec_images=transform(dest_dir+'akiec')
```
%% Cell type:code id:e8642d8d tags:
``` python
bcc_images=transform(dest_dir+'bcc')
```
%% Cell type:code id:5312b5de tags:
``` python
bkl_images=transform(dest_dir+'bkl')
```
%% Cell type:code id:49338970 tags:
``` python
df_images=transform(dest_dir+'df')
```
%% Cell type:code id:784d69cd tags:
``` python
mel_images=transform(dest_dir+'mel')
```
%% Cell type:code id:6cd167a7 tags:
``` python
#This takes a really long time to run even when cutting down the images size.