Commit 54aa6182 authored by pjm363 (Philip Monaco)

Merge branch '11-installation-instructions' into 'main'

add updated README.md

Closes #11

See merge request !9
parents c9f75671 461e9e8b
# Introduction
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).
# Prerequisites for Use
1. Download the training data [Here](https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Training_Input.zip)
2. Download the training ground truth [Here](https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Training_GroundTruth.zip)
3. Download the test data images [Here](https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Validation_Input.zip)
4. Download the test data ground truth [Here](https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Validation_GroundTruth.zip)
5. Unzip the files into the data folder in this repository.
6. Download required packages from requirements.txt, "pip install -r requirements.txt"

All of the data will need to be placed in directory: `hvm-image-clf/data`
First we will need to download 3 data files used by the package. The first two can be downloaded by clicking the links below.
[Training Data](https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Training_Input.zip) - Unzip and place the folder named "ISIC2018_Task3_Training_Input" in the data directory. This contains all of the training images.
[Ground Truth](https://isic-challenge-data.s3.amazonaws.com/2018/ISIC2018_Task3_Training_GroundTruth.zip) - Unzip and place the file named "ISIC2018_Task3_Training_GroundTruth.csv" in the data directory. This contains the ground truth labels for the training images.
The third dataset can be downloaded by following this link:
[Metadata](https://dataverse.harvard.edu/file.xhtml?fileId=4338392&version=3.0). This will bring you to a website called Harvard Dataverse. On the page you will see a dropdown box called "Access File". Select the option called "Comma Separated Values (Original File Format)" to download the dataset.
Project dependencies can be installed using
`pip install -r requirements.txt`
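Before opening the notebook, it can help to confirm that the downloads landed where the code expects them. The following is a minimal sketch and not part of the repository: the helper name `missing_data_files` is hypothetical, the first two entries follow the zip file names linked above, and the metadata filename `HAM10000_metadata.csv` is an assumption based on the `load_sort_data('HAM10000_metadata', ...)` call in the notebook.

```python
from pathlib import Path

# Expected contents of the data directory. The metadata filename is an
# assumption; adjust it to match the file downloaded from Harvard Dataverse.
EXPECTED = [
    "ISIC2018_Task3_Training_Input",            # folder of training images
    "ISIC2018_Task3_Training_GroundTruth.csv",  # ground truth labels
    "HAM10000_metadata.csv",                    # assumed metadata filename
]


def missing_data_files(data_dir):
    """Return the expected entries that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_data_files("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```

Run it from the repository root so that the relative `data` path resolves to `hvm-image-clf/data`.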
# Running the notebook.
If all of the prerequisites are set up correctly, the notebook file can be run using an IPython or Anaconda-distributed notebook IDE.
%% Cell type:markdown id:441343d0-e422-4dae-b988-9130e5a0d565 tags:
## Packages
os: Operating system interface
OpenCV: cv2 `pip install "opencv-python>=4.5.5"`
%% Cell type:code id:d7e56e0e-7eec-429d-940b-c3337db4b4dc tags:
``` python
%load_ext autoreload
%autoreload 2
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import importlib as lib
from data_processing import load_sort_data, transform
import EDA
import matplotlib.pyplot as plt
%matplotlib inline
```
%% Cell type:markdown id:caa99aac tags:
# Introduction
The project we are presenting is a multi-class image classification task based on the 2018 Human vs Machine skin lesion analysis toward melanoma detection challenge hosted by the International Skin Imaging Collaboration (ISIC).
This notebook will contain the following sections:
1. Problem Definition & Data Description
2. Data Preparation
3. Exploratory Analysis
4. Data Processing for Model Ingestion
5. Model Creation
6. Model Scoring & Evaluation
7. Interpretation of Results
%% Cell type:markdown id:830dee53 tags:
# 1. Problem Definition & Data Description
#### Problem Definition:
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. With a sufficiently large and diverse collection of skin lesions, we will develop a method to automate the prediction of disease classification within dermoscopic images. The project is meant to support human-computer collaboration and is not intended to replace traditional forms of diagnosis.
Possible disease categories (and abbreviation) for classification are:
1. Melanoma (mel)
2. Melanocytic Nevus (nv)
3. Basal Cell Carcinoma (bcc)
4. Actinic Keratosis / Bowen's Disease (akiec)
5. Benign Keratosis (bkl)
6. Dermatofibroma (df)
7. Vascular Lesion (vasc)
#### Data Description
- Data images are in JPEG format using the naming scheme `ISIC_<id>.jpg`, where `<id>` is a 7-digit unique identifier of the image.
- There are a total of 10,015 600x450 pixel color images contained in the training data folder.
- There are a total of 193 600x450 pixel color images contained in the validation data folder.
- The training metadata is a 10,015x8 .csv file containing the following variables*:
  - lesion_id: Unique identifier of a lesion; one lesion may have multiple images.
  - image_id: Unique identifier of the associated image file.
  - dx: Prediction label; one of the 7 abbreviated disease categories.
  - dx_type: Method by which the diagnosis was confirmed.
  - age: Numeric age of the patient in years.
  - sex: String value 'male', 'female', or 'unknown'.
  - localization: Categorical location on the body where the image was taken.
  - dataset: Image source.
*Further details of the data will be provided in the data preparation section.
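The naming scheme described above can be checked mechanically, which is useful when matching image files to `image_id` values in the metadata. This is a minimal sketch; the helper name `image_id_from_filename` is hypothetical and not part of the project code.

```python
import re

# Pattern for the naming scheme described above: "ISIC_" + 7 digits + ".jpg".
IMAGE_NAME = re.compile(r"ISIC_(\d{7})\.jpg")


def image_id_from_filename(filename):
    """Return the image_id (e.g. 'ISIC_0027419') or None if the name does not match."""
    match = IMAGE_NAME.fullmatch(filename)
    return f"ISIC_{match.group(1)}" if match else None


print(image_id_from_filename("ISIC_0027419.jpg"))  # ISIC_0027419
print(image_id_from_filename("notes.txt"))         # None
```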
%% Cell type:markdown id:96ff082e tags:
# 2. Data Preparation
#### Step 1. Load and Sort
First we will load the data using the function `load_sort_data()`.
The `load_sort_data()` function sorts the images into folders based on the diagnosis label. This will help reduce the overall size of the dataset and make preprocessing the images much faster. The function will return the metadata as a pandas DataFrame and the path of the sorted image folders.
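The implementation of `load_sort_data()` lives in the project's `data_processing` module and is not shown here. As an illustration of the sorting idea only, the sketch below copies each image into a folder named after its diagnosis label using the standard library. The function name `sort_images_by_label`, the `<image_id>.jpg` file layout, and the use of copying rather than moving are all assumptions.

```python
import csv
import shutil
from pathlib import Path


def sort_images_by_label(metadata_csv, image_dir, dest_prefix):
    """Copy each image into a folder named dest_prefix + label.

    Assumes the CSV has 'image_id' and 'dx' columns and that images are
    stored as <image_id>.jpg inside image_dir. Returns the number of
    images copied.
    """
    image_dir = Path(image_dir)
    copied = 0
    with open(metadata_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            source = image_dir / f"{row['image_id']}.jpg"
            if not source.exists():
                continue  # skip metadata rows without a matching image
            target_dir = Path(f"{dest_prefix}{row['dx']}")
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target_dir / source.name)
            copied += 1
    return copied
```

With a prefix such as `data/Training_Images_`, this produces folders like `data/Training_Images_akiec`, which matches how the notebook later builds paths with `dest_dir + 'akiec'`.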
%% Cell type:code id:b8c4f292 tags:
``` python
# function takes 3 parameters: metadata filename, the folder of the raw images, and the desired name of the destination directory.
metadata, dest_dir = load_sort_data('HAM10000_metadata', 'ISIC2018_Task3_Training_Input', 'Training_Images_')
```
%% Cell type:code id:7e9702c3 tags:
``` python
# The path of our training image folders sorted by label
dest_dir
```
%% Output
'c:\\Users\\Bennett Nolan\\Desktop\\info442\\hvm-image-clf/data/Training_Images_'
%% Cell type:markdown id:20b8415f tags:
#### Step 2. Tidy Metadata
We will now take steps to tidy our metadata.
First, we subset the variables we intend to use; next, we analyze missingness; finally, we convert the columns to their expected datatypes.
%% Cell type:code id:0ba9148a tags:
``` python
# Subsetting into variables we will use.
metadata = metadata[['image_id', 'dx', 'age', 'sex', 'localization']]
# We will need to change the Dtypes of the columns into the expected types
metadata.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10015 entries, 0 to 10014
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 image_id 10015 non-null object
1 dx 10015 non-null object
2 age 9958 non-null float64
3 sex 10015 non-null object
4 localization 10015 non-null object
dtypes: float64(1), object(4)
memory usage: 391.3+ KB
%% Cell type:markdown id:3f4a3578 tags:
As the `info()` output above shows, `age` has a total of 57 NA values. The breakdown below shows that only the three most frequent labels (nv, mel, and bkl) contain NAs. The age variable is only useful for providing context to our problem and will not be used as a feature in our model, so no further handling of the NA values is necessary here. During exploratory analysis we can deal with them as needed.
%% Cell type:code id:e6d378d5 tags:
``` python
# Sum the NA values contained within each label
print("Total number of unique labels\n",
      metadata['dx'].value_counts(),
      "\nNumber of NaN values within each label\n",
      # Pass `columns=` by keyword; the positional axis argument to drop() is deprecated.
      metadata.drop(columns='dx').isna().groupby(
          metadata.dx,
          dropna=False,
          observed=True
      ).sum().reset_index()
      )
```
%% Output
Total number of unique labels
nv 6705
mel 1113
bkl 1099
bcc 514
akiec 327
vasc 142
df 115
Name: dx, dtype: int64
Number of NaN values within each label
dx image_id age sex localization
0 akiec 0 0 0 0
1 bcc 0 0 0 0
2 bkl 0 10 0 0
3 df 0 0 0 0
4 mel 0 2 0 0
5 nv 0 45 0 0
6 vasc 0 0 0 0
%% Cell type:code id:91aa284b tags:
``` python
#Changing datatypes
dtypes = {'image_id':'string',
'dx':'category',
'sex':'category',
'localization': 'category'
}
metadata = metadata.astype(dtypes)
metadata.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10015 entries, 0 to 10014
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 image_id 10015 non-null string
1 dx 10015 non-null category
2 age 9958 non-null float64
3 sex 10015 non-null category
4 localization 10015 non-null category
dtypes: category(3), float64(1), string(1)
memory usage: 187.1 KB
%% Cell type:markdown id:41f467b5 tags:
#### Step 3. Image Processing
In this step we will construct an NxM matrix for each label, where N is the number of images and M is the number of pixel values in each flattened image.
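The project's `transform()` function is not shown here, so the following is only a sketch of the flattening step itself on stand-in data. The 225x300 grayscale image size is an assumption, chosen because it is consistent with the 67,500 columns reported for `mel_images` further down.

```python
import numpy as np

# Stand-in data: 4 grayscale images of 225 rows x 300 columns (assumed size).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 225, 300), dtype=np.uint8)

# Flatten each image into one row: N images x M pixel values.
matrix = images.reshape(len(images), -1).astype(np.float32)

print(matrix.shape)  # (4, 67500)
```

Each row can be turned back into an image with `matrix[i].reshape(225, 300)`, which is what the later averaging and PCA steps rely on.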
%% Cell type:code id:05398a91 tags:
``` python
#Assign vectorized images to variables
akiec_images = transform(dest_dir + 'akiec')
```
%% Cell type:code id:e8642d8d tags:
``` python
bcc_images = transform(dest_dir + 'bcc')
```
%% Cell type:code id:5312b5de tags:
``` python
bkl_images = transform(dest_dir + 'bkl')
```
%% Cell type:code id:49338970 tags:
``` python
df_images = transform(dest_dir + 'df')
```
%% Cell type:code id:784d69cd tags:
``` python
mel_images = transform(dest_dir + 'mel')
```
%% Cell type:code id:6cd167a7 tags:
``` python
# This takes a long time to run, even after reducing the image size.
nv_images = transform(dest_dir + 'nv', size=(200, 150))
```
%% Cell type:code id:4de5cec3 tags:
``` python
vasc_images = transform(dest_dir + 'vasc')
```
%% Cell type:code id:d92158fa tags:
``` python
mel_images.shape
```
%% Output
(1113, 67500)
%% Cell type:markdown id:3fb5f03e tags:
# 3. Exploratory Data Analysis
Exploratory analysis will be conducted in two major steps. First we will analyze the metadata, then the image dataset.
%% Cell type:markdown id:6ca3d6bf tags:
#### Step 1: Metadata EDA
We will perform the following analysis on the metadata:
- Summary Statistics
- Class label distributions
- Correlation
%% Cell type:markdown id:686965dd tags:
Summary Statistics
%% Cell type:code id:5d475aed tags:
``` python
metadata.head()
```
%% Output
image_id dx age sex localization
0 ISIC_0027419 bkl 80.0 male scalp
1 ISIC_0025030 bkl 80.0 male scalp
2 ISIC_0026769 bkl 80.0 male scalp
3 ISIC_0025661 bkl 80.0 male scalp
4 ISIC_0031633 bkl 75.0 male ear
%% Cell type:code id:6e579e93 tags:
``` python
metadata.agg({
"age":["min", "max", "median", "mean", "skew"]
})
```
%% Output
age
min 0.000000
max 85.000000
median 50.000000
mean 51.863828
skew -0.166802
%% Cell type:code id:ea361300 tags:
``` python
metadata.groupby("sex")["sex"].count()
```
%% Output
sex
female 4552
male 5406
unknown 57
Name: sex, dtype: int64
%% Cell type:markdown id:e031ebf0 tags:
Class Label Distributions
%% Cell type:markdown id:bf51add8 tags:
Distributions for metadata including Age, Localization, Sex, and Diagnosis
%% Cell type:code id:6681e88c tags:
``` python
fig =plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(221)
metadata['sex'].value_counts().plot(kind='bar', ax=ax1)
ax1.set_title('Sex')
ax2=fig.add_subplot(222)
metadata['localization'].value_counts().plot(kind='bar', ax=ax2)
ax2.set_title('Localization')
```
%% Output
Text(0.5, 1.0, 'Localization')
%% Cell type:code id:f6596829 tags:
``` python
fig =plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(221)
# Sort by age so the bars follow the age axis rather than descending frequency.
metadata['age'].value_counts().sort_index().plot(kind='bar', ax=ax1)
ax1.set_title('Age')
ax2=fig.add_subplot(222)
metadata['dx'].value_counts().plot(kind='bar', ax=ax2)
ax2.set_title('Diagnosis')
```
%% Output
Text(0.5, 1.0, 'Diagnosis')
%% Cell type:markdown id:8858805d tags:
Correlation
%% Cell type:markdown id:fbc246dd tags:
Cross Tabulation of Age and Dx (Skin Lesion)
- nv = Melanocytic nevi
- mel = Melanoma
- bkl = Benign keratosis-like lesions
- bcc = Basal cell carcinoma
- akiec = Actinic keratosis
- vasc = Vascular lesions
- df = Dermatofibroma
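
The cross tabulations in this section are tested for independence with a chi-square test further down. As a reference for what that test computes, here is a minimal sketch of the Pearson statistic and its degrees of freedom in plain Python. The helper name `chi_square_statistic` is hypothetical, and it does not compute a p-value; `scipy.stats.chi2_contingency`, which the notebook uses, does that.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows.

    No continuity correction is applied, so for 2x2 tables the value differs
    from scipy's default, which applies Yates' correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy 2x2 table: rows are groups, columns are outcomes.
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
print(round(stat, 3), dof)  # 6.667 1
```

The sketch assumes every row and column total is non-zero; a table with an empty row or column would need that row or column removed first.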
%% Cell type:code id:a0602660 tags:
``` python
ct = pd.crosstab(index=metadata['age'], columns=metadata['dx'])
print(ct)
```
%% Output
dx akiec bcc bkl df mel nv vasc
age
0.0 0 0 5 0 0 30 4
5.0 0 0 1 0 1 81 3
10.0 0 0 0 0 0 39 2
15.0 0 0 0 0 0 73 4
20.0 0 1 0 0 6 158 4
25.0 0 3 0 2 16 221 5
30.0 1 4 6 4 34 410 5
35.0 0 5 24 12 36 668 8
40.0 9 23 46 9 49 846 3
45.0 10 26 59 14 74 1100 16
50.0 19 27 87 18 96 928 12
55.0 27 25 95 13 142 686 21
60.0 58 35 131 9 106 454 10
65.0 38 79 108 18 133 351 4
70.0 56 85 183 4 166 248 14
75.0 47 76 153 9 91 231 11
80.0 37 73 98 3 85 97 11
85.0 25 52 93 0 76 39 5
%% Cell type:code id:0a82f299 tags:
``` python
ct2 = pd.crosstab(index=metadata['sex'], columns=metadata['dx'])
print(ct2)
```
%% Output
dx akiec bcc bkl df mel nv vasc
sex
female 106 197 463 52 424 3237 73
male 221 317 626 63 689 3421 69
unknown 0 0 10 0 0 47 0
%% Cell type:code id:ad813357 tags:
``` python
from scipy.stats import chi2_contingency
# Sex & Dx (ct2 cross-tabulates sex against diagnosis)
chi2= chi2_contingency(ct2)
print("PValue = " , chi2[1])
```
%% Output
PValue = 2.4464388098587195e-17
%% Cell type:code id:20474e6b tags:
``` python
#Age & DX
chi2_2= chi2_contingency(ct)
print("PValue = " , chi2_2[1])
```
%% Output
PValue = 0.0
%% Cell type:markdown id:864a5b70 tags:
#### Step 2: Image EDA
We will perform the following analysis on the image data:
- Average image of each label.
- Contrast between the average images.
- Principal component analysis (PCA) on each label.
%% Cell type:markdown id:b80a2f2b tags:
Average Image
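The project's `EDA.find_mean_img()` is not shown here. A minimal sketch of the underlying idea, assuming each row of the matrix is one flattened image, is to average over the rows and reshape the result back to the image dimensions. The helper name `mean_image` is hypothetical.

```python
import numpy as np


def mean_image(matrix, shape):
    """Average the rows of an N x M matrix and reshape the result to `shape`."""
    return matrix.mean(axis=0).reshape(shape)


# Stand-in data: 3 flattened 2x2 "images".
flat = np.array([[0, 0, 0, 0],
                 [3, 3, 3, 3],
                 [6, 6, 6, 6]], dtype=np.float32)

average = mean_image(flat, (2, 2))
print(average)
```

For the real data, `shape` must multiply out to the number of columns in the matrix, for example a shape with 67,500 elements for `mel_images`.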
%% Cell type:code id:971b33b7 tags:
``` python
EDA.find_mean_img(bkl_images, "bkl Images")
```
%% Output
array([[121.11101 , 122.14468 , 122.84349 , ..., 119.93995 , 118.71974 ,
117.54959 ],
[122.58053 , 123.72703 , 124.56142 , ..., 121.46952 , 119.99272 ,
119.01729 ],
[122.78344 , 124.01911 , 124.747955, ..., 121.76524 , 120.1738 ,
119.262054],
...,
[121.69427 , 122.42311 , 123.208374, ..., 116.711555, 115.545044,
114.91993 ],
[121.50591 , 122.17652 , 123.054596, ..., 116.464966, 115.33849 ,
114.60964 ],
[120.98453 , 121.68972 , 122.56961 , ..., 115.86988 , 114.70428 ,
113.95268 ]], dtype=float32)
%% Cell type:markdown id:3acaca8b tags:
Contrast Between Images
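The code cell for this step is still empty. One simple way to contrast two classes, offered here as an assumption rather than the authors' method, is to subtract their mean images. The helper name `contrast_image` is hypothetical.

```python
import numpy as np


def contrast_image(matrix_a, matrix_b, shape):
    """Difference between the mean images of two N x M matrices.

    Both matrices must have the same number of columns (same image size).
    """
    difference = matrix_a.mean(axis=0) - matrix_b.mean(axis=0)
    return difference.reshape(shape)


# Stand-in data: two classes of flattened 2x2 "images".
class_a = np.full((2, 4), 10.0, dtype=np.float32)
class_b = np.full((5, 4), 4.0, dtype=np.float32)

contrast = contrast_image(class_a, class_b, (2, 2))
print(contrast)
```

Because the subtraction is element-wise, both classes need the same image size. In this notebook `nv_images` is built with `size=(200, 150)`, so it would likely need to be regenerated at the size of the class it is compared with.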
%% Cell type:code id:39497a39 tags:
``` python
```
%% Cell type:markdown id:e327218b tags:
Principal Component Analysis.
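The project's `EDA.eigenimages()` and `EDA.plot_pca()` are not shown here. As a rough sketch of how eigenimages can be obtained, the code below centers the N x M matrix and takes its singular value decomposition, keeping as many components as are needed to reach a target share of the variance. The helper name `eigenimages_svd` and the 70% threshold are assumptions, not the project's settings.

```python
import numpy as np


def eigenimages_svd(matrix, variance_threshold=0.7):
    """Return the principal components needed to reach the variance threshold.

    matrix is N x M (one flattened image per row). The first return value
    has one component per row, each of length M; the second holds the share
    of variance explained by each kept component.
    """
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    reached = int(np.searchsorted(np.cumsum(explained), variance_threshold)) + 1
    n_components = min(reached, len(explained))
    return vt[:n_components], explained[:n_components]


rng = np.random.default_rng(0)
# Stand-in data: 20 flattened "images" that vary mostly along the first pixel.
data = (rng.normal(size=(20, 1)) * np.array([[3.0, 0.0, 0.0, 0.0]])
        + 0.01 * rng.normal(size=(20, 4)))
components, explained = eigenimages_svd(data)
print("Number of PC:", len(components))
```

Each returned component can be reshaped to the image dimensions and displayed like any other image.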
%% Cell type:code id:ee96ea64 tags:
``` python
EDA.plot_pca(EDA.eigenimages(bkl_images, "bkl images"))
```
%% Output
Number of PC: 6
%% Cell type:markdown id:3157d03a tags:
# 4. Data Processing for Model Ingestion
%% Cell type:markdown id:64adf033 tags:
# 5. Model Creation
%% Cell type:markdown id:c3114115 tags:
# 6. Model Scoring & Evaluation
%% Cell type:markdown id:38d22b6e tags:
# 7. Interpretation of Results