Implementation of Automatic Skin Lesions Diagnosis: A Deep Learning Ensembling Approach

Skin is the largest organ of the human body; it protects the body from the external environment. However, the number of new skin cancer cases has been rising in recent years. Fortunately, diagnosing skin cancer at an earlier stage increases the probability of successful treatment. Because machine learning (ML) and deep learning (DL) have achieved great success in different fields and shown outstanding performance in computer vision applications, many ML applications now target this domain. The main goal of this study is to build a computer-aided diagnosis system that is effective for early detection, saving both time and money. A simple pipeline is used to train different types of convolutional neural networks (CNNs), choose the best four networks, and ensemble them together without any prior segmentation or hair repainting. Using the HAM10000 dataset for training and testing, a highly accurate model is achieved that can classify seven kinds of skin lesions very precisely.


Introduction
Skin cancer has become more common nowadays because of the ozone hole, global warming, and high levels of pollution, especially in developing countries. According to American Cancer Society statistics for 2019, the number of patients is increasing: 96,480 new melanomas (the deadliest skin cancer) were expected to be diagnosed and 7,230 people were expected to die of melanoma [1]. Dermatologists prefer to start with non-invasive methods to assess lesions (such as the ABCDE rule, the 7-point checklist, and other methods) to avoid excisional biopsies, which are invasive, painful, costly, and time-consuming. Dermatoscopy is a non-invasive examination technique widely used by dermatologists to improve diagnosis beyond examination with the naked eye [2]. Figure 1 displays some dermoscopic images of different kinds of lesions.
Due to the imbalance between the numbers of dermatologists and patients, and the huge amounts of money spent annually on treatment, computer-aided diagnosis (CAD) systems have become much more involved in this domain. In the last few years, deep learning has taken a significant role in healthcare applications, and especially in skin lesion applications, owing to the increasing number of available datasets, which can be classified as medium-size datasets. Indeed, the state-of-the-art image-based CAD systems have achieved significant results comparable to those of expert dermatologists with more than 10 years of experience. Moreover, researchers have compared machine-learning predictions with dermatologists' diagnoses on the HAM10000 dataset and found that the best machine-learning algorithms achieved a mean of 6.65 and 7.94 more correct diagnoses than experts and average human readers, respectively [2].
Regarding previous work in this domain, and because of the complex surface of the skin, the diagnosis process can be split into many steps, starting with image acquisition and ending with classification. Image-processing techniques are used to remove noise from images or to extract the best features, such as DullRazor and other techniques for removing hair and rulers [3]-[5], resizing images to fit the networks, removing black boundaries [6], colour-space transformation [7], etc. In addition, image segmentation is a very important step that removes the normal skin area, which is not required. There are three approaches to segmentation: the first uses image-processing techniques such as Otsu thresholding [8], active contours, GrabCut [9], and relative fuzzy membership degree (RFMD) [4,10]. The second approach uses deep learning to train networks that can segment the lesion effectively, such as U-Net, ResUNet [11,12], etc. The third approach uses hybrid techniques, such as YOLO combined with GrabCut [9]. Attribute detection is also used to detect attributes of the lesion such as globules and streaks; deep CNNs such as Multi-Task U-Net are used for this task [13]. Finally, for the classification task, one approach is to extract some features manually from the images and then feed these features to a neural network [5] or to other classification methods such as KNN, SVM, or decision trees [14,15]. Another approach is to take CNNs that performed well in other fields and fine-tune them to fit the skin cancer datasets [16,17], or to use those pre-trained networks to extract features from the dataset and classify them using classical methods such as SVM [15,18]. Furthermore, ensembling techniques have performed better than the others [15]-[17], [19].
The aim is to develop an automatic system that addresses the techniques, technologies, and terminology used in skin imaging to help dermatologists diagnose seven kinds of skin lesions. However, due to the complexity of the dataset, existing models still lack accuracy, although much effort has been applied to find the best solution to this problem. Therefore, the main objective of this study is to find a highly accurate model that can classify these seven skin lesions very precisely without applying any kind of segmentation or de-noising to the images.
The proposed idea is to find multiple networks that perform well on the task of skin lesion classification and then fuse them together to get better results. The first step is implementing some of the convolutional neural networks that have performed well on different classification tasks, then retraining them on the HAM10000 data, adding class weights to deal with the imbalanced data and observing their behaviour on this data. After training, it is clear that each network is more sensitive to some classes than to others. The networks are chosen according to their sensitivity across all classes. Then the networks are ensembled together by averaging them. The final model is tested using 2,003 images from the holdout test set (20% of the dataset), and the accuracy is found to increase by 3% to 4% compared with the accuracy of each network individually.

While many previous studies classify lesions only as malignant or benign, this study classifies seven kinds of skin lesions, which makes the task more complex. The dataset used in this study is HAM10000; it contains seven classes: actinic keratosis / Bowen's disease (intraepithelial carcinoma) (AKIEC), melanoma (MEL), melanocytic nevus (NV), benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis) (BKL), dermatofibroma (DF), basal cell carcinoma (BCC), and vascular lesion (VASC), with more than 10,000 dermoscopic images. More than 50% of the lesions are confirmed through histopathology; the ground truth for the rest of the cases is either follow-up examination, expert consensus, or confirmation by in-vivo confocal microscopy [20]. As shown in Fig. 2, some classes have fewer than 200 samples (such as DF and VASC), whereas the NV class has more than 6,000 samples, which makes the training process difficult because the predictions of the networks are more likely to be NV.
This dataset suffers from variance, noise, and imbalance. Moreover, adding a segmentation step before classification would give undesirable results if the segmentation is not as accurate as it should be, adding more noise to the data. In addition, removing hair from the images using the DullRazor technique affects the texture of the images, due to the median filter and the repainting of the hair pixels with neighbouring pixels; this technique also works on thin black hair only. Therefore, this study avoids any kind of de-noising or segmentation step.

CNN architecture

A. Convolutional layer
It is the fundamental layer in a CNN: a mathematical operation that combines two groups of data. Convolving a filter over the input gives a feature map, and concatenating these feature maps gives the convolutional layer's output. The filters' weights have random values at the beginning of training; during training, the optimizer changes these weights in the backpropagation process to improve the results of the model.
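As a minimal numpy sketch (not the paper's code), the convolution of a single filter over a single-channel input with "valid" padding looks like this; the input values and the edge-detecting kernel are illustrative only:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D input ('valid' padding) to produce one feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the receptive field by the kernel and sum.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])   # a simple vertical-edge filter
fmap = conv2d_valid(image, edge_kernel)
print(fmap.shape)  # (3, 3): a 4x4 input convolved with a 2x2 kernel
```

Stacking many such feature maps, one per filter, gives the layer's output volume; the kernel values here would be learned by backpropagation rather than fixed.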

B. Pooling layer
Pooling is used to reduce the spatial dimensions, only the height and width, while keeping the depth as it is. This helps reduce the number of parameters, so training becomes easier and faster, and it combats overfitting.
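A small numpy sketch (assuming 2x2 max pooling, the most common choice) shows how the height and width shrink while the depth is preserved:

```python
import numpy as np

def max_pool(x, size=2):
    """2x2 max pooling: halves height and width, keeps the channel depth."""
    h, w, c = x.shape
    x = x[:h - h % size, :w - w % size]               # drop ragged edges
    x = x.reshape(h // size, size, w // size, size, c)
    return x.max(axis=(1, 3))                          # take the max of each 2x2 window

feature_maps = np.random.rand(8, 8, 32)                # height x width x depth
pooled = max_pool(feature_maps)
print(pooled.shape)  # (4, 4, 32): spatial dims halved, depth unchanged
```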

C. Fully-connected layer
It is a normal dense layer where each node is connected to all nodes in the previous layer. The convolutional layers' output serves as the image features, so FC layers are added to classify these features into a number of classes. Before connecting the features to the FC layers, it is better to flatten these 3D features into a single-dimensional vector, which loses no information.
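The flattening step is a simple reshape; as a numpy sketch (with an illustrative feature-map size), it can be contrasted with global average pooling, which this paper later uses instead and which keeps only one mean value per channel:

```python
import numpy as np

conv_features = np.random.rand(6, 8, 512)   # H x W x C output of the last conv layer

# Flattening keeps every value, yielding one long vector (no information lost).
flat = conv_features.reshape(-1)
print(flat.shape)   # (24576,)

# Global average pooling keeps one spatial mean per channel instead.
gap = conv_features.mean(axis=(0, 1))
print(gap.shape)    # (512,)
```

A dense layer attached to `flat` needs 48x more weights than one attached to `gap`, which is why pooling before the FC layers is attractive for small datasets.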

D. Softmax layer
It is a dense layer whose number of nodes corresponds to the number of classes, with the Softmax activation function applied to it. This function computes the probability of each class, mapping the raw outputs to values between 0 and 1, using this equation:

softmax(z_i) = e^{z_i} / Σ_j e^{z_j}   (1)

Proposed method

A. Preprocessing

The pipeline shown in Fig. 4 explains the procedure followed for building the classification model. In the beginning, the HAM10000 dataset is divided into five folds, as shown in Tab. 1, in order to apply cross-validation. To ensure that enough samples of each class remain for testing, only five folds were chosen, because some classes have fewer than 150 samples, so that at least 19 samples are held out for testing in these classes. At every single check, only one fold is used for testing and the others for training. This helps in finding the actual accuracy of each network and prevents testing on biased data, which is a powerful technique for monitoring the behaviour of the model and for generalization.

The original size of the images in the HAM10000 dataset is 600x450. Therefore, two preprocessing steps are applied. First, all images are resized to 256x192 using bicubic interpolation; this step is important to reduce the computational cost and to decrease the probability of overfitting, and 256x192 was chosen to maintain the width-to-height ratio of the original images. The second step is to normalize the images to the range between zero and one before feeding them into the networks, by dividing each pixel by 255, the maximum value a pixel can have (the original pixel range is 0 to 255). Normalization helps the networks converge faster, since all data dimensions then have the same scale.
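The two preprocessing steps can be sketched as follows; this is not the paper's code, and it assumes Pillow for the bicubic resize, with a random dummy array standing in for a real HAM10000 image:

```python
import numpy as np
from PIL import Image

def preprocess(img):
    """Resize a 600x450 dermoscopic image to 256x192 (bicubic) and scale pixels to [0, 1]."""
    img = img.resize((256, 192), resample=Image.BICUBIC)  # (width, height) order in PIL
    x = np.asarray(img, dtype=np.float32) / 255.0         # pixels 0..255 -> 0..1
    return x

# Dummy image standing in for a HAM10000 sample (600x450 RGB).
dummy = Image.fromarray(np.random.randint(0, 256, (450, 600, 3), dtype=np.uint8))
x = preprocess(dummy)
print(x.shape)  # (192, 256, 3)
```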

B. Data Augmentation
Increasing the number of images in the dataset has a good influence on the training process, reducing overfitting and helping the model generalize well, provided the augmentation methods do not damage the useful information in the images. These are the selected methods: horizontal flipping, vertical flipping, random rotation in a range between +10 and -10 degrees, random zooming with a zoom factor of 0.1, and random horizontal and vertical shifts with a shift factor of 0.1.
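A minimal numpy sketch of the flip and shift augmentations (rotation and zoom are left to a library such as Keras's image preprocessing in practice); note that `np.roll` wraps pixels around, whereas a real pipeline would pad or reflect instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, shift_frac=0.1):
    """Randomly flip an image and shift it by up to shift_frac of its size."""
    if rng.random() < 0.5:
        x = np.flip(x, axis=1)                    # horizontal flip
    if rng.random() < 0.5:
        x = np.flip(x, axis=0)                    # vertical flip
    h, w = x.shape[:2]
    dy = rng.integers(-int(h * shift_frac), int(h * shift_frac) + 1)
    dx = rng.integers(-int(w * shift_frac), int(w * shift_frac) + 1)
    x = np.roll(x, (dy, dx), axis=(0, 1))         # wrap-around shift (illustrative only)
    return x

img = np.random.rand(192, 256, 3)
aug = augment(img)
print(aug.shape)  # unchanged: (192, 256, 3)
```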

C. Evaluating metrics
Many kinds of metrics are used to evaluate the proposed models and to compare them with previous works. The evaluation metrics are shown in Eqs. 2-7:

Precision = TP / (TP + FP)   (2)

Recall = TP / (TP + FN)   (3)

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (4)

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)   (5)

AUC: the Area Under the Receiver Operating Characteristic (ROC) Curve, generated by plotting the TPR (y-axis) against the FPR (x-axis):

TPR = TP / (TP + FN)   (6)

FPR = FP / (FP + TN)   (7)

This paper investigates the performance of several CNNs that achieved good performance on ImageNet. The predictions of the models for the holdout test set are reported in Tab. 2.
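The metrics in Eqs. 2-7 can be computed directly from confusion-matrix counts; a small sketch with illustrative counts:

```python
def metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics of Eqs. 2-7 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # equals TPR, the y-axis of the ROC curve
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                         # the x-axis of the ROC curve
    return precision, recall, accuracy, f1, fpr

p, r, acc, f1, fpr = metrics(tp=80, fp=10, fn=20, tn=90)
print(round(p, 3), round(r, 3), round(acc, 3), round(f1, 3), round(fpr, 3))
```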
All of the chosen networks were pre-trained on more than 1.5 million images from the ImageNet dataset to classify 1,000 different classes. Therefore, transfer learning can help train the networks with a small dataset and a small number of epochs.
All networks were prepared and trained as follows. First, the pre-trained networks are taken with their original weights, and the original fully connected (FC) layers are replaced by new FC layers whose weights are initialized using Glorot Uniform for the kernels and zeros for the biases. Global Average Pooling is used to connect the last convolutional layer to the new FC layer with 512 neurons, because it is very effective at flattening the convolutional output and reducing its dimension: it applies a kernel with the same dimensions as the feature map, replacing each channel with its average value. This has many benefits, such as reducing the number of parameters, resisting overfitting, and making the model less sensitive to the location of the lesion in the source images, in addition to the downsampling already performed in each layer of the original architecture. A dropout layer with a 0.5 dropout factor is added after the FC layer to resist overfitting by ignoring some units of the FC layer, which becomes sparse. Finally, a Softmax layer is added on top of each network to obtain the classification results. The Adam optimizer is used to train the models with an initial learning rate of 0.001, reduced by a factor of 0.5 whenever the accuracy has not improved for 3 epochs. Before training the entire network, it is necessary to train just the new layers for 3 epochs while the other layers are frozen. This step has a good influence on training: the loss is very large at the beginning because of the new layers, and weight updates as large as those needed by the new layers would damage the pre-trained layers. Weighted categorical cross-entropy, shown in formula (8), is used to deal with data imbalance:

L = − Σ_x Σ_i w_i p_i(x) log(q_i(x))   (8)
Where w_i is the weight of class i, p(x) is the ground-truth label, and q(x) is the predicted value from the Softmax layer. Class weights were assigned to the seven classes, ordered as ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']. However, during training, an additional weight was added to different classes for different networks to make each network more sensitive to a specific class. For example, 6.0 is set as the weight of the first class (AKIEC) while training the VGG16 network, and accordingly this network shows the best results for that class. Figure 5 displays the confusion matrix and AUC-ROC for the best networks, which will be used.
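The weighted categorical cross-entropy for a single sample can be sketched in numpy as follows; the weight vector here is illustrative (only the 6.0 for AKIEC is taken from the text, the other values are not the paper's):

```python
import numpy as np

def weighted_cce(p, q, w, eps=1e-7):
    """Weighted categorical cross-entropy for one sample:
    L = -sum_i w_i * p_i * log(q_i), with p the one-hot ground truth
    and q the Softmax output."""
    q = np.clip(q, eps, 1.0)          # guard against log(0)
    return -np.sum(w * p * np.log(q))

# Seven classes in the order ['akiec','bcc','bkl','df','mel','nv','vasc'].
w = np.array([6.0, 1.0, 1.0, 5.0, 2.0, 0.5, 5.0])   # illustrative weights
p = np.array([1, 0, 0, 0, 0, 0, 0])                 # ground truth: AKIEC
q = np.array([0.2, 0.1, 0.1, 0.1, 0.2, 0.2, 0.1])   # a Softmax output
loss = weighted_cce(p, q, w)
print(round(loss, 3))  # 6.0 * -log(0.2) ≈ 9.657
```

Because only the ground-truth class contributes, up-weighting a rare class directly scales its gradient, which is how the loss counteracts the NV-heavy imbalance.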
As ResNet152, VGG16, DenseNet169, and Xception achieved the best accuracy among the networks, 5-fold cross-validation was applied to ensure that these models are not biased towards any class. The results of cross-validation are shown in Tab. 3, where only a slight variation in accuracy was detected. Therefore, the training and testing data can be considered unbiased.

E. Fusing models
After training all networks individually using the same data and procedure, ResNet152, VGG16, DenseNet169, and Xception were chosen for the ensemble because of their good behaviour at classifying different classes; each network is superior on a particular class. The trained networks are merged by averaging the outputs of the last layer of each network, instead of taking only one output value from each network and then selecting the highest probability. This strategy exploits the ability of each network to distinguish between all classes. It is done by taking the outputs of the last layer of each network with a linear activation function instead of Softmax; the outputs are then added together to compute the average for each class individually. Thus, the model gives seven values, one per class, and the result is the maximum of them. This combines the different behaviour of the networks on the classes to give better results than any of the models could achieve individually. As shown in Tab. 4, after ensembling, the network achieved a better F1-score than all individual networks for four different classes. Comparing the AUC of the proposed model and its individual networks on the holdout dataset, in Figs. 5 and 6, it is noticeable that the proposed model behaves similarly to its parents in distinguishing between classes, with a slight improvement. Furthermore, the confusion matrices show a significant improvement in the accuracy of the proposed model compared to its parents: the accuracy increased from 86-87% to 90%.
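The fusion step can be sketched as follows; the per-network last-layer outputs here are made-up numbers for a single image, not real model outputs:

```python
import numpy as np

CLASSES = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']

def ensemble_predict(per_model_outputs):
    """Average the last-layer (linear-activation, pre-Softmax) outputs of the
    member networks per class, then pick the class with the highest mean."""
    avg = np.mean(per_model_outputs, axis=0)      # (n_models, 7) -> (7,)
    return CLASSES[int(np.argmax(avg))], avg

# Illustrative last-layer outputs from the four networks for one image.
outputs = np.array([
    [0.1, 0.2, 0.1, 0.0, 2.1, 1.9, 0.0],   # ResNet152
    [0.0, 0.1, 0.2, 0.1, 1.8, 2.0, 0.1],   # VGG16
    [0.2, 0.0, 0.1, 0.0, 2.4, 1.7, 0.0],   # DenseNet169
    [0.1, 0.1, 0.0, 0.2, 2.0, 2.2, 0.1],   # Xception
])
label, avg = ensemble_predict(outputs)
print(label)  # mel
```

Here two networks individually favour NV, but the averaged per-class scores let the stronger MEL evidence from the other two win, which is exactly the behaviour the averaging strategy is meant to exploit.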

F. Computational Cost
The computational cost of a model is an important parameter in defining its quality. It determines the training time of each network and the execution time of the model in online mode. In many cases, models are intended to work online, so they should be fast and suitable for the device they will run on. The computational cost of the proposed model is high because it uses four different deep network architectures, which means the number of parameters, and therefore the execution time, is very high. To decrease the cost, global average pooling was used in each network, in addition to using a small number of neurons in the dense layer of each network (512 instead of 4096 or 2048 as in the original networks). The proposed model has 109,817,668 parameters, and loading all its weights takes around 8 seconds on Google Colab's configuration. Table 4 compares the ensemble model with the individual models in terms of both F1-score and accuracy.

Results
The normalized confusion matrix and AUC-ROC are shown in Fig. 6. The final model achieved 98.85% top-3 accuracy, 96.61% top-2 accuracy, 90.02% top-1 accuracy, 98.4% AUC, 90.34% precision, and 90.00% recall. State-of-the-art approaches evaluated with the same metrics for the same purpose are included for a comprehensive comparison with the proposed model. Table 5 displays the results of previous works in addition to the top-5 ranked models in the ISIC 2019 competition, an annual competition created to support the efforts of researchers and developers worldwide in finding a solution to this complicated problem. As shown in Tab. 5, the proposed model achieved better performance than any other model in terms of accuracy, precision, recall, and AUC, although some of these studies used the same number of networks [17,23]. Furthermore, using shallower networks as in [22], or fewer networks as in [19,22], could be better in terms of complexity and computational cost, but performs significantly worse than the proposed model. Moreover, a single network could be more efficient on low-performance computers, so many models were implemented using only one network, either by exploiting predefined networks such as EfficientNetB4 [24] or by building customized networks [25]; however, they still fall short in accuracy. In addition, the proposed model performs better than models used to classify fewer kinds of lesions (two [22] or three [23]), and than models trained with a different loss function [24].
On the other hand, the proposed ensembling technique performed better than other techniques such as simple averaging [19], voting [23], or using an SVM to fuse networks [22], as well as models that applied transfer learning without retraining all layers, preserving the original weights of the networks. The AUC and recall results are much higher thanks to the thorough training, the ensembling technique, and the power of the weighted categorical cross-entropy loss in treating imbalanced data.
Therefore, the proposed model can be helpful even if it is only used to give dermatologists multiple high-accuracy diagnosis choices, leaving the final decision to them.

Conclusions
Convolutional neural network applications for skin lesion detection are promising. This study investigated the power of the proposed ensembling approach, in which four kinds of CNN architectures are trained individually using transfer learning and a weighted categorical cross-entropy loss function to rebalance the training process. The proposed ensembling technique then fuses the last layer of every network to raise the accuracy significantly and overcome their individual weaknesses, even for classes that lack training samples.
Averaging the last layer of each network, instead of taking the output of each one and averaging those, increased the accuracy of the models from 86-87% to 90%. The proposed model achieved high accuracy even though no segmentation, de-noising, or colour-space conversion steps were added to the classification pipeline. The proposed model does need a large amount of memory because it uses four different architectures; however, this can be accepted given its good performance.
Finding useful methods for segmentation and de-noising could give better results by removing harmful signals and reducing overfitting. Furthermore, the number of networks could be extended to include additional CNNs, while choosing shallower networks could reduce overfitting and decrease the memory required for the final model.
References

Lee, T., Ng, V., Gallagher, R., Coldman, A., & McLean, D. (1997). DullRazor: A software approach to hair removal from images. Computers in Biology and Medicine, 27(6), 533-543.