A multiple classifier system for early melanoma diagnosis


Artificial Intelligence in Medicine 27 (2003) 29–44


Andrea Sboner (a,*), Claudio Eccher (a), Enrico Blanzieri (b), Paolo Bauer (c), Mario Cristofolini (d), Giuseppe Zumiani (c), Stefano Forti (a)

(a) ITC-irst, Centre for Scientific and Technological Research, Via Sommarive 18, Povo, Trento 38050, Italy
(b) Department of Information and Communication Technology, University of Trento, Via Sommarive 14, Povo, Trento 38050, Italy
(c) Department of Dermatology, S. Chiara Hospital, L.go Medaglie d'Oro 8, Trento 38100, Italy
(d) Lega per la Lotta contro i Tumori, Sezione Trentina, Corso 3 Novembre 134, Trento 38100, Italy

Received 21 January 2002; received in revised form 14 August 2002; accepted 27 September 2002

Abstract

Melanoma is the most dangerous skin cancer and early diagnosis is the key factor in its successful treatment. Well-trained dermatologists reach a diagnosis via visual inspection and achieve sensitivity and specificity levels of about 80%. Several computerised diagnostic systems using different classification algorithms have been reported in the literature. In this paper, we illustrate a novel approach in which a suitable combination of different classifiers is used to improve the diagnostic performance of single classifiers. We used three different kinds of classifiers, namely linear discriminant analysis (LDA), k-nearest neighbour (k-NN) and a decision tree, the inputs of which are 38 geometric and colorimetric features automatically extracted from digital images of skin lesions. Multiple classifiers were generated by combining the diagnostic outputs of single classifiers with appropriate voting schemata. This approach was evaluated on a set of 152 digital skin images. We compared the performance of multiple classifiers (2- and 3-classifier groups) with one another and with that of single classifiers (1-classifier group). We further compared the classifiers' performance with that of eight dermatologists. Classifiers' performance was measured in terms of distance from the ideal classifier. Compared with the 1- and 2-classifier groups, the performance of 3-classifier systems was significantly higher (P < 0.0005 and P < 0.001, respectively). No statistically significant differences were found between the 1- and 2-classifier groups (P = 0.352). While the dermatologists group showed a level of performance significantly higher than the 1-classifier systems (P < 0.020), no differences were found between the multiple classifier groups and the dermatologists group, indicating comparable performance. This work suggests that a suitable combination of different kinds of classifiers can improve the performance of an automatic diagnostic system. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Decision support system; Classifier combinations; Cost-sensitive learning; Early melanoma diagnosis; Machine learning

* Corresponding author. Tel.: +39-461-314-425; fax: +39-461-810-851. E-mail address: [email protected] (A. Sboner).
0933-3657/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0933-3657(02)00087-8

1. Introduction

Melanoma is the most dangerous skin cancer. Even though melanoma represents only 5% of skin tumours overall, it causes 91% of the deaths from skin tumours. It develops from melanocytes, skin cells that produce the protective pigment melanin. Melanoma is capable of deep invasion. This is its most dangerous characteristic, as melanoma can spread widely over the body via the lymphatic vessels and the blood vessels. For this reason, the early diagnosis of melanoma is the key factor for the prognosis of this disease.

The incidence of melanoma is increasing world-wide [20,26]. The diagnosis of this kind of cancer is difficult and requires a well-trained dermatologist, because the early lesion can have a benign appearance. Several studies have shown that the diagnostic accuracy of a trained dermatologist is about 75% for early melanomas but drops to 30% for non-specialists [10].

The usual clinical practice of melanoma diagnosis is a visual inspection of the skin. Epiluminescence microscopy (ELM) is a method that, by using oil at the skin–microscopy interface, greatly increases the morphological details that are visualised, providing additional diagnostic criteria to the dermatologist [19,24]. This inspection is driven by coded procedures, which investigate several features of the lesion and are particularly useful to the experienced dermatologist. One of these methods is the so-called ABCD rule, whereby the physician assesses the asymmetry of the lesion, the irregularity of the border, the presence and the distribution of the colour and the presence of some differential structures (brown globules, black dots, radial streaming, pseudopods, etc.) [23]. However, this clinical procedure is not always sufficient to allow recognition of malignant lesions, making it necessary to consider other information, such as age, sex and sun exposure, derivable from medical records.

The diagnostic performance is usually evaluated in terms of sensitivity and specificity.
The measures are defined as:

$$\text{sensitivity} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false negatives}}$$

$$\text{specificity} = \frac{\#\text{true negatives}}{\#\text{true negatives} + \#\text{false positives}}$$

where #true positives and #false negatives are the numbers of melanomas correctly classified and incorrectly classified as nevi, respectively. Similarly, #true negatives and #false positives are the numbers of nevi correctly classified and incorrectly classified as melanomas, respectively. Binder et al. [2] recently reported a performance for the ABCD rule of 81% sensitivity and 77% specificity.
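As a minimal sketch of the two definitions above, the following Python functions compute sensitivity and specificity from the four counts. The function names and the example counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of melanomas that were classified as melanomas."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("sensitivity is undefined without melanoma cases")
    return true_positives / total


def specificity(true_negatives, false_positives):
    """Fraction of nevi that were classified as nevi."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("specificity is undefined without nevus cases")
    return true_negatives / total


# Hypothetical counts for a set of 42 melanomas and 110 nevi.
example_se = sensitivity(true_positives=34, false_negatives=8)
example_sp = specificity(true_negatives=85, false_positives=25)
print(f"sensitivity = {example_se:.2f}, specificity = {example_sp:.2f}")
```

Raising an error when a class is absent is a design choice of this sketch: both measures are ratios within one class and have no meaningful value when that class has no cases.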

Given the characteristics of the disease, screening of the population could be a viable way of minimising the risks. However, specific training of dermatologists for melanoma diagnosis takes time and resources [3]. On the other hand, specifically trained dermatologists face daily the task of critical diagnosis of early malignant skin lesions. The levels of specificity and sensitivity required by the two tasks are different. In the former case, high sensitivity is preferred, while in the latter case specificity and sensitivity should be comparable.

In recent years, the introduction of digital systems in medical practice has changed the diagnostic approach, especially in dermatology, where the image is fundamental for the diagnosis. In particular, digital epiluminescence microscopy (D-ELM) allows the determination, in vivo, of several morphological and structural characteristics of skin lesions. D-ELM has enabled the development of several automated systems for early diagnosis of melanoma. In fact, declarative knowledge about the diagnostic domain seems insufficient to provide acceptable levels of performance (see Binder et al. [2]). Learning techniques that rely on the information contained in available data have been the preferred option in many systems. These systems therefore process the images to obtain quantitative parameters that the dermatologist may not even recognise. In particular, D-ELM permits the computation of high-order features (such as the fractal dimension of colours and the average hue of the lesion). After the first experiences with SkinView [8,11], several automatic systems were proposed for the early diagnosis of melanoma, using different approaches. The classification systems generally used in this field are discriminant analysis, decision trees and neural networks [1,4,14,15,18,28–30].
Despite research efforts, there are no standard procedures from the medical specialist's perspective and no standard systems from the technological point of view for the precise diagnosis of early malignant melanoma. In this kind of medical application, it is important to reduce false negatives, because they represent melanomas diagnosed as benign nevi. Therefore, it is more important to improve the sensitivity, i.e. to recognise the greatest number of melanomas, without misclassifying too many nevi. The need to use data and learning techniques in order to improve performance requires a proper choice of the learning algorithms and of their statistical validation. The problem is difficult given the relative paucity of lesion data, the consequently low quantity of training data available, and the imbalance between the classes. The techniques should easily integrate features extracted from the image with features of a different kind.

We evaluate the performance of some classifiers, which eventually produce the suggested diagnosis of a pigmented skin lesion (PSL) starting from features extracted from a digital image. More specifically, we combined three kinds of classifiers in order to improve the performance of single systems, especially in the recognition of malignant lesions. In the next section we present our approach to the choice of classifier combinations.

2. Material and methods

The melanoma diagnosis system (MEDS) uses automatic classification given by three different kinds of classifiers. The classification is based on features extracted from digital
images clinically acquired. Hereinafter, we briefly describe the architecture and the components of the system. The MEDS architecture has four main components:

- the D-ELM image acquisition component, the main function of which is the acquisition of an image of the pigmented skin lesion;
- the image processing component, which analyses the digital image, producing a vector of features;
- the multi-classifier component, which applies and combines linear discriminant analysis, decision tree and k-nearest neighbour;
- the graphical user interface component.

The system functional architecture is shown in Fig. 1.

Fig. 1. Overview of the overall system.

2.1. Image acquisition

The first component performs the acquisition of D-ELM images. We used a Leica WILD M-650 stereomicroscope with a SONY 3CCD DXC-930P colour camera. The camera is linked to an AT-Vista Videographics acquisition board, which digitises the analogue image from the microscope (DBDERMO MIPS, Dell'Eva/Burroni Studio, Florence/Siena, Italy). The digital images are then stored in a database for further processing and display to the user. The digital image size is 768 × 576 pixels with 24-bit colour depth.

2.2. Image processing

The image processing component elaborates the digital image in order to extract a vector of features. We used the Leica Q570 morphometer to perform image analysis. This device allows defining proper algorithms for image analysis, providing basic image processing macros (such as adding or subtracting two images, applying kernels, converting colour spaces, thresholding and so on).

We analysed the red, green and blue images of the RGB colour space and the hue, saturation and value images of the HSV colour space, which is obtained by standard conversion formulas (see Gonzalez and Woods [17]). Hue refers to the main wavelength of the colours: for instance, hue values of 0, 84 and 168 represent red, green and blue, respectively. Saturation denotes the purity, or the amount of whiteness, in the colour. In the saturation image, pixels of pure colours, i.e. pixels in which at least one primary colour is missing, have a saturation value of 255, while pixels without colour (i.e. black, grey and white) have a value of 0. Value represents the brightness of the image. In the value image, the grey level of a pixel is that of the brightest among the corresponding pixels (same coordinates) in the R, G and B images. This choice improves image understanding and analysis, because the HSV colour space is closer to the human perception of colours than the RGB colour space.

The image processing component performs two functions: segmentation and feature extraction. The purpose of segmentation is to define the region of the lesion, separating it from the normal skin. The output of the procedure is the so-called binary plane, i.e. a 1-bit image that separates regions using ones (the lesion) and zeros (the normal skin). The binary plane is the basis for computing the vector of numerical features, which is the purpose of the feature extraction module.
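The saturation and value definitions given in Section 2.2 can be illustrated with a short Python sketch. It assumes the standard HSV relations (value is the maximum of R, G and B; saturation is the relative spread between the maximum and the minimum) scaled to the 0–255 range; the exact formulas implemented on the morphometer are not given in the text, so the function name and the rounding are assumptions.

```python
def saturation_and_value(red, green, blue):
    """Return (saturation, value) in 0..255 for one 8-bit RGB pixel.

    Assumption: standard HSV conversion, rescaled to 8 bits.
    """
    value = max(red, green, blue)      # brightest of the three channels
    lowest = min(red, green, blue)
    if value == 0:                     # black pixel: no colour at all
        return 0, 0
    saturation = round(255 * (value - lowest) / value)
    return saturation, value


# A pixel with one primary colour missing is fully saturated,
# while a grey pixel has no saturation.
print(saturation_and_value(200, 100, 0))
print(saturation_and_value(128, 128, 128))
```

Under these assumptions, any pixel in which at least one channel is zero gets a saturation of 255, and black, grey and white pixels get 0, which matches the behaviour described in the text.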
The features are grouped into geometric, morphologic and colorimetric ones. Table 1 shows the list of computed features.

Table 1. Features extracted by the image processing module

Geometric and morphologic parameters
  - Area
  - Perimeter
  - Roundness
  - Aspect ratio
  - Fullness ratio

Colorimetric parameters
  Global:
  - Average hue of the lesion
  - Standard deviation of hue of the lesion
  - Average saturation of the lesion
  - Standard deviation of saturation of the lesion
  - Fractal dimension of the blue image
  - Fractal dimension of the green image
  - Fractal dimension of the red image
  - Fractal dimension of the hue image
  - Fractal dimension of the saturation image
  Local:
  - Whitish veil zone
  - Reddish zone
  - Brown zone
  - Number of colours
  Distribution:
  - Distance of the red image centre of gravity from the lesion centre of gravity
  - Distance of the green image centre of gravity from the lesion centre of gravity
  - Distance of the blue image centre of gravity from the lesion centre of gravity
  - Average distance of the RGB images centre of gravity from the lesion centre of gravity
  - Distance of the hue image centre of gravity from the lesion centre of gravity
  - Distance of the saturation image centre of gravity from the lesion centre of gravity
  - Distance of the value image centre of gravity from the lesion centre of gravity
  - Average distance of the HSV images centre of gravity from the lesion centre of gravity
  - Distance of the coloured zones centre of gravity from the lesion centre of gravity
  - Normalised distance of the coloured zones centre of gravity from the lesion centre of gravity
  - Hue difference on the border
  - Saturation difference on the border
  - Hue value on the border: average
  - Hue value on the border: variation coefficient
  - Saturation value on the border: average
  - Saturation value on the border: variation coefficient
  - Discretisation of the whitish veil zone
  - Discretisation of the reddish zone
  - Discretisation of the light-brown zone
  - Discretisation of the dark-brown zone

2.2.1. Geometric and morphologic parameters

The geometric and morphologic parameters measure the dimensions of the lesion (area and perimeter) and its shape characteristics (roundness, aspect ratio, fullness ratio). These parameters represent the basic features of a lesion and are computed directly from the binary plane.

2.2.2. Colorimetric parameters

The colorimetric features quantitatively describe concepts such as the presence of specific colours, their distribution and granularity, the irregularity of the pigmentation on the border of the lesion, etc. A chromaticity characterisation of each lesion is obtained by identifying four regions of interest: namely, dark- and light-brown (DB, LB), reddish (RD) and whitish veil (WV) regions. Each of these regions is characterised by a specific combination of ranges of hue, saturation and value, and is represented by a final binary mask obtained by "ANDing" together the three masks computed from the hue, saturation and value images. For example, a hue interval of [13,167], a saturation interval of [120,255] and a value interval of [0,137] identify the DB region. The ranges were established by a group of expert dermatologists evaluating a different set of images, acquired with the same device under the same conditions as in the present study.

The extracted features resemble the ABCD rule used by a dermatologist to diagnose a skin lesion [23]. We focused our attention on the colorimetric features, because of their
difficulty of determination by dermatologists, whereas they are easily available when using an automatic system, and of their great diagnostic relevance, especially for early lesions. The extracted features are then used by the multi-classifier component to perform its tasks.

2.3. Multi-classifier

We apply three classification systems to the problem of early diagnosis of melanoma. Our system then combines their outputs so as to enhance the overall performance in recognising malignant lesions. MEDS uses three kinds of classifiers and their combinations: linear discriminant analysis (LDA), decision trees and k-nearest neighbour (k-NN). These classifiers were chosen for several reasons. In particular, the output of the three methods preserves comprehensibility: a probability for LDA, a rule for the decision tree and the nearest cases for k-NN. Moreover, the voting schemata we adopted to combine them are easily understandable and can be set as a parameter.

In this work, we used a modified form of the nearest neighbour algorithm that we called k-NN-Uni. Given that the most important measure to improve is sensitivity, the decision rule of k-NN-Uni is "search the database of images, retrieving the k cases nearest to the one you are evaluating. Then, if there is a melanoma among them, classify the new case as melanoma". This rule improves the sensitivity of the k-NN classifier, but decreases its specificity. Therefore, we evaluated this algorithm for different values of k to explore the best compromise between sensitivity and specificity. In particular, we tested standard k-NN with k equal to 1, 3, 5, 7 and 9, whereas for k-NN-Uni k ranged from 2 to 9.

2.3.1. Cost-sensitive learning

The nature of the learning problem, namely imbalanced classes and different relative importance given to misclassification errors, requires adopting techniques of cost-sensitive learning.
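The k-NN-Uni decision rule quoted in Section 2.3 can be sketched in Python next to the standard majority-vote k-NN. This is an illustrative sketch rather than the authors' Visual C++ implementation: it assumes a Euclidean metric on feature vectors already normalised to [0,1], and the function names and the toy cases are hypothetical.

```python
import math


def nearest_labels(query, cases, k):
    """Labels of the k stored cases closest to the query (Euclidean metric)."""
    ranked = sorted(cases, key=lambda case: math.dist(query, case[0]))
    return [label for _, label in ranked[:k]]


def knn_majority(query, cases, k):
    """Standard k-NN: melanoma only if most of the k neighbours are melanomas."""
    labels = nearest_labels(query, cases, k)
    return "melanoma" if labels.count("melanoma") > k / 2 else "nevus"


def knn_uni(query, cases, k):
    """k-NN-Uni: melanoma as soon as one of the k neighbours is a melanoma."""
    labels = nearest_labels(query, cases, k)
    return "melanoma" if "melanoma" in labels else "nevus"


# Hypothetical two-feature cases, values already scaled to [0, 1].
toy_cases = [
    ((0.10, 0.10), "nevus"),
    ((0.20, 0.10), "nevus"),
    ((0.15, 0.20), "nevus"),
    ((0.30, 0.30), "melanoma"),
    ((0.90, 0.90), "melanoma"),
]
new_case = (0.12, 0.12)
for k in (3, 4):
    print(k, knn_majority(new_case, toy_cases, k), knn_uni(new_case, toy_cases, k))
```

In the toy data the fourth-nearest case is a melanoma, so with k = 4 the two rules disagree. This illustrates why raising k pushes k-NN-Uni towards higher sensitivity and lower specificity.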
In order to improve sensitivity we altered the prior probabilities of the classes as described in [21]. We adopted different strategies for the three classifiers. The prior probabilities for the LDA were considered equal for each class, despite their imbalance. Discriminant analysis was performed via a multivariate analysis on features selected by means of a univariate analysis. In other words, we first selected the significant features by means of a univariate analysis on each training set (see Section 3.2). Then we performed a multivariate analysis using only those features. This reduces the dimensionality of the feature space (average number of features in the training partitions: 21 out of 38; see Section 3.2). For the decision tree, we adopted C4.5 [25] and preprocessed the data to increase the weight of malignant melanomas, as described in [6] and reported by Kukar et al. [21]. Finally, we adopted the Euclidean metric for the k-nearest neighbour (k-NN) and also for the high-sensitivity version of the nearest neighbour algorithm (k-NN-Uni), after normalising the values of the 38 features to the interval [0,1].

We further combined the single classification systems so as to improve the performance, and specifically the sensitivity. The rules we used are straightforward: if at least one of the classifiers diagnoses a melanoma, the output of the combination is melanoma (schema "1/2" for the combination of two classifiers and schema "1/3" for the combination of three classifiers). Equivalently, a new case is classified as a nevus only when all classifiers agree that the lesion is benign. However, when the combination of
three classifiers involved k-NN-Uni, we adopted the majority rule (schema "2/3"), because this classifier is strongly unbalanced toward sensitivity. It should be noted that, with the voting schemata "1/2" and "1/3", the sensitivity of the combined system cannot be smaller than that of the best single classifier, while its specificity cannot be greater than that of the worst single classifier.

2.3.2. Classifiers implementation

We used the SPSS statistical package (release 10.0, SPSS Inc., Chicago, IL, USA) to learn the LDA weights, and C4.5 to induce the decision trees. We developed the k-NN and k-NN-Uni algorithms in Visual C++ and linked them to the database of images, in order to retrieve the nearest cases and display them to the user.

2.4. Interface

The interface of MEDS is composed of three principal graphical user interfaces (GUIs). The acquisition GUI allows the physician to interact with the D-ELM acquisition component. The image processing GUI shows the results of the image processing analysis. The classifiers GUI shows the results of the classification process. The multi-classifier diagnosis is displayed, along with an explanation in terms of probability (LDA), rules (decision tree) and similar cases (k-NN).

3. Evaluation

Given the main goal of combining classifiers to enhance the performance of the single ones, we performed different validations of MEDS. To assess the effectiveness of the system, we evaluated the classifiers' sensitivity and specificity on a set of real-world cases, and compared the system against the performance of eight dermatologists on the same dataset. Hereinafter, we present the dataset and the evaluation procedure.

3.1. Medical dataset

The overall performance was assessed by means of a database of digital images and histological diagnoses. This dataset is composed of 152 D-ELM skin images acquired at the Department of Dermatology, S. Chiara Hospital, Trento.
The lesions were excised and then diagnosed by pathologists as 42 melanomas and 110 nevi. The diagnosis of the pathologist is usually considered the gold standard. Our dataset did not include dysplastic lesions. Breslow's thickness is a clinical parameter linked to the prognosis of the disease [7]. It can be determined only after the excision, by histological analysis. For example, the 5-year survival is 99% for melanomas thinner than 0.85 mm, 94% for those between 0.85 and 1.70 mm, 78% for those between 1.70 and 3.65 mm, and 42% for those thicker than 3.65 mm [12]. The average Breslow's thickness for our lesions is 1.0 ± 0.7 mm, and 90% of them are thinner than 1.70 mm. This fact confirms the earliness of the melanomas involved.
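The voting schemata "1/2", "1/3" and "2/3" introduced in Section 2.3.1 all reduce to one rule: output melanoma when at least a given number of classifiers say melanoma. The following Python sketch shows that rule; the function name and the `min_votes` parameter are hypothetical, chosen for illustration only.

```python
def combine(diagnoses, min_votes):
    """Combine single-classifier diagnoses with an 'n out of m' voting schema.

    diagnoses: list of "melanoma" / "nevus" strings, one per classifier.
    min_votes: melanoma votes needed for a combined melanoma diagnosis
               (1 for schemata "1/2" and "1/3", 2 for schema "2/3").
    """
    melanoma_votes = sum(1 for d in diagnoses if d == "melanoma")
    return "melanoma" if melanoma_votes >= min_votes else "nevus"


# Schema "1/3": a single melanoma vote is enough.
print(combine(["nevus", "melanoma", "nevus"], min_votes=1))
# Schema "2/3" (used when k-NN-Uni takes part): a majority is required.
print(combine(["nevus", "melanoma", "nevus"], min_votes=2))
```

With `min_votes=1`, every melanoma detected by any single classifier is also detected by the combination, which is why the combined sensitivity cannot fall below that of the best single classifier.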

3.2. Evaluation procedure

One of the major issues in evaluating the performance of a classifier is to obtain an estimate of its behaviour on new cases. As in most situations, we have only a limited set of known cases, so a technique that addresses this problem is needed. The training and test method consists of dividing the cases into two sets: one is used in the learning phase of the classifier (the training set), and the other is used only for testing. The test cases stand in for new cases in the real world, and therefore yield an estimate of the actual performance. Unfortunately, this holds only if the number of training cases is sufficient to cover the whole population.

The 10-fold cross-validation allows evaluating the "real" performance of the system by means of iterated training and test evaluations. We first randomly divide the cases into 10 non-overlapping sets. Each of those sets in turn serves as the test set, while the remainder constitutes the training set. Next, we perform the training and test evaluation and compute the sensitivity and the specificity for each of the 10 sets. Finally, we average the results over the 10 partitions, obtaining an estimate of the real performance of the classifiers.

Sensitivity and specificity are two very important parameters for evaluating the performance of a classifier (either human or computerised). A comparison based on only one of these two parameters (e.g. sensitivity without specificity, or vice versa) may result in a misleading interpretation of the results. On the other hand, a comparison based on two parameters, which are correlated to some extent, is difficult to understand fully. Using accuracy, a single parameter taking into account both specificity and sensitivity, is not a suitable approach because of the imbalance between the classes (110 benign lesions versus 42 malignant lesions). For instance, a classifier that labels every lesion as benign would reach an accuracy of 110/152, about 72%, while detecting no melanoma at all.
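The 10-fold procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the names `make_folds`, `cross_validate` and `always_nevus` are hypothetical, the split is not stratified, and folds lacking one class are skipped when averaging the corresponding measure. The trivial classifier at the end only demonstrates why accuracy is misleading on 42 melanomas versus 110 nevi.

```python
import random
from statistics import mean


def make_folds(n_cases, n_folds=10, seed=0):
    """Randomly split case indices into n_folds non-overlapping test sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(features, labels, train, n_folds=10, seed=0):
    """Average per-fold sensitivity and specificity.

    train(train_features, train_labels) must return a predict(case) function.
    """
    sensitivities, specificities = [], []
    for test_idx in make_folds(len(labels), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predict = train([features[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        tp = fn = tn = fp = 0
        for i in test_idx:
            predicted = predict(features[i])
            if labels[i] == "melanoma":
                tp += predicted == "melanoma"
                fn += predicted != "melanoma"
            else:
                tn += predicted == "nevus"
                fp += predicted != "nevus"
        if tp + fn:  # assumption: skip folds that contain no melanoma
            sensitivities.append(tp / (tp + fn))
        if tn + fp:  # assumption: skip folds that contain no nevus
            specificities.append(tn / (tn + fp))
    return mean(sensitivities), mean(specificities)


def always_nevus(train_features, train_labels):
    """Trivial 'learner' that ignores its training data."""
    return lambda case: "nevus"


labels = ["melanoma"] * 42 + ["nevus"] * 110
features = [[0.0] for _ in labels]  # placeholder feature vectors
se, sp = cross_validate(features, labels, always_nevus)
print(f"sensitivity = {se:.2f}, specificity = {sp:.2f}, "
      f"accuracy = {110 / 152:.2f}")
```

The trivial classifier reaches roughly 72% accuracy with zero sensitivity, which is the reason the text sets accuracy aside in favour of a measure that weighs both classes.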
Therefore, in this study we defined a new measure to compare different classifiers, which enables us to give a simple estimate of how useful one classifier is with respect to another. Given that the ideal classifier has both sensitivity and specificity equal to 1.0, we define the distance of a real classifier from the ideal one (d_class) in this way:

$$d_{\text{class}} = \sqrt{(1 - Se)^2 + (1 - Sp)^2}$$

where Se and Sp are the sensitivity and specificity of the real classifier, respectively. If we plot sensitivity (X-axis) versus specificity (Y-axis) for each classifier, d_class can be readily interpreted as the Euclidean distance of a point from the top-right corner, which represents the ideal classifier. The smaller the distance, the better the classifier. By using this parameter instead of accuracy, we can carry out the comparison between classifiers in an accurate but intuitive way, avoiding the unbalanced classes problem.

4. Results

Fig. 2 shows the cross-validated results for the 1-, 2- and 3-classifier systems. The smaller the classifier's distance from the top-right corner, the better the classifier's performance. We can note that the single systems cover a wide range from the top-left part of the plot to the bottom-right. LDA, decision tree and k-NN classifiers are characterised by sensitivity ranging from 0.35 to 0.68 and specificity ranging from 0.83 to 0.97. As for k-NN-Uni, the sensitivity is very high (from 0.69 to 1.00 as k increases), while the specificity decreases significantly as k becomes greater (from 0.75 to 0.20), in accordance with the adopted modified decision rule, which strongly promotes sensitivity. Concerning the 2-classifier systems, we have a relatively great improvement in sensitivity coupled with a decrease in specificity (see Table 2), as expected because of the voting schema "1/2". In this case too, combinations involving k-NN-Uni lead to very high sensitivity values and to extremely low specificity ones. The 3-classifier systems form a cluster in between the extreme values of sensitivity and specificity of the other classifiers (see Fig. 2).

Fig. 2. Classifiers' performance.

Table 2. Classifier systems

Classifier group   #Classifiers   Sensitivity   S.D.   95% CI      Specificity   S.D.   95% CI
1-Classifier       15             0.73          0.23   0.62–0.84   0.66          0.28   0.51–0.80
2-Classifier       27             0.88          0.11   0.84–0.92   0.54          0.23   0.46–0.63
3-Classifier       13             0.81          0.04   0.79–0.83   0.74          0.08   0.70–0.79
Triple-NN          5              0.84          0.01   0.83–0.85   0.67          0.02   0.65–0.69
Triple-Uni         8              0.80          0.04   0.77–0.82   0.79          0.06   0.75–0.83

The average sensitivity and specificity for the three groups and two subgroups of classifiers and their combinations. S.D. and CI stand for "standard deviation" and "confidence interval", respectively.
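As an illustration of the distance from the ideal classifier defined in Section 3.2, the Python sketch below applies it to the group averages of Table 2. The function name `d_class` mirrors the symbol in the text; computing the distance from averaged sensitivity and specificity is an assumption made here for illustration and is not the same as averaging the distances of the individual classifiers, which is what a group comparison would rely on.

```python
import math


def d_class(sensitivity, specificity):
    """Euclidean distance from the ideal classifier at (1.0, 1.0)."""
    return math.hypot(1.0 - sensitivity, 1.0 - specificity)


# Average sensitivity and specificity per group, copied from Table 2.
group_averages = {
    "1-Classifier": (0.73, 0.66),
    "2-Classifier": (0.88, 0.54),
    "3-Classifier": (0.81, 0.74),
}
for name, (se, sp) in group_averages.items():
    print(f"{name}: d_class = {d_class(se, sp):.3f}")
```

On these averages the 3-classifier group lies closest to the ideal corner, in line with the ranking reported in the text.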

Table 3. Comparison among classifier groups

Group 1         Group 2         Mean rank 1   Mean rank 2   P-value
1-Classifiers   2-Classifiers   23.93         20.15         0.352
1-Classifiers   3-Classifiers   20.67         7.38
2-Classifiers   3-Classifiers   24.78         11.62