Terminated Ramp–Support Vector Machines: A nonparametric data dependent kernel




Neural Networks 19 (2006) 1597–1611 www.elsevier.com/locate/neunet

Stefano Merler, Giuseppe Jurman
ITC-irst, Via Sommarive 18, I-38050 Povo, Trento, Italy
Received 28 September 2004; accepted 28 November 2005

Abstract

We propose a novel algorithm, Terminated Ramp–Support Vector Machines (TR–SVM), for classification and feature ranking purposes in the family of Support Vector Machines. The main improvement relies on the fact that the kernel is automatically determined by the training examples. It is built as a function of simple classifiers, generalized terminated ramp functions, obtained by separating oppositely labeled pairs of training points. The algorithm has a meaningful geometrical interpretation, and it is derived in the framework of Tikhonov regularization theory. Its only free parameter is the regularization one, representing a trade-off between empirical error and solution complexity. Employing the equivalence between the proposed algorithm and two-layer networks, a theoretical bound on the generalization error is also derived, together with the Vapnik–Chervonenkis dimension. Performances are tested on a number of synthetic and real data sets.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Data-dependent kernel; Regularization; Support Vector Machines; Two-layer networks; Feature selection

1. Introduction

What is the optimal kernel for a given learning problem? Answering this question is always a key step, since recent kernel methods, e.g. Support Vector Machines (SVM) (Cortes & Vapnik, 1995; Vapnik, 2000), require only that one choose the kernel for the task at hand, optimize any kernel parameters (for kernels other than the linear one), and optimize the regularization parameter. These steps are usually tackled by user experience and literature knowledge, or by employing statistical model selection techniques. It is possible to use prior knowledge to develop an ad hoc kernel, but in general there is no a priori justification for the use of one kernel instead of another. This may affect the entire learning process, since it is well known that kernels other than the standard choices (Gaussian, polynomial) can improve the accuracy of the resulting classifier in many classification tasks (Cristianini & Shawe-Taylor, 2000). Moreover, model selection procedures might be misleading when the cardinality of the training set is rather small, as happens in the microarray data case: as highlighted in Simon, Radmacher, Dobbin, and McShane (2003), overfitting may occur when too many parameters (kernel type and kernel parameters) are involved.


In this paper we propose a solution which bypasses the kernel choice by automatically building the kernel directly from the data. This advance moves toward one of the most important goals in the field of Machine Learning: to provide learning systems that are as automatic as possible, in order to reduce the intervention of the human user to a minimum. As pointed out in Lanckriet, Cristianini, Bartlett, Ghaoui, and Jordan (2004), it would be desirable to explore model selection methods that allow kernels to be chosen in a more automatic way based on the data.

The algorithm is inspired by geometric considerations. Standard kernel methods are based on the assumption that each training example influences the labeling of a test example according to some decreasing function of its distance from the test point. Our approach focuses instead on pairs of differently labeled examples. The key observation is that the global separation surface can essentially be determined by combining the separations between the points of those pairs. Indeed, the greatest contribution is given by the pairs lying closest to the actual separation surface.


The aim of the proposed method is thus to identify these pairs of relevant training examples among all the pairs of oppositely labeled training examples. This idea comes naturally as a generalization of the standard geometric approach to the construction of SVM. In fact, the optimal hyperplane separating two linearly separable classes maximizes the distance between the hyperplane and the closest example. Alternatively, the optimal separating hyperplane is the one that maximizes the sum of the geometric margins of the pair of oppositely labeled examples $(x^+, x^-)$ with minimum functional margin (see, for example, the approach to SVM in Cristianini and Shawe-Taylor (2000)). As an example, when the data are linearly separable, the hyperplane passing through $(x^+ + x^-)/2$ and having normal vector parallel to the vector connecting $x^+$ and $x^-$, where, among all the pairs of oppositely labeled examples, $(x^+, x^-)$ is the pair with minimum distance, is a good candidate to be the optimal separating hyperplane.

Basically, the proposed algorithm explicitly constructs a mapping into a high-dimensional feature space, where every new feature is derived from a decision function separating a pair of oppositely labeled training points; afterward, a linear SVM is applied. A similar approach was first applied in Suykens and Vandewalle (1999b).

A sounder theoretical background for the proposed algorithm can be given by embedding it in the framework of Tikhonov regularization theory (see, for example, Cucker and Smale (2001), Evgeniou, Pontil, and Poggio (2000) and Poggio and Smale (2003)). This approach has its core in defining a function vector space depending on a suitable kernel, namely the Reproducing Kernel Hilbert Space (Aronszajn, 1950) associated to the chosen kernel. The required solution is then obtained by minimizing the Tikhonov functional. The novelty of our approach rests on two points: the former is the fact that a kernel can be univocally built from a given set of linearly independent functions (see Aronszajn, 1950), while the latter is that such a set of functions can be straightforwardly derived as the classifiers determined by all the pairs of oppositely labeled training points, thus inducing a data-driven kernel. As a remarkable consequence, the only free parameter of the proposed method remains the regularization one, a trade-off parameter between the empirical error and a measure of the complexity of the solution. As previously outlined, this result may be particularly useful in the microarray data context.

Data-dependent kernels have also been proposed by other authors. For instance, in Amari and Wu (1999) a method is introduced for modifying a preselected kernel in order to improve accuracy. More precisely, the training data are used to modify a Gaussian RBF kernel on the basis of the Riemannian geometrical structure of the space induced by the kernel itself. In our method, the kernel is completely induced by the training data, with no need for a preselected kernel function.

Finally, this algorithm has the advantage of providing, as a noticeable by-product, a basic tool to evaluate feature importance (Guyon & Elisseeff, 2003), and thus to perform a feature ranking. The algorithm appears to perform well both for classification and for ranking purposes: we support this claim through experiments on synthetic and benchmark data sets.
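Returning to the geometric construction sketched above (the hyperplane through $(x^+ + x^-)/2$ with normal $x^+ - x^-$ for the closest pair of oppositely labeled examples), the following is a minimal illustrative sketch of the idea in numpy. It is our own toy rendering, not code from the paper, and it assumes a linearly separable sample.

```python
import numpy as np

def midpoint_hyperplane(X_pos, X_neg):
    """Hyperplane through (x+ + x-)/2 with normal x+ - x-, where (x+, x-) is
    the closest pair of oppositely labeled examples (linearly separable case)."""
    # pairwise distances between all oppositely labeled examples
    dists = np.linalg.norm(X_pos[:, None, :] - X_neg[None, :, :], axis=-1)
    i, j = np.unravel_index(dists.argmin(), dists.shape)
    x_p, x_n = X_pos[i], X_neg[j]
    w = x_p - x_n                      # normal vector
    b = -w @ (x_p + x_n) / 2.0         # so that the midpoint lies on the hyperplane
    return w, b                        # predict with sign(w @ x + b)
```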

The paper is organized as follows: Section 2 deals with the theoretical background and sets the framework. Section 3 is the core of the paper, where the algorithm is introduced and detailed; Section 4 is dedicated to the analysis of the algorithm. Experiments are described in Section 5. Finally, Section 6 deals with some implementation issues.

2. Support Vector Machines

Let $F: X \to \{-1, 1\}$ be an unknown function and let $D = \{p_i \equiv (x_i, y_i)\}_{i=1}^{N}$ be a set of training examples, where $x_i \in X$ and $y_i \in \{-1, 1\}$. In order to approximate the function F, the following algorithm can be considered:

• Choose a Mercer kernel, i.e., a continuous, symmetric and positive definite function $K: X \times X \to \mathbb{R}$. Examples of such Mercer kernels are the Gaussian kernel $K(x, x') = e^{-\|x - x'\|^2/\sigma}$ and the polynomial kernel $K(x, x') = (1 + \langle x, x' \rangle)^d$. Observe that for any given $\bar{x}$, the function of a single variable $K_{\bar{x}}(x) = K(\bar{x}, x)$ can be defined.
• Choose the regularization parameter $C \in (0, +\infty)$.
• Define $f_C: X \to \mathbb{R}$ as follows:

  $f_C(x) = \sum_{i=1}^{N} c_i K_{x_i}(x),$   (1)

  where $c_i = y_i \alpha_i$ and the vector $\alpha$ is the solution of the quadratic programming problem

  $\max_{\alpha \in \mathbb{R}^N} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \quad \text{subject to } 0 \le \alpha_i \le C, \; i = 1, \ldots, N.$   (2)

• Define the classifier $\tilde{f}_C(x) = \mathrm{sign}(f_C(x))$. This is the parametric approximation of the unknown function $F(x)$.

Typically, the regularization parameter C is optimized by employing statistical model selection procedures, e.g. cross-validation; a brief illustrative sketch of these steps is given below.
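As a concrete illustration of the steps above, the following sketch (ours; the paper does not prescribe any particular solver) evaluates a Gaussian kernel and the decision function of Eq. (1), with the dual problem (2) delegated to scikit-learn's `SVC` through a precomputed kernel matrix.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X1, X2, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma)
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma)

def polynomial_kernel(X1, X2, degree=3):
    # K(x, x') = (1 + <x, x'>)^d
    return (1.0 + X1 @ X2.T) ** degree

rng = np.random.default_rng(0)                     # toy training set
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

K = gaussian_kernel(X, X)
svm = SVC(C=10.0, kernel="precomputed").fit(K, y)  # solves the dual problem (2)

def f_C(x_new):
    # Eq. (1): f_C(x) = sum_i c_i K_{x_i}(x), with c_i = y_i * alpha_i non-zero
    # only for the Support Vectors, plus the bias term maintained by the solver.
    k = gaussian_kernel(np.atleast_2d(x_new), X[svm.support_])
    return float(k @ svm.dual_coef_.ravel() + svm.intercept_[0])

print(np.sign(f_C(np.array([0.5, 0.5]))))
```

Selecting C (and, for standard kernels, their parameters) by cross-validation is exactly the model-selection burden that the TR–SVM introduced next aims to reduce.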

3. Terminated Ramp–Support Vector Machines

In this section we introduce the Terminated Ramp–Support Vector Machines (TR–SVM). Looking at any single training example $p_i$, standard kernel methods are based on the assumption that the class label to be assigned to a test point x is the class label $y_i$ of the point $x_i$, and that the influence of $p_i$ in determining the class of x is some decreasing function of the distance from $x_i$. Quite obviously, the choice of the kernel (and the estimation of its parameters) and the way in which the kernels are combined (i.e., the selection of the coefficients $c_i$ in Eq. (1)) determine the efficiency and the accuracy of the resulting classifier on the task at hand.

Our method stems from the observation that it is possible to capture local information about the separation surface by looking at the pairs of examples belonging to different classes. For each pair of examples $(p_i, p_j)$ with $y_i \neq y_j$, it is always possible to consider a submodel $k_{ij}(x) := k(x; p_i, p_j)$ separating the two, i.e., such that $y_i k_{ij}(x_i) > 0$ and $y_j k_{ij}(x_j) > 0$.


Fig. 1. Geometric interpretation of the TR–SVM: once the pairs $(x_i, x_j)$ of oppositely labeled training examples (filled circles and squares) lying close to the actual separation surface (dashed line) have been identified, the separation surface can be reconstructed locally by means of terminated ramp functions (the submodels $k_{ij}$), and, finally, a global model can be obtained by linearly combining the submodels.

A global model can be built by linearly combining such submodels (see Fig. 1).

As explained in Section 2, the key step in fitting a SVM is choosing the Mercer kernel, i.e., the inner product of a suitable high-dimensional feature space. This can be done by selecting a suitable positive definite matrix or, alternatively, by explicitly choosing the feature space and its feature mapping. In fact, given an explicit feature map $\phi(x)$, a Mercer kernel $K(x, x')$ can be constructed via the definition $K(x, x') = \phi(x)^T \phi(x')$. This approach was first used in Suykens and Vandewalle (1999b), where the hidden layer of a multilayer network was taken as the feature map. Following this approach, our method consists in the explicit construction of a high-dimensional feature vector where every feature is derived from the submodels $k_{ij}(x)$.

Let $I_1$ and $I_{-1}$ be the sets of the indexes of the examples belonging to class 1 and class $-1$ respectively, and let $T = \mathrm{Card}(I_1 \times I_{-1})$ be the number of pairs of oppositely labeled training examples¹: the feature mapping is explicitly given as

$\phi(x): \mathbb{R}^d \to \mathbb{R}^T, \quad x \mapsto (k_1(x), \ldots, k_T(x)),$   (3)

where, for the sake of simplicity, each $k_t(x)$ represents a generic submodel $k_{ij}(x)$ with $i \in I_1$ and $j \in I_{-1}$. The induced kernel is given by

$K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathbb{R}^T}.$   (4)

The classifier can be realized by fitting a linear SVM to the mapped data $\{(\phi(x_i), y_i)\}_{i=1,\ldots,N}$ (see Fig. 2) and, as expected, the solution is a linear combination of the submodels: $\langle a, \phi(x)\rangle + b = \sum_{t=1}^{T} a_t k_t(x) + b$.

¹ In the case of noisy labels, for example if $x_i = x_j$ and $y_i \neq y_j$, the associated decision model $k_{ij}(x)$ does not exist. However, it is sufficient to build the global model by excluding all the submodels associated with the noisy pairs.
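In code, the mapping of Eq. (3) and the induced kernel of Eq. (4) amount to stacking the submodel outputs and taking ordinary dot products. A minimal sketch follows, under the assumption that the submodels $k_t$ are available as plain Python callables (function names are ours):

```python
import numpy as np

def feature_map(X, submodels):
    # Eq. (3): phi(x) = (k_1(x), ..., k_T(x)), one column per submodel
    return np.column_stack([k_t(X) for k_t in submodels])

def induced_kernel(X1, X2, submodels):
    # Eq. (4): K(x, x') = <phi(x), phi(x')>_{R^T}
    return feature_map(X1, submodels) @ feature_map(X2, submodels).T
```

A linear SVM fitted to `feature_map(X, submodels)`, or equivalently any SVM solver fed the precomputed `induced_kernel`, then realizes the TR–SVM classifier.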

Fig. 2. Terminated ramp functions determined by the pairs of oppositely labeled training examples (the $k_{12}(x)$ and $k_{32}(x)$ functions) and the associated feature space, $\mathbb{R}^2$, and feature mapping (see Eq. (3)). A TR–SVM is a linear SVM fitted to the mapped data.

3.1. The generalized terminated ramp functions

The key step now is to define the submodels, i.e., the functions $k_{ij}(x)$. This consists in separating each pair of examples $(p_i, p_j)$ with $i \in I_1$ and $j \in I_{-1}$. To this aim, we generalized the terminated ramp functions² (Dally, Riley, & McConnell, 1993): as the classifier we identify the maximal margin hyperplane passing through the points $p_i = (x_i, y_i)$ and $p_j = (x_j, y_j)$, subject to an upper bound on its absolute value. Let $\langle w_{ij}, x \rangle + b_{ij}$ be the separating hyperplane; elementary algebra shows that $w_{ij}$ satisfies the following equation:

$w_{ij} = \alpha_{ij}\,(y_i x_i + y_j x_j).$   (5)

The coefficient $\alpha_{ij}$ and the offset $b_{ij}$ can therefore be computed by imposing that the hyperplane passes through the two points $p_i$ and $p_j$. The solution is given by

$\alpha_{ij} = \dfrac{y_j - y_i}{y_j \langle x_j, x_j \rangle - y_i \langle x_i, x_i \rangle - (y_j - y_i)\langle x_i, x_j \rangle}, \qquad b_{ij} = y_i - \alpha_{ij}\,\bigl(y_i \langle x_i, x_i \rangle + y_j \langle x_i, x_j \rangle\bigr).$

Finally, the terminated ramp function separating the examples $p_i$ and $p_j$ is obtained by applying a suitable squashing function $\sigma: \mathbb{R} \to [-1, 1]$ to the hyperplane: $k_{ij}(x) = \sigma(\langle w_{ij}, x \rangle + b_{ij})$. The adopted shape for $\sigma$ is the following piecewise-linear function:

$\sigma(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1, \end{cases}$

but different squashing functions $\sigma$ may be employed.

In Fig. 3 the kernels associated to three training examples in a simple classification problem are displayed. The data-driven kernel in Eq. (4) seems able to capture the local nature of the separation surface. In particular, when involving points close to the actual separating surface ($x_1$ and $x_3$) the kernel approaches a sigmoidal-like one (approximating the linear kernel), while the kernel associated to the point $x_2$ is Gaussian shaped.

Fig. 3. Kernels $K_{x_i}(x)$ (see Eq. (4)) associated to three training examples $x_i$. For the sake of clarity, each kernel $K_{x_i}(x)$ is multiplied by the label $y_i$ of the corresponding training example $x_i$. Empty circles represent data belonging to class 1, filled circles represent data belonging to class $-1$.

² The terminated ramp function $r: \mathbb{R} \to \mathbb{R}$ is defined in the literature as follows: $r(x) = 0$ for $x \le 0$, $r(x) = x$ for $0 < x < 1$ and $r(x) = 1$ for $x \ge 1$.
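The closed-form expressions above translate directly into code. The following sketch (ours, not the authors' implementation) builds the generalized terminated ramp $k_{ij}$ for one pair of oppositely labeled training points:

```python
import numpy as np

def squash(z):
    # piecewise-linear squashing function sigma: clip to [-1, 1]
    return np.clip(z, -1.0, 1.0)

def terminated_ramp(x_i, y_i, x_j, y_j):
    """Generalized terminated ramp k_ij separating p_i = (x_i, y_i) and p_j = (x_j, y_j)."""
    denom = y_j * (x_j @ x_j) - y_i * (x_i @ x_i) - (y_j - y_i) * (x_i @ x_j)
    # denom == 0 (e.g. coincident points with opposite labels) corresponds to the
    # noisy pairs that are excluded from the global model (see footnote 1).
    alpha = (y_j - y_i) / denom
    w = alpha * (y_i * x_i + y_j * x_j)                   # Eq. (5)
    b = y_i - alpha * (y_i * (x_i @ x_i) + y_j * (x_i @ x_j))
    return lambda X: squash(np.atleast_2d(X) @ w + b)     # k_ij(x) = sigma(<w_ij, x> + b_ij)
```

By construction $k_{ij}(x_i) = y_i$ and $k_{ij}(x_j) = y_j$; a list of such closures over all pairs $(i, j) \in I_1 \times I_{-1}$ is exactly what the feature map sketched above expects.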


Remark. A sounder theoretical background for the proposed algorithm can be given by embedding it in the framework of Tikhonov regularization theory, as happens for SVM. Classical results from this theory allow the construction of a regularized solution of a classification problem as a linear combination of a finite-dimensional class of base models (details are given in Appendix A.2). Within this framework, TR–SVM emerge naturally just by defining the finite-dimensional set of functions $G_0$:

$G_0 = \{1\} \cup \{k_{ij}(x) \mid i \in I_1, j \in I_{-1}\},$   (6)

where the constant function 1 takes care of the global offset (see Example 1 in Section 4.1), which can be included directly in the kernel definition. For the sake of clarity, the parameter t is introduced as an index running from 0 to $T = \mathrm{Card}(I_1 \times I_{-1})$, so that $G_0$ can be rewritten as follows:

$G_0 = \{k_0(x) \equiv 1,\; k_1(x), \ldots, k_T(x)\}.$

According to the analysis reported in Appendix A.2, and by adopting the identity as Gram matrix, the resulting kernel is given by

$K(x, x') = K_{G_0, I}(x, x') = 1 + \sum_{t=1}^{T} k_t(x)\, k_t(x').$   (7)

4. TR–SVM and two-layer networks

TR–SVM are equivalent to two-layer networks. In this section we prove this statement and we provide a bound for the generalization error. We conclude the section by showing an application of TR–SVM to feature ranking.

4.1. Equivalence of TR–SVM and two-layer networks

A TR–SVM is a non-linear SVM with the kernel given by Eq. (7). By substituting Eq. (7) into Eq. (1) one obtains

$f_C(x) = \sum_{i=1}^{N} c_i K_{x_i}(x) = \sum_{i=1}^{N} c_i \left( \sum_{t=1}^{T} k_t(x_i)\, k_t(x) + 1 \right) = \sum_{t=1}^{T} \left( \sum_{i=1}^{N} c_i k_t(x_i) \right) k_t(x) + \sum_{i=1}^{N} c_i = \sum_{t=1}^{T} a_t k_t(x) + b$   (8)

with

$a_t = \sum_{i=1}^{N} \alpha_i y_i k_t(x_i), \qquad t = 1, \ldots, T.$   (9)

This result shows that a TR–SVM can also be interpreted as a two-layer network with T hidden units. The parameters of the hidden units $k_t(x)$ are fixed a priori (not by learning) by the training data. This allows the optimization of the network weights $a_t$ in an alternative way: an optimal strategy consists in minimizing an upper bound on the generalization error for two-layer networks that depends only on the weights (Bartlett, 1998). It is worth noticing that this strategy leads to the same quadratic programming problem as in Eq. (2). Details are given in Merler (2002).

Example 1. Let us consider the following problem in $\mathbb{R}$: $0 < x_1 < x_2 < \cdots < x_M < 1$ belonging to class 1; $x_{M+1} < x_{M+2} < \cdots < x_L < 0$ and $1 < x_{L+1} < x_{L+2} < \cdots < x_N$ belonging to class $-1$. It is easy to see that $F(x) = k_{1,L}(x) + k_{M,L+1}(x) - 1$ (where $k_{i,j}$ is the submodel built on the points $x_i$ and $x_j$) allows the data to be separated. Without the introduction of the offset b (which depends on the introduction of the constant function 1 in the set $G_0$ in Eq. (6)) this simple data set would not be separable by any linear combination of terminated ramp functions.

4.2. Generalization error bound

For $\gamma > 0$, we define

$\epsilon_D^\gamma = \frac{1}{N} \sum_{i=1}^{N} I[\,y_i F(x_i) < \gamma\,],$

where $I[P]$ returns 1 if the predicate P is true and 0 otherwise. $\epsilon_D^\gamma$ is the fraction of training examples classified with margin less than $\gamma$; $\epsilon_D^0$ is the empirical error. For two-layer networks, the following bound on the generalization error holds (see part 1 of Theorem 28 in Bartlett (1998)):

Theorem 1. Let $\sigma: \mathbb{R} \to [-1, 1]$ be a non-decreasing function. Define the class $\mathcal{K}$ of functions on $\mathbb{R}^d$ as $\mathcal{K} = \{x \mapsto \sigma(\langle w, x \rangle + w_0) : w \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$, and define

$\mathcal{F} = \left\{ \sum_{t=1}^{T} a_t k_t : T \in \mathbb{N},\; k_t \in \mathcal{K},\; \sum_{t=1}^{T} |a_t| \le A \right\}$

for $A \ge 1$. Suppose $0 < \delta < 1/2$ and $0 < \gamma \le 1$. Then with probability at least $1 - \delta$ every F in $\mathcal{F}$ has

$P[\,y F(x) \le 0\,] \le \epsilon_D^\gamma + \sqrt{\frac{c}{N}\left(\frac{A^2 d}{\gamma^2} \log\frac{A}{\gamma}\, \log N + \log(1/\delta)\right)}$

for some universal constant c.

Theorem 1 states that the generalization performance depends on the size of the weights $a_t$ rather than on their number. It is important since it shows that two-layer networks (and our TR–SVM as well) can perform well even though their VC dimension (see Vapnik & Chervonenkis, 1968), which depends on the number of weights, is infinite. Theoretical support for this claim also comes straightforwardly from Theorems 4 and 5 in Burges (1998), stating that the VC dimension of RBF–SVM is infinite and that the VC dimension of polynomial SVM grows rapidly to infinity with the polynomial degree: nevertheless, as underlined at the beginning of Section 6 in the same paper, in spite of this, SVM usually exhibit good generalization performances. On the basis of Theorem 1, an analogous result for TR–SVM can be derived:

Corollary 1. Suppose $0 < \delta < 1/2$ and $0 < \gamma \le 1$ and let $n_S$ be the number of Support Vectors. Then with probability at least $1 - \delta$ every TR–SVM has

$P[\,y F(x) \le 0\,] \le \epsilon_D^\gamma + \sqrt{\frac{c}{N}\left(\frac{T^2 C^2 n_S^2 d}{\gamma^2} \log\frac{T C n_S}{\gamma}\, \log N + \log(1/\delta)\right)}$

for some universal constant c.

Proof. The thesis follows straightforwardly from Theorem 1 by considering that

$\sum_{t=1}^{T} |a_t| = \sum_{t=1}^{T} \left| \sum_{i \in I_S} \alpha_i y_i k_t(x_i) \right| \le T C n_S,$

since $\alpha_i \le C$ and $|k_t(x_i)| \le 1$. □

Fig. 4. Training (dashed lines) and test (solid lines) set errors estimated by 10-fold cross-validation versus the regularization parameter C (in logarithmic scale) for (a) the sin data set, (b) the gaussians data set.
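For concreteness, the two ingredients of the bound, the margin error $\epsilon_D^\gamma$ and the weight sum $\sum_t |a_t| \le T C n_S$, can be computed directly from a fitted model. The following is a hedged numpy sketch with array names of our choosing (`F_train` holds the values $F(x_i)$ on the training set, `K_feat[i, t]` the submodel outputs $k_t(x_i)$):

```python
import numpy as np

def margin_error(y, F_train, gamma):
    # epsilon_D^gamma: fraction of training examples with margin y_i F(x_i) < gamma
    return float(np.mean(y * F_train < gamma))

def weight_sum(alpha, y, K_feat):
    """sum_t |a_t| with a_t = sum_i alpha_i y_i k_t(x_i) (Eq. (9), restricted to the
    Support Vectors); it is bounded by T * C * n_S since alpha_i <= C and |k_t(x_i)| <= 1."""
    S = alpha > 0                        # Support Vectors
    a = (alpha[S] * y[S]) @ K_feat[S]    # one weight a_t per submodel
    return float(np.abs(a).sum())
```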



Corollary 1 states that the generalization error is bounded by the sum of two terms. The former is the proportion of training data with margin less than $\gamma$. The latter increases with C and, as for standard SVM, increases with the number of Support Vectors; we recall that a Support Vector is a training example with non-zero Lagrange coefficient $\alpha_i$ in Eq. (2). It also increases with the number of terminated ramp functions T. However, after having seen the data, it is always possible to choose a suitable value of C (scaling with $1/T$) making the bound independent of T.

We conclude this paragraph by analyzing the effect of the regularization parameter C on the solution. Fig. 4 shows the training and test set errors (estimated by 10-fold cross-validation) of TR–SVM obtained by varying C on the sin and gaussians data sets (described in Appendix B). The training errors are decreasing functions of C in both cases and go to zero for large enough values of C. The test set error on the sin data set is also a decreasing function of C. This behavior is not surprising, since we are considering a noiseless data set. The test set error on the gaussians data set instead shows the classical pattern of regularizing methods, in agreement with the analysis reported in this section.

4.3. Support vectors and sparseness of the solution

If we look at the proposed model as a two-layer network, we may ask whether or not the solution is sparse, i.e., how many weights $a_t$ in Eq. (8) are different from zero? At the same time, we may be interested in the sparseness of the SVM in Eq. (1), i.e., how many Lagrange multipliers $\alpha_i$ are strictly positive?

Let us start with the second question: this is a key question, since the number of Support Vectors (the training examples with non-zero Lagrange multiplier) is related to the generalization error (see Section 4.2), as for standard SVM (Vapnik, 2000). Currently, we cannot bound the number of Support Vectors. However, in Section 5 we will show on both synthetic and benchmark data sets that the solution is, in general, very sparse.

Going back to the first question, it is worth noticing that the weights $a_t$ depend only on the Support Vectors. In fact, Eq. (9) can be rewritten by considering only the Support Vectors:

$a_t = \sum_{i \in I_S} \alpha_i y_i k_t(x_i), \qquad t = 1, \ldots, T,$   (10)

where $I_S$ is the set of indexes of the non-zero Lagrange multipliers. In general, there is no reason why Eq. (10) should be zero, and thus the two-layer network (8) is not sparse, even though most of the weights $a_t$ are very close to zero (see Fig. 5) and contribute little to the solution: the corresponding hidden units could be left out of the model without degrading the accuracy too much. This claim is also supported by the results reported in Section 6.1. In fact, we did not implement an ad hoc strategy to obtain a sparse solution since, in this regard, the title of Bartlett's paper (Bartlett, 1998) is explicit: the size of the weights is more important than the size of the network. In the following, we will measure the sparseness of the TR–SVM exclusively in terms of the number of Support Vectors.

Fig. 5. Histogram of the weights $a_t$ for the ringnorm data set (see Section 5.2). A similar shape was observed in all the considered data sets.

The role of the Support Vectors in constructing the global model can be clarified by considering again Eq. (10): in practice, the weight of each submodel $k_t(x)$ is determined by the Support Vectors. In particular, higher positive values of the weights correspond to submodels that correctly classify a high fraction of the Support Vectors ($y_i k_t(x_i) > 0$). The rationale is that only a subset of (relevant) examples, namely the Support Vectors, is used to assess the relevance of each submodel in constructing the global model.

4.4. TR–SVM for feature ranking

The described algorithm can also provide a strategy to rank the feature importance. The absolute value of the weights $|a_t|$ in Eq. (8) evaluates the importance of each terminated ramp function $k_t(x)$ in contributing to the global model. Thus, as in the linear SVM case, the vector $w_t$ in Eq. (5) can be interpreted as a feature importance indicator for the terminated ramp function $k_t$. Finally, for each feature j, a global indicator can be defined as follows:

$h_j = \sum_{t=1}^{T} |a_t|\, \frac{w_t^{(j)}}{\|w_t\|}, \qquad j = 1, \ldots, d,$   (11)

where $w_t^{(j)}$ is the jth component of the vector $w_t$. As for standard Recursive Feature Elimination (RFE for short) (Guyon, Weston, Barnhill, & Vapnik, 2002), using $w_t^{(j)}$ to rank features can fail dramatically when the features have significantly different ranges. This issue can be dealt with by preprocessing the data through a suitable standardization technique. Alternatively, it may be worthwhile to include information about the mean and variance of feature j in Eq. (11).
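The ranking indicator of Eq. (11) is a one-line computation once the weights $a_t$ and the hyperplane vectors $w_t$ are available. A sketch (ours), where `W` stacks the vectors $w_t$ row-wise and the features are then ranked by the magnitude of the indicator (our reading):

```python
import numpy as np

def feature_ranking(a, W):
    # Eq. (11): h_j = sum_t |a_t| * w_t^(j) / ||w_t||, one score per feature j
    norms = np.linalg.norm(W, axis=1, keepdims=True)   # ||w_t||, shape (T, 1)
    h = np.abs(a) @ (W / norms)
    return h
```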

5. Experiments

To illustrate the qualitative properties of TR–SVM, we report on a series of experiments performed on three synthetic data sets. The data sets employed are described in Appendix B and the results are reported in Section 5.1. Another series of large-scale experiments was conducted on benchmark data sets in order to compare the performance of TR–SVM and SVM with Gaussian kernel (RBF–SVM). Data sets and results are described in Section 5.2.

5.1. Synthetic data sets

We used the spiral data set in order to test the capability of TR–SVM to separate the points of a highly nonlinear problem. On this data set, a RBF–SVM is recognized to work properly. The separation surfaces obtained for this task and for a modified version of the spiral data set are shown in Fig. 6. Notice that the algorithm is able to separate the two classes in both cases. From an applicative point of view, one of the major advantages of TR–SVM is that they do not require the tuning of any kernel parameter to build non-linear separation surfaces.

In Fig. 7 the results on the sin data set are reported. Despite the fact that the separation surface is fairly varied, the two classes are not overlapping. In this case a hard strategy ($C \to +\infty$) is the best choice for both TR–SVM and RBF–SVM (see Fig. 4). The separation surfaces obtained by applying the two methods are shown in Fig. 7(a) and (b).


Fig. 6. Separation surfaces obtained with TR–SVM on: (a) the original spiral data set, (b) a slightly modified version of the spiral data set obtained by injecting some noise in order to make the structure of the training data less symmetric. The squares around the training examples indicate the Support Vectors.

Fig. 7. Results on the sin data set: (a) Separation surface obtained with a TR–SVM (solid line) and exact separation surface (dashed line). (b) Separation surface obtained with a RBF–SVM. (c) Support Vectors, indicated by squares around the training points, obtained with a TR–SVM. (d) Support Vectors obtained with a RBF–SVM.

We can observe that both methods reconstruct the sinusoidal part of the separation surface, consistent with the relatively low number of training points. We used 10-fold cross-validation to estimate the kernel parameter of the RBF–SVM.


The resulting kernel parameter, however, makes the RBF–SVM unable to properly reconstruct the linear part of the problem, while the data-dependent kernels provided by the TR–SVM seem to be flexible enough to reconstruct this portion of the separation surface properly. This results in a shortcoming of the RBF–SVM with respect to the TR–SVM in terms of accuracy on the test set (91.8% for the TR–SVM, 90.4% for the RBF–SVM). The Support Vectors found with the two methods are shown in Fig. 7(c) and (d). It is important to notice that the Support Vectors found by the TR–SVM are always very close to the separation surface, while some of those found by the RBF–SVM are spread far from it. This is due to the fact that the RBF–SVM decision function tends to vanish far from the kernel centers, where the offset becomes the dominant term. As a consequence, some kernels also have to be placed far from the separation surface. Apparently, the TR–SVM does not exhibit this shortcoming. The number of Support Vectors for the TR–SVM and the RBF–SVM is 45 and 57 respectively.

Fig. 8. Results for the gaussians data set. (a) Separation surface obtained with a TR–SVM (solid line) and Bayes separation surface (dashed line). (b) Separation surface obtained with a RBF–SVM. (c) Support Vectors, indicated by squares around the training points, obtained with a TR–SVM. (d) Support Vectors obtained with a RBF–SVM.

In Fig. 8 the results on the gaussians data set are reported. This is a noisy data set, and a soft strategy was therefore applied in this case. The regularization parameter was chosen by 10-fold cross-validation for both TR–SVM and RBF–SVM (for the latter, 10-fold cross-validation was also used to estimate the kernel width). The separation surfaces obtained by applying the two methods are shown in Fig. 8(a) and (b). In this case, no significant differences are observable in the reconstruction of the separation surfaces. However, the TR–SVM turns out to be slightly better than the RBF–SVM in terms of accuracy on the test set (95.5% for the TR–SVM, 94.3% for the RBF–SVM). This fact can be explained by observing that the RBF–SVM is not able to correctly classify a portion of the domain (the top-right corner of Fig. 8(b)) in which the classification is determined by the sign of the offset, since the kernels centered on the Support Vectors (see Fig. 8(c) and (d)) vanish there. The number of Support Vectors for the TR–SVM and the RBF–SVM is 81 and 94 respectively.

5.2. Benchmark data sets

In order to compare the performance of TR–SVM and RBF–SVM we used 10 benchmark data sets: breast cancer, diabetes, german, heart, thyroid, titanic, twonorm, ringnorm and waveform from the UCI repository, and the banana data set.³ Each data set was partitioned 10 times into training and test sets.

³ Data sets, including training/test set partitions, are available at http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.
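A sketch of the evaluation protocol behind Table 1, as we read it (fit on each of the ten given partitions and report the mean and standard deviation of the test error); function names are ours:

```python
import numpy as np

def benchmark(fit, predict, partitions):
    """partitions: iterable of (X_tr, y_tr, X_te, y_te) splits shipped with the data sets."""
    errors = []
    for X_tr, y_tr, X_te, y_te in partitions:
        model = fit(X_tr, y_tr)
        errors.append(100.0 * np.mean(predict(model, X_te) != y_te))
    return np.mean(errors), np.std(errors)   # reported as E +/- std in Table 1
```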


Table 1
Generalization errors E and number of Support Vectors n_S (with standard deviations) of RBF–SVM and TR–SVM on the benchmark data sets

Data set        RBF–SVM E      RBF–SVM n_S    TR–SVM E       TR–SVM n_S
diabetes        23.3 ± 1.5     245 ± 6        23.2 ± 0.6     251 ± 7
breast cancer   28.2 ± 4.2     113 ± 6        29.2 ± 4.2     112 ± 4
heart           17.3 ± 3.2     81 ± 6         18.0 ± 2.7     64 ± 7
banana          10.2 ± 0.04    82 ± 5         10.4 ± 0.04    121 ± 8
german          23.0 ± 1.8     411 ± 12       23.9 ± 1.5     354 ± 18
twonorm         2.7 ± 0.2      76 ± 4         2.4 ± 0.2      111 ± 5
titanic         22.5 ± 1.1     96 ± 15        21.7 ± 0.4     88 ± 10
thyroid         5.2 ± 2.9      31 ± 4         4.2 ± 2.8      13 ± 2
ringnorm        1.6 ± 0.1      151 ± 9        3.1 ± 0.5      77 ± 4
waveform        10.2 ± 0.4     159 ± 7        10.8 ± 0.5     89 ± 11

The original table also contains p-value, AdaBoost, N, N_ts and d columns, of which only the p-value entries 0.79, 0.40, 0.39, 0.21, 0.21, 0.08, 0.05, 0.008 and 0.001 are legible in this copy.