PLS_Cluster: a novel technique for cluster analysis


Chemometrics and Intelligent Laboratory Systems 70 (2004) 99–112 www.elsevier.com/locate/chemolab

A.S. Barros a, D.N. Rutledge b,*

a Departamento de Química, Universidade de Aveiro, 3810-193 Aveiro, Portugal
b Institut National Agronomique Paris-Grignon, UMR INAPG-INRA Ingénierie Analytique pour la Qualité des Aliments (IAQA), 16 rue Claude Bernard, 75005 Paris, France

Received 18 March 2003; accepted 26 August 2003

Abstract

A novel cluster analysis technique (PLS_Cluster) is proposed in this work. The method is based on the well-known PLS regression procedure, using a "self-organizing" mechanism to accomplish a clustering of the data. The implementation of this technique is straightforward and it provides several diagnostic insights into the reasons for clustering, by studying the obtained dendrogram and by examining various statistical properties associated with the nodes. At each node it is possible to recover all the regression vectors common to PLS (loadings X, loadings W, and B-coefficients) along with the scores X and scores y. This work presents the application of DiPLS_Cluster (a particular case of PLS_Cluster) to the analysis of several datasets and demonstrates the potential of this novel technique.
© 2004 Elsevier B.V. All rights reserved.

Keywords: PLS_Cluster; Cluster analysis; PLS

1. Introduction

Cluster analysis (an exploratory technique) is one of the major fields in chemometrics for pattern recognition, unsupervised learning and data compression [1], and is very helpful for understanding the relationships among objects or samples, especially in large, complex datasets. It is also used to reveal natural patterns in datasets (knowledge discovery). The main goal of cluster analysis is to extract some sort of organizational entities from datasets. It is a technique used for combining observations into groups or clusters such that: (i) each group is homogeneous, i.e. the objects within each group are similar to each other; (ii) each group is different from the other groups [2]. The objective of cluster analysis is very similar to that of factor analysis, where one wishes to find similarities among objects/variables. A major cornerstone in cluster analysis is the identification or selection of a measure of similarity, which will "guide" the grouping. The similarities (or homogeneities) vary from analysis to analysis, and hence depend on the objective and nature of the study [2].

* Corresponding author. Tel.: +33-1-44-08-16-48; fax: +33-1-44-08-16-53. E-mail addresses: [email protected] (A.S. Barros), [email protected] (D.N. Rutledge).

0169-7439/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2003.08.002

Cluster analysis methods can be classified as [2,3]: (1) visual methods, (2) hierarchical clustering (agglomerative and divisive), (3) non-hierarchical clustering. Typically, in hierarchical methods, a dendrogram is used to give a graphical representation of distances or similarities among objects and/or clusters. The use of this visual tool allows a rapid visualization of the clustering of objects, and facilitates the analysis of the clustering of nodes along the similarity axis, starting from a zero similarity up to a maximum similarity value. In hierarchical clustering, several approaches exist for calculating the similarities between objects and clusters [3], such as: the centroid method, the nearest-neighbour or single-linkage method, the farthest-neighbour or complete-linkage method, average-linkage and Ward's method. As opposed to hierarchical clustering, non-hierarchical clustering is based on the segmentation of the data into k partitions or groups (each one representing a cluster). With this method, the number of clusters must be known a priori. One of the difficulties in cluster analysis is to decide whether a new object is similar to a collection of known objects, and this difficulty increases with the variability of the objects' properties [4].
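As an illustration of the agglomerative approach, the nearest-neighbour (single-linkage) criterion mentioned above can be sketched as follows. This is a naive didactic sketch (the function name `single_linkage` is ours, and optimised library implementations would normally be used), not the method proposed in this paper:

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with the single-linkage criterion:
    repeatedly merge the two clusters with the smallest inter-point distance."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters
```

Swapping the `min` for `max` or a mean gives the complete-linkage and average-linkage criteria, respectively.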


Many clustering methods are defined as optimization problems, where a criterion is evaluated for all possible partitions of a given dataset into g clusters, and the partition corresponding to the optimal criterion value is retained. The problem of finding a given "optimal" partition from all possible combinations of n patterns into g clusters is in many cases unsolved, due to the huge solution space that must be searched. Several approaches may be used to minimise the computational overhead of finding near-optimal solutions, using for instance genetic algorithms [5] and simulated annealing [6]. In many cases, however, it is interesting not only to extract relevant clusters but also to understand the reasons for their formation, i.e. to relate the cluster formation directly to the variables or features responsible for a given aggregation. Since the clustering procedure may be viewed as an optimization procedure (maximization or minimization of a given criterion), one possible path would be to use the well-known PLS procedure in the optimization step (maximization of the covariance between an X data matrix and a y vector). However, the solutions normally given by PLS are not the global optima, but stable, near-optimal solutions. On the other hand, it is a widely used method, with well-known and understood properties [7-15]. The present work describes this new methodology of data clustering based on the PLS algorithm. The aim was to develop a clustering method not only to group samples based on the inner variability and/or relationships among samples and/or variables (features), but also to have information on the reasons for the groupings. The proposed method is based on a self-organising mechanism that uses the PLS1 procedure to achieve a hierarchical segregation of the samples based on the variability (or relationships) present in the X matrix, progressively building up a feature vector (y) which characterises the relationships between the objects of the X matrix.
PLS properties/entities, such as the regression vectors, loadings X (p), loading weights W (w) and B coefficients (b), can be used to characterise the segregation (e.g. the chemical relevance). This new procedure can be used in two different approaches: (a) Dichotomic PLS_Cluster (DiPLS_Cluster) and (b) Generalized PLS_Cluster (GenPLS_Cluster). The present paper concerns the development and application of DiPLS_Cluster.

2. Mathematical

2.1. Notations

Matrices are shown in bold uppercase (X), column vectors in bold lowercase (x) and row vectors as xT (transposed). In algorithm descriptions, user-defined functions are in italic, e.g. function(), and high-level semantic terms are in bold italics, e.g. if () do.

2.2. PLS as the kernel for DiPLS_Cluster

The PLS regression technique is widely used in the field of chemometrics due to its robust characteristics compared to other regression methods. The fact that this regression procedure, contrary to many others, maximises the covariance between X and y means that it could be used as a "self-organising scheme" to provide a clustering mechanism. Normally, the PLS regression procedure is applied to a known X matrix and y vector for the determination of a b vector, which describes the relationships between the two [7]. In matrix notation:

y = Xb + e    (1)

The b vector calculated in this calibration step can then be used to predict the y vector values from the X data features of new samples. In the context of clustering, one only has the X matrix and one wishes to determine the relationships among the samples. We propose to use the PLS procedure to iteratively generate an optimal y vector describing the relationships that exist among the objects (this y vector is defined in this work as the feature vector).

2.3. DiPLS_Cluster general description

As mentioned in the Introduction, DiPLS_Cluster is a particular case of GenPLS_Cluster, where the hierarchical segmentation of the X matrix is based on two groups, and where only one Latent Variable (LV) is used to estimate the y vector used at each segmentation of the X matrix objects. DiPLS_Cluster starts with a prediction vector of random "0"s (zeros) and "1"s (ones) in the y vector, i.e. two random groups. Then, iteratively, the y vector values are replaced by values predicted by PLS using 1 LV. The y vector progressively evolves to a series of values that reflect the internal structure of the X matrix. Once a stable y vector is obtained, it is used to segment the X matrix into two subgroups, one linked to the "0"s and the other linked to the "1"s in the y vector. This procedure is repeated recursively for each subgroup, resulting in a segmentation of the X matrix as a binary tree (dendrogram). Table 1 shows the algorithm of this procedure.

2.4. DiPLS_Cluster properties

When the procedure stops, one can generate a dendrogram where it is possible to visualise different types of information. The dendrogram is composed of several nodes, each one representing a binary segmentation point, i.e. where a PLS with one Latent Variable and two groups was applied. Hence, at each node, it is possible to recover PLS statistics such as the loadings X (p), loading weights (w), scores (for X and y) and the b vector. This allows one to fully

Table 1
High-level description of the proposed DiPLS_Cluster procedure

(1)  y = rand() % 2            # random group assignment, 0 or 1
(2)  b = pls1(X, y, 1)         # do a PLS1 with 1 Latent Variable and recover the b vector
(3)  y* = Xb + b0              # new calculated y* vector
(4)  membership(y*)            # y* values membership assignment, i.e. (0 or 1)
(5)  c = conv(y - y*)          # test for convergence
(6)  if (c > criterion) do     # if no convergence obtained...
(7)    y = y*; goto 2          # ...replace y by y* and iterate
     else do
(8)    [X1 X2] = split(X | y)  # if convergence, split X according to the y grouping
(9)    X = X1; goto 1          # recursive call to step 1; iterate until X1 has 2 rows
(10)   X = X2; goto 1          # recursive call to step 1; iterate until X2 has 2 rows
(11) stop

characterise each node, that is, to obtain a full characterisation of the variability in X that is responsible for the maximization of the discrimination between the two corresponding subgroups. Since the method is recursive, the final result (dendrogram) can give an in-depth understanding of the X information in global terms (number of clusters, their properties, etc.), and at the same time allows local cluster analyses, i.e. within subsets of the X matrix.

Another important characteristic of these dendrograms is the branches that link the nodes. The branch length should be inversely proportional to the number of iterations needed at each node to converge, i.e. to find a stable group attribution. If very few iterations are required, the samples treated at that node are very different and one can easily separate them into two groups. Therefore, a longer branch length should be plotted. Conversely, if the distinction between the samples is not obvious, the method requires more iterations to separate the two groups, and hence the branch length should be shorter. The branch length therefore gives a measure of the difference between the objects in the X matrix.

A final consideration is the calculation of the subgroup membership based on the calculated y vector. In step 1 of the DiPLS_Cluster algorithm (Table 1), one sees a group assignment based on two groups, an integer (Q space) value: 0 or 1. However, in step 3, a new estimated y* vector is calculated with values that are in a non-integer form (R space), i.e. real values. The integer form is calculated by means of a membership function. This membership function may have several different forms. In this first presentation of the PLS_Cluster procedure, just one function is used to find the threshold that separates the two groups: a function that rounds off each y vector value to the nearest integer (parametric approach).
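The two kinds of membership function discussed in this section, the round-to-nearest parametric threshold and a sorted largest-gap alternative, can be sketched as follows (function names are illustrative, not from the paper):

```python
import numpy as np

def membership_round(y_star):
    """Parametric membership: round each predicted y value to the nearest
    integer and clip to the {0, 1} group labels."""
    return np.clip(np.round(y_star), 0, 1)

def membership_gap(y_star):
    """Non-parametric membership: sort the predicted y values and cut the two
    groups at the largest gap between consecutive sorted values."""
    order = np.argsort(y_star)
    gaps = np.diff(y_star[order])
    cut = np.argmax(gaps)            # position of the largest consecutive difference
    labels = np.zeros(y_star.size)
    labels[order[cut + 1:]] = 1      # samples above the gap form group "1"
    return labels
```

The rounding variant assumes the predictions stay near the 0/1 coding, whereas the gap variant only assumes the two groups are separated in the sorted y* values.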
Other types of membership functions could be used, for instance, a function that sorts the y vector values and then finds the largest difference between two consecutive sorted y vector values (non-parametric). Whatever the type of function used, the y vector values need to be converted to integers to have a clear group attribution (0 or 1) for the dendrogram plot. However, the non-integer y vector values


may be very useful for a more detailed analysis of the relationships among samples, as they give a "continuous measure" of those relationships, as opposed to the "discrete measure" given by the integer values.

2.5. DiPLS_Cluster dendrogram plot procedure

DiPLS_Cluster outputs can be analysed in many ways, such as the study of the "discrete" and "continuous" classifications at each node, the regression vector profiles at each node, etc. However, one helpful way to obtain an easier interpretation of DiPLS_Cluster results is to transform the "discrete" classification found at each node into a dendrogram. This is a logical step, as the internal structure of the algorithm is recursive, and hence one can build a classification tree. Each sample is classified as "0" (zero) or "1" (one) at each node (g) and level (h) by means of a membership function, which may have many different transformation rules. Let C(n, k) be the classification vector and A(n, k) the iteration vector (containing the number of iterations necessary to achieve the segmentation) at a given level k = {1, ..., h}, where n is the number of samples at that node. Let m1 be the number of "1"s and m0 be the number of "0"s at a given level k. Then let Daux(n, k) be an auxiliary matrix related to the C matrix as:

Daux(i, k) = m1    if C(i, k) = 1
Daux(i, k) = -m0   if C(i, k) = 0
where i = {1, ..., n} objects and k = {1, ..., h} levels    (2)

Given the Daux matrix, the dendrogram (D) coordinates are calculated as follows:

D(i, k) = D(i, k - 1) + (1/(k - 1)) * Daux(i, k)
where D(i, 0) = 0    (3)

so that samples classified as "1" move up the dendrogram axis and samples classified as "0" move down, the displacement being damped by 1/(k - 1) as the level k increases.
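The coordinate computation of Eqs. (2) and (3) can be sketched as below. Note that the printed equations are ambiguous for the first level (the factor 1/(k - 1) is undefined at k = 1), so the unit scale used there, and the per-level counting of m1 and m0, are assumptions of this sketch:

```python
import numpy as np

def dendrogram_coords(C):
    """Uniform dendrogram y-coordinates from a 0/1 classification matrix C(n, h).
    At level k, samples labelled 1 move up by m1/(k-1) and samples labelled 0
    move down by m0/(k-1); the first level uses a unit scale (assumption)."""
    n, h = C.shape
    D = np.zeros(n)                              # D(i, 0) = 0
    for k in range(1, h + 1):
        col = C[:, k - 1]
        m1 = int(np.sum(col == 1))               # number of "1"s at this level
        m0 = n - m1                              # number of "0"s at this level
        scale = 1.0 / (k - 1) if k > 1 else 1.0  # 1/(k-1) damping, Eq. (3)
        D = D + scale * np.where(col == 1, m1, -m0)   # signed Daux, Eq. (2)
    return D
```

A full implementation would track m1 and m0 per node (within each subgroup) rather than globally, and would then rescale each branch by the inverse of the iteration counts in A to obtain the inversely proportional dendrogram.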

The D matrix represents a uniform dendrogram where all the branches have the same length, i.e. the number of iterations is not taken into account. To obtain an inversely proportional dendrogram matrix, the A matrix (iterations matrix) is used to adjust, proportionally, each branch of the D matrix according to the inverse of the number of iterations required at the given node. The term 1/(k - 1) is necessary to avoid branch overlap by flipping the clusters. However, since this operation is applied on the y dendrogram axis, no real


Fig. 1. Near infrared spectra (4000 – 6200 nm) of the complete Pharmaceutical dataset. Data shown as transformed by Standard Normal Variates (SNV).

significance can be attributed to this axis, in terms of between cluster distance.
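The self-organising loop of Table 1 (steps 1-7) can be sketched as below. The names `pls1_one_lv` and `dipls_node` are ours, the random start of step 1 is passed in explicitly as y0, and the median-cut fallback for a degenerate rounding is an assumption not stated in the paper:

```python
import numpy as np

def pls1_one_lv(X, y):
    """PLS1 restricted to a single Latent Variable; returns the b vector and intercept b0."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    w = Xc.T @ yc                    # loading weights: direction maximising cov(Xw, y)
    w = w / np.linalg.norm(w)
    t = Xc @ w                       # X scores
    q = (yc @ t) / (t @ t)           # y loading
    b = w * q                        # b coefficients for 1 LV
    return b, ym - xm @ b

def dipls_node(X, y0, max_iter=100):
    """One DiPLS_Cluster node (Table 1, steps 1-7): the 0/1 assignment vector is
    repeatedly replaced by its own rounded PLS1 prediction until it stabilises.
    Returns (labels, number of iterations, b vector)."""
    y = np.asarray(y0, dtype=float)
    for n_iter in range(1, max_iter + 1):
        b, b0 = pls1_one_lv(X, y)                    # step 2
        y_star = X @ b + b0                          # step 3
        labels = np.clip(np.round(y_star), 0, 1)     # step 4: round-off membership
        if np.unique(labels).size == 1:              # degenerate split: median cut (assumption)
            labels = (y_star > np.median(y_star)).astype(float)
        if np.array_equal(labels, y):                # step 5: stable, converged
            return labels, n_iter, b
        y = labels                                   # steps 6-7: iterate
    return y, max_iter, b
```

Once a node converges, X is split by the labels (step 8) and `dipls_node` is applied recursively to each subgroup until only two rows remain (steps 9-11), yielding the binary tree.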

3. Experimental

3.1. Dataset 1: simulated dataset

Three pre-defined groups were created: Group 1 with values of 2, Group 2 with values of 5 and Group 3 with values of 6. An X(60, 16) matrix was built with objects 1 to 20 linked to Group 1, 21 to 40 linked to Group 2 and 41 to 60 linked to Group 3. Along each row (object) the corresponding value (2, 5 or 6) is repeated 16 times (variables). In order to have a stable application of the PLS algorithm within DiPLS_Cluster, 0.001% Gaussian noise was added to the matrix.

3.2. Dataset 2: Near infrared Pharmaceutical dataset

Near infrared reflection spectra of pharmaceutical pills were measured between 4000 and 6200 nm (step 3.86 nm)

Fig. 2. Visible/Near infrared spectra (384 – 2000 nm) of the complete Apple dataset. Data shown as transformed by Standard Normal Variates (SNV).


Fig. 3. Near infrared spectra (1100 – 2500 nm) of the complete aqueous fructose solutions dataset. Data shown as transformed by Standard Normal Variates (SNV).

[Vector 33, Bruker]. The spectra were acquired from the top face of six pellets (T), from the bottom face of six pellets (B) and from equal proportions of top and bottom faces (M) (three face groups). Pills containing three concentration levels (three concentration groups) of active principal were used (Fig. 1).

3.3. Dataset 3: Visible/Near infrared Apple dataset

This dataset was based on experiments organized in Norwich (UK) from 24/02/1999 to 25/02/1999 within the framework of the European Concerted Action "ASTEQ" [16]. The purpose of this experiment was to evaluate the feasibility of detecting differences among three degrees of maturation (fresh, medium and mealy) for one apple cultivar (Jonagold) using different instrumental techniques. Visible/Near infrared spectra of apples were measured between 384 and 2000 nm (step 0.5 nm) using a VIS-NIR spectrophotometer (Rees Instruments, OSA, Si-InGaAs detector). The number of data points was reduced by a factor of 15 by box-averaging, to give a final step of 7.5 nm. Two repetitions, with four fruits per repetition (repetition 1: 1-4, repetition 2: 5-8), and two different apple sides were measured (Face 1: "red" and Face 2: "green" sides) (Fig. 2).

3.4. Dataset 4: NIR fructose aqueous solutions

Aqueous solutions of fructose at concentrations of 5%, 10%, 20%, 30%, 40%, 50% and 60% (mass/volume) were prepared in triplicate. Near infrared (NIR) transmission spectra were acquired for each sample (three repetitions) between 1100 and 2500 nm, with a step of 4 nm, using a Bran + Luebbe

Infralyzer 500, with the sample temperature adjusted to 25 ± 1 °C [17] (Fig. 3).
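All spectra in Figs. 1-3 are shown after the Standard Normal Variate (SNV) transformation mentioned in the figure captions; a minimal SNV sketch:

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre each spectrum (row) and scale it by its
    own standard deviation, reducing multiplicative scatter effects."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std
```

After the transform every spectrum has zero mean and unit standard deviation, so samples are compared on shape rather than overall intensity.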

4. Results and discussion

This work starts with the application of DiPLS_Cluster to the simulated dataset with pre-defined group attributions. Then, real cases are analysed. The first of these shows the application of this method to a dataset that has two major classification factors, the one of interest being the most discriminant. The second case also has two possible classifications, but in this case it will be shown how the orthogonality property of DiPLS_Cluster can be used to enhance the second, more interesting factor. The third real case concerns the analysis of a dataset that has only one classification factor, but will be used to highlight the interpretation of the b vectors of the nodes.

4.1. Dataset 1: simulated dataset

The first application of the DiPLS_Cluster procedure was on a very simple dataset containing three pre-defined groups, to present the general properties of the method. Fig. 4a (uniform) and b (inversely proportional) represent the dendrogram plots obtained by DiPLS_Cluster. As one can clearly see, the method has found the three pre-defined groups. At one level, an early separation between Group 1 and Groups 2 + 3 is observed. At another level, Group 2 is then clearly separated from Group 3. Further separations are observed within these three major groups as a result of the noise added. The dendrogram also shows that Group 1 is located on one side of the y axis (negative), far from Groups 2 and 3, which are located on the same side of the


Fig. 4. (a) DiPLS_Cluster uniform dendrogram plot of the simulated dataset. (b) DiPLS_Cluster inversely proportional dendrogram plot of the simulated data.

y axis (positive) and hence are more ‘‘similar’’ when compared to Group 1. Thus, the relative ‘‘distance’’ between groups is preserved (since the difference between ‘‘5’’ and ‘‘6’’ is smaller than between them and ‘‘2’’). Looking closer at the inversely proportional dendrogram plot (Fig. 4b), one sees that the first segmentation obtained at node 1 rapidly found (two iterations) two major groups, one corresponding to Group 1 samples and the other to Groups 2 + 3 samples. This node has two branches, one linking it to node 21, which separates the samples within Group 1, and the other branch linking it to node 2, which separates Group 2 from Group 3 samples. These two

branches have different lengths, the latter being longer than the former. This is clear, since the node 1 → node 21 branch, which relates to the segmentation within Group 1 samples (mostly due to random variations), took 13 iterations to accomplish. On the other hand, the node 1 → node 2 branch is much longer, as the dissimilarity between Group 2 and Group 3 was detected much more quickly (two iterations). It is also important to highlight that the position of the clusters and samples along the y axis does not have any significance, as it results from the dendrogram plot routine used to avoid branch overlap.


Fig. 5. NIR pharmaceutical DiPLS_Cluster dendrogram plot.

4.2. Dataset 2: NIR spectra of pharmaceutical products

The proposed clustering procedure was applied to an NIR dataset acquired for pharmaceutical pills. The major aim was twofold: (i) to evaluate the capabilities of this method to distinguish the concentration levels of the active principal in the pills, and the side of the pill where the spectrum was acquired, and (ii) to see the influence of the pill side on the estimation of the active principal concentration. The application of DiPLS_Cluster to this dataset gave, among other results, the dendrogram shown in Fig. 5. This plot shows that the active principal concentration levels, as

well as the pill side where the spectrum was taken, determine the segmentation of the dataset. Comparing both potential sources of variability, it is clear that the major source of variability is related to the active principal concentration. This is obvious from Fig. 5, as the three major clusters observed are related to the active principal concentration levels, the separation being most noticeable between the highest active principal concentration and the other two levels. This is reasonable, as the concentration levels were 2, 5 and 10 mg/pill. Moreover, within each active principal concentration level, one can observe the pill-side effect (depicted as T, B and M).

Fig. 6. b vector profiles plot of nodes 1 and 2 for NIR spectra.


The DiPLS_Cluster has identified and distinguished the different major sources of variability using only the information present in the NIR spectra. This result also shows that the active principal concentration can be estimated independently of the side of the pill used to acquire the spectra. Another important point is once again the difference in branch lengths. For instance, the distances from node 1 to node 19 and from node 1 to node 2 show that it is more difficult to distinguish the pill sides within a concentration level than to distinguish the concentration groups. Given the information that one can recover from the dendrogram plot of Fig. 5, it would also be interesting to evaluate which NIR regions are most responsible for the two major sources of observed variability. This could be used to select optimal spectral regions for a given factor. One can see in Fig. 5, for instance, that node 1 clearly distinguishes two concentration groups of active principal. The highest concentration is located on the negative y axis, and the two other concentrations are on the positive y axis. Therefore, the spectral variations responsible for this discrimination can be determined by examining the b vector profile for node 1 (Fig. 6, thick line). Several observations can be made concerning this profile. The first is that one can see the most important spectral regions for discriminating between the highest concentration level and the two lower levels. The second is that this difference is very significant, as the number of iterations required to obtain this segmentation is very low (three iterations), i.e. the convergence is very fast, showing that the major source of variability in the dataset is the active principal concentration factor. For the effect of the pill side, one can examine, for instance, node 19, which represents all the samples belonging to the highest active principal concentration level. In this case, the distinction between side B, on the one hand,

and sides T and M, on the other, can be characterised by the variability found in the corresponding b vector profile (Fig. 6, thin line). This observed difference, although important, is less significant than the active principal concentration effect since, for node 19, it took 12 iterations to attain the segmentation based on the pill sides. Another important fact to be highlighted in the comparison between the b vector profiles of nodes 1 and 19 concerns the relative importance of the extracted information. For node 1, and for the nodes linked to the distinction between active principal concentrations, the b vector profiles are less noisy (more structured) than those found for the nodes related to the pill side. This suggests, once more, that the variability related to the active principal concentration is more important than that related to the pill side. This shows that with this approach it is possible to have a visualization of the major sources of variability among the samples in a dataset and, at the same time, to know the reasons for the clustering of the samples, by analysing the node b vectors, i.e. based on the spectral regions that are most related to a given clustering.

4.3. Dataset 3: Apple Visible/Near infrared spectra

In this section, DiPLS_Cluster is applied to a more complex dataset. This dataset comes from an attempt to use spectroscopic data acquired on the surface of apples to develop models to predict the apple ripeness level. As already seen in Section 3, this dataset was acquired on two different apple regions, defined as Face 1 and Face 2. The faces present a difference in colour ("red" face and "green" face). One of the aims of the analysis of this dataset was to see if the effect of the Face factor could be eliminated, as it may hinder the efficient modelling of the ripeness level.

Fig. 7. Visible/NIR DiPLS_Cluster dendrogram plot as a function of Face factor (known sample classification shifted along x axis for clarity).


Fig. 8. Visible/NIR DiPLS_Cluster dendrogram plot as a function of ripeness level (known sample classification shifted along x axis for clarity).

The initial application of DiPLS_Cluster to the dataset (ranging from 384 to 2000 nm) gave a clustering clearly related to the Face factor, so the variability related to the apple colour has in large part masked the effect of the ripeness level. Fig. 7 shows the dendrogram plotted as a function of the Face factor. Apart from a few misplaced samples, a clear distinction between Face 1 and Face 2 is obvious. One can conclude that this is the main source of distinction, since plotting the same dendrogram as a function of the ripeness level (Fig. 8) shows a much less significant separation of the samples. However, a careful examination reveals some differences between the ripeness levels within each Face. One can see, for instance, for both Face 1 and Face 2, that the Fresh level forms a cluster

located closer to the y axis origin, and that the other two ripeness levels are mostly located further away. In order to compare the cluster attribution by DiPLS_Cluster with the sample distribution provided by Principal Component Analysis (PCA), the first two Principal Components (PCs) are shown in Fig. 9. One can observe in this plot that PCA can distinguish both apple faces, with the same misplaced samples as DiPLS_Cluster. At the same time, the clusters identified using DiPLS_Cluster are drawn in the plot. It is interesting to note that, in general, the results provided by PCA and DiPLS_Cluster seem to be the same. However, DiPLS_Cluster can provide some insight into why the samples are misclassified, through examination of the corresponding node b vectors.
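The PCA scores used for this comparison can be computed via an SVD of the column-centred data; a minimal sketch (the helper name `pca_scores` is ours):

```python
import numpy as np

def pca_scores(X, n_pc=2):
    """PCA scores via SVD of the column-centred matrix; used here only to
    compare the sample map (e.g. PC1 vs PC2) with a cluster attribution."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pc] * s[:n_pc]    # scores = U scaled by the singular values
```

Plotting the first two columns against each other reproduces a scores scatter plot such as Fig. 9.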

Fig. 9. Visible/NIR PCA scores scatter plot (PC1 vs. PC2).


Fig. 10. Visible/NIR dataset for node 1 b vector profile plot.

Since the most important segmentation is based on the Face factor, and since this is most marked in node 1, one can recover the b vector related to this node and see (Fig. 10) which spectral regions are responsible for this distinction. In Fig. 10, the node 1 b vector plot shows a very important band located at 549 nm, i.e. it is related to the pigments responsible for the colour differences between the two sides. As already seen in Section 2, the DiPLS_Cluster is a binary segmentation (two groups at a time) technique based on PLS using only 1 Latent Variable. One of the most

interesting properties of DiPLS_Cluster is its orthogonality. This means that after the extraction of one latent variable, one could extract further factors by re-applying the DiPLS_Cluster to the recovered X error matrix at each node. This X error matrix can be defined as the data not modelled by the Latent Variable previously extracted at that node. This could be useful in cases where one could have different clustering trends that may be independent. Using this approach, i.e. taking advantage of the DiPLS_Cluster orthogonality, it should be possible, in many cases, to highlight hidden features (clusters) in a complex dataset.
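The X error matrix used by this orthogonalisation can be sketched as the residual after deflating the single latent variable extracted at a node (the function name `deflate_one_lv` is ours):

```python
import numpy as np

def deflate_one_lv(X, y):
    """Compute the X error matrix left after extracting one PLS latent variable,
    i.e. the part of X not modelled at a node; re-running the clustering on this
    residual can expose trends orthogonal to the first one."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)
    t = Xc @ w                         # scores of the extracted LV
    p = Xc.T @ t / (t @ t)             # X loadings
    return Xc - np.outer(t, p)         # residual E = Xc - t p'
```

By construction the residual E is orthogonal to the extracted score vector t, which is what allows a second, independent clustering trend to surface.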

Fig. 11. Dendrogram plot as a function of Ripeness level for DiPLS_Cluster based on the X error matrix, obtained from the initial analysis of the Visible/NIR dataset (known sample classification shifted along x axis for clarity).


Fig. 12. Dendrogram plot as a function of Face factor for DiPLS_Cluster based on the X error matrix, obtained from the initial analysis of the Visible/NIR dataset (known sample classification shifted along x axis for clarity).

The procedure was applied to the DiPLS_Cluster X error matrix (the differences between the original X matrix and the X̂ matrix modelled using the Latent Variable at node 1) recovered at node 1. Originally, the first Latent Variable was responsible for a clustering based on the Face factor at node 1. In this new application of DiPLS_Cluster, the Ripeness level factor emerges. As can be seen, the resulting dendrogram shows a clear tendency to discriminate the ripeness levels (Fig. 11). From Fig. 11, it is clear that samples belonging to the Fresh level are separated from the Mealy ones, except for a few misplaced samples. Concerning the middle level (Medium), this separation is not as obvious.

However, a closer look at the dendrogram shows that: (i) a few Medium samples seem to be more like the Fresh ones; (ii) a great number of Medium samples are well separated from the Fresh level and (iii) a trend in the separation between the Medium and Mealy levels is visible. This new clustering seems to be orthogonal to the first one (Face clustering) for two important reasons: (i) The dendrogram plotted in Fig. 11 shows a new clustering trend, based on the ripeness level and (ii) the initial clustering based on the Face factor has been entirely removed in this new analysis. Fig. 12, where the DiPLS_Cluster dendrogram is plotted as a function of the Face factor, shows no Face clustering.

Fig. 13. Visible/NIR PCA scores scatter plot (PC2 vs. PC3).


Fig. 14. b vector profiles plot for the DiPLS_Cluster based on the node 1 X error matrix, obtained from the initial analysis of the Visible/NIR dataset.

Once more, this result is compared to a PCA of the dataset. The scores scatter plot of PC2 vs. PC3 appears to be related to the ripeness level (Fig. 13): the PCA scores show a trend towards the distinction between the three ripeness levels. The cluster attribution provided by DiPLS_Cluster is outlined in the same plot. Again, this procedure can help identify the reasons for the misclassification of a priori known samples, through the analysis of the nodes associated with the segmentation of those samples. At this point it is interesting to determine which spectral region is responsible for the distinction between the three ripeness levels. Fig. 11 shows the nodes where that separation is most important: node 1 separates the Fresh level from the Medium + Mealy levels, while the distinction between Medium and Mealy is mainly located at node 75. The corresponding node b vector profiles are plotted in Fig. 14, where it is clear that a major band, at 668 nm, is related to the separation of the three ripeness levels. It is interesting to note, once more, that the effect of the Face factor is almost completely removed.

4.4. Dataset 4: NIR spectra of aqueous fructose solutions

As can be seen in Fig. 15, the application of DiPLS_Cluster to the NIR spectra of aqueous fructose solutions successfully reproduces the a priori clustering, i.e. the different fructose concentration levels. An important feature of this dendrogram is the separation into two main clusters (located at node 1): fructose concentrations ranging from 5% to 30% (negative y axis), and those ranging from 40% to 60% (positive y axis). This observation has been related to a major re-arrangement of the hydration of fructose molecules in the aqueous solutions [17]. As pointed out previously, several statistical entities are associated with each node. As an example, in Fig. 16, the b vectors related to the separations observed at nodes 3 (5% vs. 10%), 20 (20% vs. 30%), 36 (50% + 40% vs. 60%) and 37 (40% vs. 50%) are plotted. It can be seen that the spectral regions related to the observed clustering are common to all clusters, with small differences in certain regions. Despite this similarity of the b vectors, they have different weights for the different clusters. Moreover, some important bands have slightly different shapes (e.g. those located around 1410, 1908 and 1968 nm). This confirms that DiPLS_Cluster can not only provide information about the relationships among samples, but also, through the b vectors, give information about the reasons for the grouping.

Fig. 15. Dendrogram plot for DiPLS_Cluster on NIR spectra of aqueous fructose solutions.

Fig. 16. b vector profiles plot of the most important dendrogram nodes of the DiPLS_Cluster on NIR spectra of aqueous fructose solutions.
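How a node b vector singles out the discriminating variables can be sketched for the simplest case: a single-Latent-Variable PLS1 regression of the spectra onto a ±1 membership vector encoding the two sub-clusters at the node. This is an illustrative sketch under that one-component assumption, with hypothetical names, not the paper's exact algorithm.

```python
import numpy as np

def node_b_vector(X, membership):
    """One-latent-variable PLS1 regression vector for a dendrogram node.
    `membership` assigns each sample to one sub-cluster (+1) or the
    other (-1). Large |b| entries flag the variables (e.g. wavelengths)
    responsible for the separation at this node."""
    Xc = X - X.mean(axis=0)
    yc = membership - membership.mean()
    w = Xc.T @ yc                 # weight vector
    w = w / np.linalg.norm(w)
    t = Xc @ w                    # scores
    q = (t @ yc) / (t @ t)        # inner regression coefficient
    return w * q                  # b vector (single-component case)
```

Plotting such b vectors against wavelength is what reveals, for instance, the band at 668 nm driving the ripeness separation, or the bands around 1410, 1908 and 1968 nm in the fructose dataset.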

5. Conclusion

This work proposes a new approach (DiPLS_Cluster) for performing cluster analysis. The most important advantage of this method is that it allows the vectors calculated by the PLS regression to be used to help understand the causes of the segmentation of the dataset. The method can be used as a tool to provide in-depth global and local analyses of the factors that contribute to the segmentation, based only on the internal variability among the samples. The results presented show that both the interpretation of the dendrograms and of their characteristic nodes contribute to a better understanding of the data being analysed, the b vectors in particular aiding in the interpretation of the reasons for the grouping, in terms of the variables.

Another advantage of PLS_Cluster is the possibility of using a range of method-independent membership functions. This is an important feature, since one could analyse a given dataset from different perspectives, highlighting different relationships, simply by varying the properties of the membership function. The implementation of different membership functions will be carried out in future work to evaluate the importance of their effect on the segmentation.

It has also been shown that the proposed method preserves the ''orthogonality property'' of PLS, meaning that it can recover different sources of variability from the same dataset. Hence, it should be possible in many cases to remove interfering factors. One could also apply a PCA to the initial dataset and then use PLS_Cluster on a given subset of PCs.

Furthermore, being based on PLS regression, this new procedure could be used to classify new samples: by recovering the b vectors from each node, one can trace a path through the dendrogram to assign the new sample to a cluster. The PLS_Cluster framework also allows the application of cross-validation to assess cluster validity. This is possible because the procedure is entirely based on the PLS1 algorithm: using the b vector, it is possible to predict where a temporarily deleted object is classified. One could envisage a cross-validation not only at the overall dendrogram level (at the bottom or at the top) but also at each node (segmentation) level.
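The proposed classification of a new sample — descending the dendrogram by applying each node's b vector and branching on the sign of the predicted membership — can be sketched as follows. The node structure and names are hypothetical; this is one possible reading of the procedure, not the authors' implementation.

```python
import numpy as np

class Node:
    """A dendrogram node holding the b vector and the centring terms
    computed when the node was built, plus its two children; leaves
    carry only a cluster label. Hypothetical structure for illustration."""
    def __init__(self, b=None, x_mean=None, y_mean=0.0,
                 left=None, right=None, label=None):
        self.b, self.x_mean, self.y_mean = b, x_mean, y_mean
        self.left, self.right, self.label = left, right, label

def classify(node, x):
    """Trace a new sample x down the dendrogram: at each node, predict
    the membership value with the stored b vector and branch on its sign."""
    while node.label is None:
        y_hat = (x - node.x_mean) @ node.b + node.y_mean
        node = node.left if y_hat < 0 else node.right
    return node.label
```

The same prediction step supports the cross-validation idea: a temporarily deleted object can be passed through `classify` to check whether it lands in its expected cluster, either at the level of the full dendrogram or at any individual node.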

Acknowledgements We thank Dr. Ann Peirs (KU Leuven, Belgium) who acquired the Visible/Near infrared spectra of apples and the European Commission for financing the ‘‘ASTEQ’’ Concerted Action (Contract no. FAIR5-CT97-3516).

References

[1] R.-Q. Yu, Introduction to Chemometrics, Hunan Education Publishing House, Changsha, 1991.
[2] M. Forina, C. Armanino, V. Raggio, Anal. Chim. Acta 454 (2002) 13–19.
[3] S. Sharma, Applied Multivariate Techniques, Wiley, New York, 1996, p. 185.
[4] M. Daszykowski, B. Walczak, D.L. Massart, Chemometr. Intell. Lab. Syst. 56 (2001) 83–92.
[5] J.-H. Jiang, J.-H. Wang, X. Chu, R.-Q. Yu, Anal. Chim. Acta 354 (1997) 263–274.
[6] F. Gan, G. Xu, L. Zhang, Y. Liang, Anal. Sci. 17 (2001) 869–873.
[7] S. Wold, H. Martens, H. Wold, in: A. Ruhe, B. Kågström (Eds.), Lecture Notes in Mathematics: Proceedings of the Conference Matrix Pencils, March, Springer, Heidelberg, 1982, pp. 286–293.
[8] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1–17.
[9] S. Wold, M. Sjöström, L. Eriksson, Chemometr. Intell. Lab. Syst. 58 (2001) 109–130.
[10] S. Wold, J. Trygg, A. Berglund, H. Antti, Chemometr. Intell. Lab. Syst. 58 (2001) 131–150.
[11] I.S. Helland, Chemometr. Intell. Lab. Syst. 58 (2001) 97–107.
[12] H. Martens, Chemometr. Intell. Lab. Syst. 58 (2001) 85–95.
[13] R. Manne, Chemometr. Intell. Lab. Syst. 2 (1987) 187–197.
[14] I.S. Helland, Scand. J. Statist. 17 (1990) 97–114.
[15] I.S. Helland, Commun. Stat., Simul. 17 (1988) 581–607.
[16] http://www.inapg.inra.fr/ens_rech/siab/asteq/ (October 2002).
[17] D.N. Rutledge, A.S. Barros, R. Giangiacomo, in: G.A. Webb, P.S. Belton, A.M. Gil, I. Delgadillo (Eds.), Magnetic Resonance in Food Science 2001: Proceedings of the Conference, September 2000, Aveiro, RSC, 2001, pp. 179–192.
