Multivariate exploratory data analysis and graphics: a tutorial


JOURNAL OF CHEMOMETRICS, VOL. 7, 305-340 (1993)

REVIEW ARTICLE

MULTIVARIATE EXPLORATORY DATA ANALYSIS AND GRAPHICS: A TUTORIAL

C. WEIHS
Mathematical Applications, Information Services, R-1008.22.22, CIBA-GEIGY Ltd., CH-4002 Basel, Switzerland

SUMMARY

Exploratory data analysis (EDA) is a toolbox of data manipulation methods for looking at data to see what they seem to say, i.e. one tries to let the data speak for themselves. In this way there is hope that the data will lead to indications about 'models' of relationships not expected a priori. In this respect EDA is a pre-step to confirmatory data analysis, which delivers measures of how adequate a model is. In this tutorial the focus is on multivariate exploratory data analysis for quantitative data using linear methods for dimension reduction and prediction. Purely graphical multivariate tools such as 3D rotation and scatterplot matrices are discussed after the univariate and bivariate tools on which they are based have been introduced. The main tasks of multivariate exploratory data analysis are identified as 'search for structure' by dimension reduction and 'model selection' by comparing predictive power. Resampling is used to support validity, and variables selection to improve interpretability.

KEY WORDS

Stem & leaf display; Histogram; Boxplot; Quantile plot; Scatterplot; Regression; Smoothing; 3D rotation; Scatterplot matrix; OMEGA strategy; Dimension reduction; Stability of structure; Resampling; Interpretation of structure; Prediction models; Variables selection

1. INTRODUCTION

1.1. Aims of exploratory data analysis

J. W. Tukey claimed that his fundamental work Exploratory Data Analysis [1] is based on the principle that 'it is important to understand what you can do before you learn to measure how well you seem to have done it', and indeed, exploratory data analysis (EDA) is a toolbox of data manipulation methods for 'looking at data to see what they seem to say'. To this end the basic EDA methods try to make the data more easily and effectively handleable by the user, whether statistician or non-statistician. The great popularity of EDA methods doubtless stems from the fact that most of them are mainly graphical. Looking at the 'right graphics' has proved to improve our understanding of the data much more than just considering summary statistics, possibly even based on very restrictive assumptions which we know we cannot check in practice. In exploratory data analysis one tries to lose as little information as possible in order to let the data speak for themselves. In this way the data should lead to indications about 'models' of relationships not expected a priori. This relates to the above basic principle in that exploratory data analysis is a pre-step to confirmatory data analysis, which delivers measures of how adequate a model is.



Thus exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone - as the first step. Tukey [1] only describes univariate and bivariate EDA techniques. All that was said about EDA should, however, also be true for multivariate exploratory data analysis (MEDA) techniques, i.e. for the joint analysis of more than two variables (parameters, characteristics). The only difference is the higher complexity of the methods, which again appear mostly in graphical illustrations but which are often heavily based on non-trivial algebraical calculations. MEDA methods, as they are understood in this paper, mainly comprise dimension reduction and prediction techniques. The outcomes of such methods are mainly presented by multivariate versions of exploratory univariate and bivariate graphics, sometimes called visual exploratory data analysis (VEDA) methods [2]. In this paper the term MEDA is used not only for the algebraical but also for the graphical methods.

Since (M)EDA methods are mainly based on no assumptions, it generally appears difficult to identify a priori the most adequate methods for an actual data set. Instead, a toolbox is offered with the idea of carrying out parallel analyses applying different methods. This is particularly true for dimension reduction and prediction, where the different methods optimize different criteria, leading to different views (dimension reductions) of high-dimensional data or to different prediction models. Nevertheless, using a reasonably 'standardized' user-directed flow of techniques has proved to be very useful, especially in routine industrial practice [3-5].

Another consequence of using no assumptions is the 'softness' of the (M)EDA results. Again following Tukey [1], EDA is detective work. However, as is well known, detectives may be trapped by wrong clues, i.e. something which seems to be true on inspecting the graphics might afterwards prove to be a phantom. Senn [6] joked that 'EDA (is) the art of discovering a Rembrandt in a Jackson Pollock'. Note that the term 'art' may indicate that one needs not only practice but also some talent to become a good data analyst, and that the term 'discover' not only confirms the relation to detective work but also indicates the triumph of having detected the unexpected. However, to confirm what has been detected, the user should rely on his common sense, which may well suspect that the result is spurious.

Typically, data analysis is applied to 'unstructured data', i.e. to data not primarily gathered for the purpose of data analysis. This paper mainly relies on three examples (see Sections 1.2-1.4). One example concerns data from industrial production, where controlled variation of control factors is not appropriate because in this way product quality might be corrupted. Thus the data are just 'observational'. The other examples are from quantitative structure-activity relationship (QSAR) analyses where the data were generated following chemical, not statistical, aspects.

Typical (M)EDA aims are

(i) illustration and characterization of the distribution of single variables, i.e. location, variation and shape of distribution, 'outliers';
(ii) the finding of relationships between two or more variables (structure recognition);
(iii) the prediction of some of the variables by means of the other variables, in particular for quality improvement.
The paper focuses on multivariate analyses, but starts with a discussion of those univariate and bivariate exploratory methods which are the building blocks of multivariate analyses. Moreover, the paper focuses on linear methods for quantitative, measured data. Non-linearities are only discussed at some length with the bivariate building blocks (Section 3), qualitative data only with the univariate building blocks and the prediction methods (Sections 2 and 6). Other non-linear methods and methods for qualitative data will be discussed only briefly, in particular by reference. Also, the paper only discusses one- and two-block analyses, i.e. structure will be searched for inside one specified group of variables and models will be selected in order to predict one group of variables by one other group. A reference for multiblock analyses is the book by Gifi [7].

The plan of the paper is as follows. In Sections 1.2-1.4 the data sets used to illustrate the methods will be introduced, Sections 2 and 3 relate to univariate and bivariate graphics, Section 4 relates to multivariate graphics, and Sections 5 and 6 introduce dimension reduction and prediction methods as well as methods for the comparison of the outcomes of different dimension reductions and predictions. Whereas Sections 2-5 study the properties and relationships within one specified group of variables, Section 6 explores the relationships between two specified groups of variables, suggesting a prediction model. In this way Section 6 builds a bridge between exploratory and confirmatory analysis in that the adequacy of such a model has to be confirmed. Figure 1 illustrates the structure of the paper.

Figure 1. Paper structure: look at what you have got (original data) within one specified group (Sections 2-4); criterion-based views: dimension reduction (Section 5); between two specified groups, model selection: prediction (Section 6); confirmation

1.2. Example 1: dyestuff production

During a major CIBA-GEIGY investigation into the quality of dyestuffs, 29 variables were measured for 93 dyestuff batches (observations) of a certain blue dyestuff. The 29 variables fall into two categories: 18 correspond to analytical properties and eleven relate to quality measures. The analytical properties correspond to portions of well-defined chemical compounds in the dyestuff (or derived quantities); the quality measures are visual judgements or technical measurements of the colour perception (strength; hue and brightness differences) of the produced dyestuff compared with a standard when applied to polyester. Details of the variables are provided in Table 1. The analytical properties are mainly reported as percentages of the compounds in question relative to all compounds or to all organic compounds. Two analytical properties (17, 18) were constructed by the producer in order to predict hue differences. One of these, the centre-of-gravity wavelength, was constructed from the maxima of the spectra of all relevant compounds. Of the quality measures, five variables (19, 22, 23, 26, 27) are visual judgements translated to a numerical scale and six variables (20, 21, 24, 25, 28, 29) are technical colorimetric measurements generated by an agreed, partly normed algorithm.


Table 1. Dyestuff production: variables

Analytical properties (units mainly %ot or %o)
 1  TERCUP    TERtiary CoUPling component
 2  DNANDI    DiNitro-ANiline-DIazo component
 3  DNBZDI    DiNitro-BenZole-DIazo component
 4  SECMC     SECondary Main Component
 5  TERTMC    TERTiary Main Component
 6  PRIMSC    PRIMary Side Component
 7  SECSC     SECondary Side Component
 8  DNSEC     DiNitro SECondary side component
 9  TERTSC    TERTiary Side Component
10  SUNKUV    Sum of UNKnown UV absorbers
11  SUMUV     SUM of known UV absorbers
12  SUMRED    SUM of RED dyestuffs
13  SUMGRD    SUM of GReen Dyestuffs
14  SUNKDY    Sum of UNKnown DYestuffs
15  SUMDYE    SUM of DYEstuffs
16  TOTORG    TOTal of ORGanic compounds
17  SEC/TE    SECondary/TErtiary main component
18  LAMBDAC   LAMBDA: centre-of-gravity wavelength            nm

Quality measures
19  STRVI     STRength, VIsually                              %os
20  STRREM    STRength, REMission                             %os
21  STRTRA    STRength, TRAnsmission                          %os
22  HUEVI     HUE, VIsually, daylight                         codeH
23  HUEVIAL   HUE, VIsually, Artificial Light                 codeH
24  HUEREM    HUE, REMission, daylight                        CIELAB
25  HUEREMAL  HUE, REMission, Artificial Light                CIELAB
26  BRIVI     BRIghtness, VIsually, daylight                  codeB
27  BRIVIAL   BRIghtness, VIsually, Artificial Light          codeB
28  BRIREM    BRIghtness, REMission, daylight                 CIELAB
29  BRIREMAL  BRIghtness, REMission, Artificial Light         CIELAB

Units: %ot, percentage of TOTal of ORGanic compounds (TOTORG); %o, percentage of all compounds; nm, nanometres; %os, percentage relative to the standard; codeH, codes for hue differences from the standard translated into numerical equivalents (+, redder; -, greener); CIELAB, colorimetric international standard hue and brightness differences; codeB, codes for brightness differences from the standard translated into numerical equivalents (+, brighter; -, duller).

The main questions of interest to the producer were the following.

(i) Is there any interesting structure in the data? What do the empirical distributions of the variables look like? Are there striking observations? Are there (linear or non-linear) relationships between the variables? The motivation behind these questions is the desire to characterize the production by characterizing the distribution of important variables, to 'clean' the data from outliers and to reduce the amount of measurements by eliminating superfluous variables.

(ii) Can the colour properties of the dye be predicted from the analytical measurements and, if so, how? The motivation behind this question is the desire to control the production process in order to produce dyestuffs with the required colour properties. Batches which fail to come up to standard require costly and time-consuming corrective treatment.


1.3. Example 2: Voltaren analogs

The data were essentially taken from a retrospective quantitative structure-activity relationship study of the anti-inflammatory potency of a series of substances similar to diclofenac, the well-known CIBA-GEIGY drug Voltaren. The data include 32 such so-called analogs of the basic structure (observations), 13 physicochemical properties of the molecules and two biological activities, one in vitro (PGS) and one in vivo (AdA). In this study one was interested in the best prediction of the biological activities by means of the physicochemical properties. In particular, one wanted to compare the predictive power of the multivariate dimension reduction methods with regression, which was the only method studied before with these data.



1.4. Example 3: drug under development

During the development of a new drug, 19 variables were determined for 22 analogs of a basic drug structure (observations). The 19 variables fall into two categories: 18 correspond to physicochemical properties, one relates to biological activity. The 18 physicochemical properties are lipophilicity, three electronic and four steric properties of the structure elements in two out of five relevant structure positions of the molecules (positions 4 and 6) and two position occupation indicators for the other three positions (positions 2, 3 and 5). After elimination of highly correlated variables, eleven physicochemical variables were left. The biological activity is a qualitative variable with values 'effective' and 'ineffective'. In this study one was interested in suggestions for new molecules with improved biological activities.

2. UNIVARIATE BUILDING BLOCKS

Every MEDA is built around multivariate graphical tools, which all use univariate and bivariate graphics as building blocks. Most of the building blocks discussed in this paper were (at least in a similar way) already proposed by Tukey. Nevertheless, the building blocks will be discussed briefly, not only for completeness but also in order to introduce extensions to standard use. The univariate graphical building blocks to be discussed are the stem & leaf display, the histogram, the boxplot and the quantile plot.

All four tools try to illustrate the distribution of unstructured data. Thus a possible time influence on the data will in no case be detectable. However, this is the only information lost when using a stem & leaf display. Using a histogram, the individual data values are also discarded; only the frequencies of data classes are displayed. On the other hand, histograms are more appropriate than stem & leaf displays for identifying different strata, i.e. different observation groups. Those parts of a histogram bar corresponding to different strata can be shaded or coloured differently. A boxplot just displays some summary statistics, while a quantile plot checks the observed distribution against a theoretical one.

2.1. Stem & leaf display

Table 2 shows typical unstructured data listed in historical order. The data correspond to HUEREMAL, i.e. to hue deviations from a standard under artificial light (-, greener; +, redder). Figure 2 presents the corresponding stem & leaf display.

Table 2. Typical unstructured data: hue deviation from the standard under artificial light (93 values listed in historical order; minimum -0.710, maximum 0.300)

Figure 2. Stem & leaf display: hue deviation from the standard under artificial light, with an added interpretation column (noticeably greener; slightly greener; very slightly greener; as required; very slightly redder; tc = type conform, ntc = not type conform)

A stem & leaf display is a clever ordering of the data in a 'table' raising a graphical impression. The main idea is to split a data value into stem and leaf. The leaf is the last non-constant digit of the value, possibly after rounding. The stem is built out of the leading digits. The stem values are arranged vertically; 'gaps' are filled with blanks. The leaf values are arranged horizontally in branches right beside the corresponding stem value. The leaves are ordered by size. Stem and leaves are separated by a vertical line. The number of leaves in a branch is noted right beside the branch.

The speciality of the stem & leaf display in Figure 2, however, is that an interpretation of the distribution is added to the standard display. This interpretation has to be delivered by the user and is nothing but a categorization of the displayed variable attaching appropriate names to the classes. Such a categorization should reflect the usual, possibly informal, interpretation of outcomes of the displayed variable. In our case there are two kinds of categorization, one reflecting the corresponding colour judgement and the other reflecting whether the corresponding batches would be accepted as type conform (tc) or not (ntc). Using these categorizations, the stem & leaf display leads to the following statements about the distribution of HUEREMAL.

(i) The range of the data is from 'noticeably greener' to 'very slightly redder'.
(ii) The centre of the data lies with 'very slightly greener'.
(iii) The distribution of the data around the centre is fairly symmetric.
(iv) The distribution is approximately normal.

Moreover, there are hardly any ‘redder’ batches and only 23 out of 93 batches are ‘type conform’. Note that the centre being with ‘very slightly greener’ was intended, since it is much easier to correct greener than redder batches, and indeed, only greener batches had to be corrected from being ‘not type conform’.
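To make the construction concrete, the following minimal Python sketch prints a basic stem & leaf display. It assumes NumPy only; the data are simulated stand-ins for the HUEREMAL values of Table 2 (not the original measurements), and the leaf is taken as the second decimal digit, mirroring Figure 2.

```python
import numpy as np
from collections import defaultdict

def stem_and_leaf(values, scale=100):
    """Print a basic stem & leaf display: values are scaled to integers
    (here hundredths); the last digit is the leaf, the leading digits
    (with sign) form the stem; branch counts are shown on the right."""
    branches = defaultdict(list)
    for v in values:
        n = int(round(v * scale))
        sign = "-" if n < 0 else " "
        stem, leaf = divmod(abs(n), 10)
        branches[(sign, stem)].append(leaf)
    print(f"min = {min(values):.3f}   max = {max(values):.3f}")
    # order the branches from the most negative stem to the most positive
    for sign, stem in sorted(branches, key=lambda k: -k[1] - 0.5 if k[0] == "-" else k[1]):
        leaves = "".join(str(d) for d in sorted(branches[(sign, stem)]))
        print(f"{sign}{stem:02d} | {leaves:<28} # {len(leaves)}")

# simulated stand-in for the 93 HUEREMAL values (hue deviation, artificial light)
rng = np.random.default_rng(7)
hueremal = np.round(rng.normal(-0.20, 0.17, size=93), 2)
stem_and_leaf(hueremal)
```

The interpretation column of Figure 2 would simply be added by hand next to the printed branches, since it encodes user knowledge rather than anything computable from the data.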


Figure 3. Qualitative stem & leaf display: hue deviation from the standard under artificial light (measured (stem) versus visually judged (leaves)). Legend: m, much greener; n, noticeably greener; s, slightly greener; v, very slightly greener; 0, as required; r, redder

For a qualitative variable the stem & leaf display degenerates to a histogram (see Section 2.2). An adapted stem & leaf display may be useful, however, for a bivariate characteristic where one variable is qualitative and the other is quantitative. In this case the quantitative variable is used for building the stem and the qualitative variable is displayed as the leaves. In order to improve readability, the qualitative variable might have to be coded. In this way the relationship e.g. between the hue deviations measured (quantitative) and visually judged (qualitative) under artificial light may be illustrated as in Figure 3. Note that the interpretation component corresponds to the quantitative ‘stem-variable’. Obviously, the correspondence of the interpretation of the measurements and the visual judgments is very good.



2.2. Histogram

The histogram corresponding to the stem & leaf display in Figure 2 is shown in Figure 4 after stratification. Indeed, the histogram, ignoring all information inside the branches of the stem & leaf display, readily reflects different strata, i.e. groupings of the data, by different shades or colours. In the example the darker-shaded batches correspond to batches prepared to be sent to Indonesia, and to be worked up there. One might guess that the distribution of hue deviations is similar for the 'normal' and the Indonesian batches.

Figure 4. Stratified histogram: hue deviation from the standard under artificial light

For one qualitative variable the histogram is often drawn with separated vertical bars (bar chart) in order to indicate that the values on the x-axis are discrete. If the qualitative variable is indeed nominal, i.e. its values do not have any 'natural' order, the histogram is often ordered by bar size. Adding a curve characterizing the summed-up percentages leads to a so-called Pareto diagram, which is often used to identify 'important' values of such a variable. A slight misuse of such a diagram for an ordinal variable is presented in Figure 5, identifying those visual judgments of hue deviations 'typical' for the production of the dyestuff investigated. Obviously, the most typical judgement was 'very slightly greener', but the neighbouring judgments 'slightly greener' and 'as required' also appeared reasonably frequently. The other judgments were altogether observed in only 15% of the cases and could be marked as exceptional.

Figure 5. Pareto diagram: visual judgment of hue deviation under artificial light. Legend: m, much greener; n, noticeably greener; s, slightly greener; v, very slightly greener; o, as required; r, redder
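The construction of such a Pareto diagram amounts to ordering the categories by frequency and adding a cumulative percentage curve. The sketch below shows this in Python with matplotlib; the judgement counts are invented for illustration and are not the dyestuff data.

```python
import numpy as np
import matplotlib.pyplot as plt

# invented counts for the visual hue judgements (not the original data)
counts = {"v": 40, "s": 20, "o": 19, "n": 8, "m": 3, "r": 3}

# order categories by decreasing frequency and compute cumulative percentages
labels, freqs = zip(*sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
freqs = np.array(freqs, dtype=float)
cum_pct = 100 * np.cumsum(freqs) / freqs.sum()

fig, ax = plt.subplots()
ax.bar(labels, freqs)                      # ordered bar chart
ax2 = ax.twinx()                           # second axis for the cumulative curve
ax2.plot(labels, cum_pct, marker="o", color="black")
ax2.set_ylim(0, 105)
ax.set_ylabel("frequency")
ax2.set_ylabel("cumulative %")
plt.show()
```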

2.3. Boxplot

The boxplot corresponding to the stem & leaf display in Figure 2 is shown in Figure 6 together with the corresponding boxplot of brightness deviations (BRIREMAL: -, duller; +, brighter).

Figure 6. Boxplots: comparison of the distributions of hue deviation and brightness deviation, with type conformity limits indicated

A boxplot is a graphical illustration of summary statistics of the empirical distribution of a quantitative variable. In the displayed version the lower and upper quartiles are indicated by the box, with the median as the inner line. Moreover, the mean is represented as a star, and the lines above and below the boxes (the so-called 'whiskers') are drawn to extreme values not greater than the upper quartile plus 1.5 times the quartile difference and not less than the lower quartile minus 1.5 times the quartile difference respectively. The values above or below the whiskers, indicated by circles, are called outer points or 'outliers'. Note that the boxplot is the only univariate graphic which explicitly marks 'outliers', but 'far off' points can also be identified in the other graphics. As an interpretation aid, the type conformity limits corresponding to the colorimetric measurements are indicated by horizontal lines. One can see that according to brightness nearly all batches are not type conform but brighter, which can easily be corrected by adding 'dirt'.

For qualitative variables boxplots may not be defined (nominal case) or might at least pose interpretation problems (ordinal case with a small number of categories). Qualitative variables may be used, however, to split the sample in order to display parallel boxplots for the subsamples. Moreover, the relative subsample sizes may be indicated by the relative widths of the boxplots (proportional boxplots).
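The summary statistics behind such a boxplot are easy to compute directly. The following sketch uses plain NumPy and the 1.5 times quartile-difference rule described above; the variable name and the simulated values are illustrative only.

```python
import numpy as np

def boxplot_stats(x):
    """Quartiles, whiskers (1.5 times the quartile difference) and outliers."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # whiskers end at the most extreme observations inside the fences
    lower_whisker = x[x >= lo_fence].min()
    upper_whisker = x[x <= hi_fence].max()
    outliers = x[(x < lo_fence) | (x > hi_fence)]
    return {"mean": x.mean(), "median": med, "q1": q1, "q3": q3,
            "whiskers": (lower_whisker, upper_whisker), "outliers": outliers}

# illustrative data standing in for a brightness-deviation variable
rng = np.random.default_rng(2)
briremal = rng.normal(0.5, 0.1, size=93)
print(boxplot_stats(briremal))
```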

2.4. Quantile plot

Sometimes the stem & leaf display, the histogram or the boxplot may suggest the kind of distribution of the univariate characteristic considered, e.g. as in Section 2.1. In other cases one may have the idea that a non-linear transformation of the observed characteristic belongs to a specified class of distributions. This may then be checked by means of a quantile plot, e.g. by an animated normal quantile plot. In a normal quantile plot the sorted normalized values of the characteristic are displayed against the normal quantiles of the expected uniform order statistics. If the characteristic were approximately normal, the plot should look approximately like a line. Otherwise, normality should be rejected.

Figure 7. Animated normal quantile plot: secondary/tertiary main component (SEC/TE) under different power transformations, with the power controlled by a scroll bar

(Controlled) animation of a normal quantile plot relates to continuously transforming the displayed characteristic by changing one parameter of a family of transformations, e.g. the power of the original variable. This may be realized interactively at a computer by moving a scroll bar between the extreme parameters by means of a mouse. If the computer is powerful enough, this results in a smooth real-time change from one normal quantile plot to the other. Figure 7 shows four power transformations of the variable SEC/TE. Only (SEC/TE)^(-1) leads to an approximate line and may thus be approximately normal.

Animation is an extremely valuable graphical tool which increases the power of many graphical methods (see e.g. Section 3.3). With one normal quantile plot alone one might have difficulties in judging linearity; the easy comparison delivered by animation trains the eye. With controlled animation one actively controls the graphics; with non-controlled animation one just passively follows a prepared series of views (see e.g. Section 4.1). Note that the quantile plot in a way builds a bridge between univariate and bivariate building blocks, since it is univariate, displaying one characteristic only, but appears in the form of a scatterplot, the main bivariate graphical tool.
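A non-interactive sketch of the same idea follows: the scroll-bar animation is replaced by a small grid of candidate powers, and the straightness of each normal quantile plot is summarized by the correlation between the sorted standardized values and the expected normal quantiles. SciPy's normal quantile function is used; the data are a simulated skewed stand-in for SEC/TE.

```python
import numpy as np
from scipy.stats import norm

def qq_straightness(x):
    """Correlation between sorted standardized values and normal quantiles;
    values close to 1 indicate an approximately straight quantile plot."""
    x = np.sort(np.asarray(x, dtype=float))
    z = (x - x.mean()) / x.std()
    n = len(x)
    q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # expected normal quantiles
    return np.corrcoef(z, q)[0, 1]

# illustrative skewed data standing in for SEC/TE
rng = np.random.default_rng(3)
sec_te = rng.lognormal(mean=0.0, sigma=0.5, size=93)

# grid of candidate powers in place of the interactive scroll bar
for p in (-1.0, -0.5, 0.5, 1.0, 2.0, 3.0):
    print(f"power {p:+.1f}: straightness {qq_straightness(sec_te ** p):.4f}")
```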

2.5. Summary

The univariate building blocks discussed, namely the stem & leaf display, the histogram, the boxplot and the quantile plot, have been presented in an order in which they display less and less information about the distribution of the studied variable. Figure 8 shows these basic blocks together with the terms discussed with them.

Figure 8. Univariate building blocks (stem & leaf display: interpretation, qualitative version; histogram: stratification, bar chart, Pareto diagram; boxplot: interpretation, parallel boxplots; quantile plot: normality, non-linear transformations)

3. BIVARIATE BUILDING BLOCKS

Obviously, some special forms of the univariate building blocks introduced in Section 2 are already bivariate, namely the qualitative stem & leaf display because the stem variable is different from the leaves variable, the stratified histogram because stratification is caused by a qualitative variable, the parallel box plots showing more than one variable in parallel and the quantile plot showing a special kind of scatterplot. Moreover, the interpretation component in the stem & leaf display and the boxplots also represents a new dimension. In what follows, the use of bivariate building blocks for detecting dependences between two variables will be discussed.

3.1. Scatterplot

The main graphical tool used to illustrate dependence or independence of two variables is the scatterplot. In its crude form a scatterplot displays the values of two variables, one variable on the horizontal (x-) axis, the other on the vertical (y-) axis. Scatterplots are often used to look for linear dependences. For this purpose a scatterplot is much more informative than the correlation coefficient. This should be clear from Figure 9. Note that stratification may also easily be included in scatterplots by using different symbols, colours or shades for the different strata.

Figure 9. Scatterplots with the same correlation 0.8

3.2. Linear approximation

In order to characterize the relationship between the two displayed variables, linear least-squares regression is the standard method to be applied to the scatterplot. In Figure 10(a) the linearity of the relationship between hue deviation under artificial light (HUEREMAL) and centre-of-gravity wavelength (LAMBDAC) is demonstrated.

Figure 10. (a) Scatterplot with regression line (approximation(HUEREMAL) = 247 - 0.42 LAMBDAC, R² = 0.73). (b) Residual plot

The goodness-of-fit measure R² is the squared correlation coefficient between HUEREMAL and its approximation and thus between HUEREMAL and LAMBDAC, since the approximation is a linear function of LAMBDAC only. Another way to illustrate the adequacy of linear smoothing is a residual (scatter)plot showing the regression residuals of a linear approximation against the approximation itself. If no structure can be seen, no systematic explanation factor is expected to be missing. Figure 10(b) is the residual plot corresponding to the linear approximation in Figure 10(a). Both Figures 10(a) and 10(b) seem to indicate that the linear approximation is adequate.
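A least-squares line, the R² measure and the residual-plot quantities can be obtained in a few lines of NumPy. The sketch below uses simulated values in place of HUEREMAL and LAMBDAC, so its coefficients only roughly echo those shown in Figure 10.

```python
import numpy as np

rng = np.random.default_rng(4)
lambdac = rng.uniform(588.0, 590.0, size=93)                 # x-variable (simulated)
hueremal = 247.0 - 0.42 * lambdac + rng.normal(0, 0.1, 93)   # y-variable with noise

# least-squares fit y = a + b*x
b, a = np.polyfit(lambdac, hueremal, deg=1)
fitted = a + b * lambdac
residuals = hueremal - fitted

# R^2: squared correlation between y and its linear approximation
r2 = np.corrcoef(hueremal, fitted)[0, 1] ** 2
print(f"approximation(HUEREMAL) = {a:.1f} + {b:.3f} * LAMBDAC,  R^2 = {r2:.2f}")
# a residual plot would show `residuals` against `fitted`
```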

3.3. Transformation to linearity

If the goodness-of-fit or the residual plot indicates the inadequacy of a linear approximation, non-linear dependences might be present or the two variables might be essentially independent. There are different kinds of non-linearities. The case where there are standard transformations of the variables such that the non-linear relationship becomes linear afterwards is discussed in this subsection. More general non-linear functional relationships will be discussed in Section 3.4 and clusters in Section 3.5. A graphical test for independence will be introduced in Section 3.6.

For the transformation to linearity the following 'rules' may be helpful.

(1) If the fit of the present model is inadequate, then transform the y-variable if the residual variance is not constant for different x, or transform the x-variable if the residual variance is constant.
(2) If the fit of the present model is adequate and the residual variance is not constant, then transform the x- and y-variables in the same way.

The kind of transformation to be applied depends on the class of models in which one is interested. For example, one can try power or Box-Cox transformations. To determine the actual transformation, the above 'rules' can be used to determine the variable(s) to be transformed, and then the form of the misfit can be considered according to the following 'rules' by Tukey.

(3) If the 'visually identified' relationship is concave/convex instead of linear, then the power of the y-variable should be increased/decreased.
(4) If the 'visually identified' non-linear approximation bulges to the right/left, then the power of the x-variable should be increased/decreased.

Concerning residual structure, these 'rules' only work if the residual variance is constantly increasing or decreasing. In the case of more irregular residual patterns an appropriate transformation to linearity should not be expected using power transformations. Note that the 'rules' only give the direction of change for a variable. The best power can be determined by means of (controlled) animation, i.e. by interactive change of the power in the right direction by means of a scroll bar.

Figure 11 shows an example where log y is the right transformation to linearity: the fit of the linear approximation is not satisfactory and the residual variance is constantly increasing, so transform the y-variable; the relationship is convex instead of linear, so decrease the power of the y-variable. Using animation, log y was identified as the 'best' transformation.
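The interactive search for the best power can be mimicked non-interactively by scanning a grid of powers, with the logarithm as the limiting case of power zero (as in the Box-Cox family), and measuring the linearity of each transformed relationship by R². The sketch below uses simulated convex data for which log y is linear in x; it is an illustration of the idea, not the animation described above.

```python
import numpy as np

def linearity_r2(x, y):
    """R^2 of the least-squares line of y on x."""
    b, a = np.polyfit(x, y, deg=1)
    return np.corrcoef(y, a + b * x)[0, 1] ** 2

def power_transform(y, p):
    """Power transformation with log as the p = 0 limiting case (Box-Cox style)."""
    return np.log(y) if p == 0 else y ** p

# simulated convex relationship: y = exp(x) with noise, so log y is linear in x
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 3.0, size=80)
y = np.exp(x + rng.normal(0, 0.1, size=80))

for p in (2.0, 1.0, 0.5, 0.0, -0.5, -1.0):
    label = "log y" if p == 0 else f"y**{p}"
    print(f"{label:>7}: R^2 = {linearity_r2(x, power_transform(y, p)):.3f}")
```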

Figure 11. Transformation to linearity: residual plots of the linear approximations of y and of log y

3.4. Smoothing

If a transformation to linearity does not seem to be possible, one might be interested in constructing a smooth (non-linear) curve characterizing the relationship between the two variables. There are linear and non-linear smoothers. With linear smoothers the smoothed value is constructed by a linear function of the observations. Examples are running means, kernel smoothers and even the least-squares line [10]. For example, one version of the running mean smoother averages the y-values of the five nearest neighbours (according to the x-variable) of the x-location at which the smoother has to be computed, e.g. the y-values in the x-location itself and in the two nearest neighbours on the left and on the right. Examples of non-linear smoothers are running medians and LOWESS (locally weighted regression scatterplot smoothing), which determines the smoothed value by weighted regression, assigning lower weights to points with x-values further away from the x-location of interest [11, 12].

Figure 12. Smoothing by repeated LOWESS (spans 9, 9, 9): SEC/TE versus SUMGRD

For other smoothing techniques see e.g. References 1 and 13. All such smoothers define a window in the values of the x-variable and then rely upon a linear or non-linear function of the y-values corresponding to that window. Smoother windows may be defined by the actual number of neighbouring observations (span) or by the percentage of the overall number of observations to be included. Smoothers are often applied repeatedly. Figure 12 shows the relationship between SEC/TE and SUMGRD smoothed by applying LOWESS three times with span 9 each time. The obvious non-linear relationship is represented quite satisfactorily.
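The running mean described above can be written in a few lines, and repeated application follows the same pattern as the repeated LOWESS of Figure 12. The sketch below is pure NumPy with simulated data standing in for SEC/TE and SUMGRD; it uses the simple running mean rather than LOWESS itself.

```python
import numpy as np

def running_mean(x, y, span=5):
    """Running mean smoother: at each x-location the smoothed value is the mean
    of the y-values of the `span` nearest observations along the x-axis."""
    order = np.argsort(x)
    ys = y[order].astype(float)
    half = span // 2
    smooth = np.array([ys[max(0, i - half): i + half + 1].mean()
                       for i in range(len(ys))])
    return x[order], smooth

# simulated non-linear relationship standing in for SEC/TE versus SUMGRD
rng = np.random.default_rng(6)
sumgrd = rng.uniform(1.0, 3.0, size=93)
sec_te = 1.0 / sumgrd + rng.normal(0, 0.05, size=93)

xs, ys = sumgrd, sec_te
for _ in range(3):                      # repeated smoothing, as with LOWESS(9, 9, 9)
    xs, ys = running_mean(xs, ys, span=9)
print(np.column_stack([xs[:5], ys[:5]]))
```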

3.5. Clusters

Clusters, including outliers, are a special kind of non-linear dependence. Clusters may have different shapes and orientations and are most easily identified by eye. A procedure for automatic identification of a fixed number of clusters in any number of dimensions is given e.g. in Reference 14. For an example of clusters in 2D scatterplots indicating an outlier group see Figure 21. Outliers may be looked at as a special kind of cluster. The identification of outliers is often part of 'data cleaning', a common pre-step to data analysis.

3.6. Independence

If there seems to be no dependence in the data, one may check for independence. Independence can be 'graphically tested' by comparing the observed scatterplot with a scatterplot representing the 'distribution under the null hypothesis of independence' [15]. The latter plot is generated by superimposing repeated random permutations of the y-values plotted against the x-values. If the two plots look qualitatively the same, the 'null hypothesis' cannot be rejected. Figure 13 illustrates this comparison with the two 'nearly independent' variables SUMDYE and STRREM (correlation 0.20) using eight superimposed permutations.
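The construction of such a comparison plot is straightforward; the sketch below uses matplotlib with simulated stand-ins for SUMDYE and STRREM and eight superimposed permutations, as in Figure 13.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
# stand-ins for the 'nearly independent' variables SUMDYE (x) and STRREM (y)
sumdye = rng.normal(0, 1, size=93)
strrem = 0.2 * sumdye + rng.normal(0, 1, size=93)

fig, (ax_obs, ax_null) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
ax_obs.scatter(sumdye, strrem, s=10)
ax_obs.set_title("observed scatterplot")

# 'distribution under the null hypothesis': superimpose permuted y-values
for _ in range(8):
    ax_null.scatter(sumdye, rng.permutation(strrem), s=10, alpha=0.4)
ax_null.set_title("eight superimposed permuted scatterplots")
plt.show()
```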


Figure 13. 'Graphical test' of independence: STRREM (y) versus SUMDYE (x)

3.7. Presenting scatterplots

In order to interpret a non-linear dependence between two variables characterized by a curve in a scatterplot, one mainly judges the slopes of line segments to determine the rate of change of one variable as a function of the other variable. Unfortunately, such judgements depend strongly on the properties of the scatterplot, e.g. on the shape of the scatterplot, i.e. the ratio of its height and width.


Cleveland and McGill [16] present a paradigm for graphical perception. They show, among other things, that the judgement of the relative size of the slopes of two line segments is most accurate when the shape of the scatterplot is chosen so that the median absolute slope of the judged line segments is 45°. Using the wrong scatterplot shape might well lead to a misinterpretation of the graph. Cleveland and McGill [16] demonstrate e.g. that only if the scatterplot shape is adequately chosen can one see that the yearly sunspot numbers between 1750 and 1950 rise more rapidly than they fall. Other properties of the scatterplot such as colours, symbols, etc. might also influence the interpretation. So-called viewport transformations such as recentering and rescaling may be very helpful in finding an appropriate scatterplot representation. Buja et al. [15] proposed the most complete class of such transformations. Animated viewport transformations that are particularly valuable for finding the best scatterplot shape are 'inverse expanding and shrinking', i.e. expanding in one direction and shrinking in the other by the same factor, and expanding or shrinking in one direction only. Buja et al. [15] give an impressive example of how trends, periodicities and exotic points of time series may be detected by viewport transformations.

3.8. Summary

Bivariate building blocks mainly include the identification of dependence and independence in scatterplots. Figure 14 shows the proposed analyses, from linear to non-linear dependences, together with the terms discussed with the analyses.

Figure 14. Dependences in scatterplots (linear regression: residual analysis; transformation to linearity: fit, residual variance; smoothing: linear, non-linear; clusters, outliers; independence)

4. MULTIVARIATE EXPLORATORY GRAPHICS

The main goal of multivariate exploratory graphics is structure recognition, i.e. the uncovering of multivariate relationships. The main tools of multivariate exploratory graphics are 3D rotation and scatterplot matrices with brushing. Multivariate exploratory graphics is a very modern research field in statistics. The forefather of all the new methods, PRIM-9 [17], was developed just 20 years ago when computers became fast enough to guarantee real-time motion. Nowadays, powerful modern multivariate exploratory graphics software is available, e.g. systems solely for multivariate exploratory graphics such as JMP [18], MACSPIN [19], SAS/INSIGHT [20] and XGobi [21], but also programming languages including powerful graphical tools such as ISP [22], S-Plus [23] and LISP-Stat [24]. Multivariate exploratory graphics rely heavily on the univariate and especially the bivariate building blocks introduced in Sections 2 and 3. Indeed, essentially the idea of the 2D scatterplot was generalized to higher dimensions. Univariate building blocks are typically used to illustrate the 1D marginal distributions.

4.1. 3D rotation and the grand tour

3D rotation is nothing but a smooth rotation of 2D views (i.e. scatterplots) where the third dimension is indicated by the speed of the individual points during rotation. The further away a point is from the screen, i.e. the greater the co-ordinate orthogonal to the screen co-ordinates, the faster the point rotates.


Figure 15. Structure recognition by rotation: IBM random number generator RANDU

With sufficient computer power the rotation is fairly smooth and the viewer gets the impression of a smooth rotation in 3D. After the rotation has been stopped, the impression is clearly 2D only. 3D rotation may well exhibit structure in 3D data. This can be impressively demonstrated by means of triples of random numbers generated by a linear congruential random number generator, e.g. by means of the IBM generator RANDU: x_{i+1} = 65539 x_i (mod 2^31). Figure 15 presents snapshots of the rotation of the projection on the first two components around the y-axis near that view uncovering the hidden structure: 'Random numbers fall mainly in the planes' [25]. Unfortunately, the impression of parallel planes is very transitory, i.e. some degrees before and after the appropriate rotation angle (84°) one would have difficulties with identification. Thus, with high-speed rotation or after constantly looking for some time, one might overlook what one might be interested in.

This is particularly true for an (uncontrolled) animation technique (see Section 2.4) called the 'grand tour' [26], which tries to come close to presenting 'every 2D scatterplot' during 3D rotations in a high-dimensional space. Indeed, such grand tours suffer from the fact that in dimensions of five or more there is no guarantee of observing existing structure in a reasonable time period. In six dimensions, for example, it takes a minimum of 6 000 000 2D planes to get within 10° (only) of all possible planes [26]. At a rate of ten planes per second, this represents 160 h of viewing time! Thus, although looking at 3D rotations is again and again visually attractive, the usefulness of such a technique may be doubted when there is no selection of views and the viewer is left alone to identify interesting structure. This is also true because data are bound to be sparse in high dimensions and spurious patterns are easily generated by projections.
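The RANDU example can be reproduced directly. In the sketch below the interactive rotation is replaced by an algebraic check: consecutive triples are projected onto the normal direction of the suspected planes (a known property of the RANDU recurrence), which shows the same structure that the rotation of Figure 15 reveals visually.

```python
import numpy as np

def randu(n, seed=1):
    """IBM RANDU generator: x_{i+1} = 65539 * x_i mod 2**31, scaled to (0, 1)."""
    out = np.empty(n)
    x = seed
    for i in range(n):
        x = (65539 * x) % 2**31
        out[i] = x / 2**31
    return out

u = randu(3000)
x, y, z = u[:-2], u[1:-1], u[2:]        # consecutive triples (x_i, x_{i+1}, x_{i+2})

# Instead of interactive rotation, project the triples onto the normal direction
# of the suspected planes: 9x - 6y + z is always an integer for RANDU, so the
# triples lie on at most 15 parallel planes in the unit cube.
c = 9 * x - 6 * y + z
print(np.unique(np.round(c, 10)))       # only a handful of integer values appear
```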

4.2. Guided tours

In the literature there are different suggestions for overcoming the 'grand tour' problem. On the one hand, various strategies for view selection are proposed, e.g. scatterplot matrices of the original variables, guided tours [27] and the OMEGA strategy [3-5]. The scatterplot matrix combined with brushing is a very powerful, purely graphical tool for analysing not only the original observed variables but also the selected transformed co-ordinates typical for guided tours and the OMEGA strategy; this will be discussed in Section 4.3.


Guided tours and the OMEGA strategy are similar in that 'criterion based' dimension reductions are carried out to identify interesting subspaces. This will be discussed in Section 5.

On the other hand, automated structure recognition is proposed to reduce the problem of overlooking something as rare as structure among so much noise. One method proposed is exploratory projection pursuit, which tries to find low-dimensional projections that provide the most revealing views of high-dimensional data [28]. Since linear dependences are directly captured by the correlation structure, the emphasis of projection pursuit is on the discovery of non-linear dependences, including clusters. In order to quantify the rather vague notion of non-linearity, there are strong heuristic arguments to consider the most non-normal projections [29]. To measure this, different 'projection indices' were proposed, including well-known normality tests such as Kolmogorov's as well as newly developed ones such as Friedman's [30]. Projection pursuit methods share the emphasis on non-linearities with guided tours and the OMEGA strategy but will not be discussed here.

4.3. Scatterplot matrix

As mentioned above, the scatterplot matrix is another purely graphical tool for multidimensional structure recognition. Scatterplot matrices present more than one view at a time: different 2D scatterplots are shown simultaneously to improve the understanding of high-dimensional data. The basic idea is that of a generalized draftsman's plot (front, top and side views) with co-ordinate-reflected views added [31]. A scatterplot matrix of k variables is a 'matrix' of graphs with a scatterplot of the ith variable against the jth as the (i, j)-entry of the 'matrix', i ≠ j, and with the diagonal free for additional information. Since this results in k² scatterplots, such an approach is only practical if k is not too large. The diagonal may be utilized to display univariate distributions by means of one of the univariate graphics discussed in Section 2, i.e. stem & leaf displays, histograms, boxplots or quantile plots. Also, there is no need to present the 'pure' scatterplots of the original variables. Each of the bivariate graphics discussed in Section 3 may be utilized, i.e. scatterplots with regression line, residual plots, non-linear transformations, smoothed scatterplots or graphical tests of independence.

In Figure 16 the scatterplot matrix of five of the dyestuff production variables of Example 1 in Section 1.2 is displayed. The diagonal is used to present a simple form of stem & leaf display of the involved variables. From the (1,2)-entry of the scatterplot matrix one may suspect that there is a non-linear relationship between SUMGRD and SEC/TE (see Section 3.4). This is also supported by the symmetry of the distribution of SUMGRD and the non-symmetry of the distribution of SEC/TE. Other possible relationships appear to be obscured by noise. One slight difficulty with stem & leaf displays on the diagonal of scatterplot matrices is proper scaling such that there is no overlap with the scatterplots. However, the number of branches can be controlled by concatenation (e.g. as with SUMGRD) or splitting (as with HUEVIAL), and the length of the branches can be controlled by displaying up to a maximum length only and indicating the number of left-out digits by '+ number' (e.g. as with HUEREMAL).

However, it is the visual link of corresponding points on the different scatterplots, the so-called brushing, which makes scatterplot matrices such a powerful tool for data analysis. With brushing [32], a rectangle called a brush is superimposed on one of the scatterplots. This brush can be moved to different positions on the scatterplot matrix. There are four basic brushing operations: highlighting (or colouring), shadow highlighting (or shadow colouring), deleting and labeling. Each operation is not only carried out on the plot upon which the brush is superimposed but also on the corresponding points of all other scatterplots (linking).
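A basic scatterplot matrix with univariate displays on the diagonal can be assembled from standard plotting primitives. The sketch below uses matplotlib with random stand-in data (histograms rather than stem & leaf displays on the diagonal); brushing and linking require an interactive system and are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
names = ["SEC/TE", "SUMGRD", "SUMDYE", "STRREM", "HUEREMAL"]   # illustrative labels
data = rng.normal(size=(93, 5))
# inject one non-linear relation so that one panel shows visible structure
data[:, 1] = 1.0 / (np.abs(data[:, 0]) + 1.0) + rng.normal(0, 0.02, 93)

k = len(names)
fig, axes = plt.subplots(k, k, figsize=(10, 10))
for i in range(k):
    for j in range(k):
        ax = axes[i, j]
        if i == j:
            ax.hist(data[:, i], bins=15)        # univariate display on the diagonal
        else:
            ax.scatter(data[:, j], data[:, i], s=5)
        if i == k - 1:
            ax.set_xlabel(names[j])
        if j == 0:
            ax.set_ylabel(names[i])
plt.tight_layout()
plt.show()
```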

Figure 16. Scatterplot matrix of five dyestuff production variables, with stem & leaf displays on the diagonal

Typically, each operation can be carried out in one of three paint modes: transient, lasting and undo. Also, the size and shape of the brush may be changeable. With highlighting, the points underneath the brush are highlighted, at the same time highlighting the corresponding observations in the other scatterplots. With shadow highlighting, all highlighted points appear in all scatterplots except the one in which the brush is acting. This reduces overlapping and may help in identifying highlighted observations. With deletion and labeling, the points underneath the brush are deleted or labeled respectively. In transient mode the brushing operation is undone automatically after the brush has left a point. In lasting mode the result of a brushing operation lasts until it is explicitly undone in undo mode.

In Figures 17(a) and 17(b) the scatterplot matrix of Figure 16 (without stem & leaf displays) is shown after brush erase, in the former the matrix with the superimposed brush (first layer) and in the latter the erased points only, which are stored in a 'second layer' of the matrix. With brush erase those observations are erased which correspond to 'very slightly greener' visual judgments of hue deviation (HUEVIAL). The second layer shows that the corresponding observations of the other variables are more or less central, too (see Section 2).

Becker et al. [33] compared brushing a scatterplot matrix with rotation analysis of multivariate data.


Figure 17. (a) Scatterplot matrix after brush erase (first layer). (b) Scatterplot matrix after brush erase (second layer)

Their conclusions were that rotation is most revealing in those cases where higher-than-two-dimensional structure is present, whereas brushing is particularly appropriate for studying dependences of two of the chosen variables for different values of a third variable.

The concept of 'square' scatterplot matrices can easily be extended to non-square matrices, even with different sets of variables representing rows and columns. Here the (i, j)-entry of the matrix is the scatterplot of the ith row variable against the jth column variable. In this case there is no free diagonal for additional information, but the same graphics may be placed at the plot margins [23].


4.4. Summary

3D rotation and scatterplot matrices are purely graphical tools used to uncover structure in high-dimensional data. Both methods are especially useful when applied to variables which might be suspected to relate in interesting ways. In the next section such 'special views' will be constructed by means of certain criteria which might stand for 'interestingness'. Figure 18 shows the two methods together with the terms discussed with them.

Figure 18. Multivariate exploratory graphics (3D rotation: grand tour, guided tours; scatterplot matrix: brushing)

5. EXPLORATION BY CRITERION: DIMENSION REDUCTION

In order to work efficiently with multivariate graphical techniques, clever dimension reductions would be very helpful (see Sections 4.1 and 4.2) - but what criteria should such a dimension reduction fulfil in order to improve the probability of finding interesting structure?

5.1. Dimension reduction methods

One possible criterion is used with principal components analysis (PCA): identify the direction in space in which the data vary most; then do the same in the space orthogonal to the first direction, etc., until the new co-ordinates generated this way, called principal components, explain an 'adequate portion' of the overall variation in the data, i.e. of the sum of the sample variances of all involved variables [34]. Since PCA is such a popular dimension reduction method, it is mainly discussed in this section. However, note that there are at least two versions of PCA and that there are competitors. Both PCA versions work on mean-centred variables. One PCA version works with the original variables in their original scale (PCA on covariances, PCA-COV), the other with variables standardized to standard deviation one (PCA on correlations, PCA-COR). Competitors are all multivariate methods leading to dimension reductions by different criteria. Examples are factor analysis and all the methods discussed in Section 6, where dimension reduction is a byproduct of looking for good prediction models.

The analogue to PCA for qualitative data is multiple correspondence analysis (MCA). A very recent application of this method is the comparison of the distribution of the granules of ten commercial starch species by means of a 2D representation of six characteristics obtained by image analysis [35]. Sometimes MCA is called homogeneity analysis, in particular in psychometric applications where qualitative (categorical) data are quite typical. In this area non-linear PCA was also developed. Another key reference for linear and non-linear dimension reduction methods is Reference 36.

In Sections 5.2 and 5.3 the general procedure of the search for and interpretation of structure with principal components analysis will be discussed. An example will be given in Section 5.4.
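Both PCA versions can be written down in a few lines of NumPy via the eigendecomposition of the covariance (or correlation) matrix. The sketch below is generic and uses random stand-in data rather than the dyestuff variables.

```python
import numpy as np

def pca(X, on_correlations=False):
    """PCA on mean-centred data; optionally standardize first (PCA-COR)."""
    Xc = X - X.mean(axis=0)
    if on_correlations:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Xc @ eigvec                           # projections on the components
    explained = eigval / eigval.sum()              # proportion of variance
    return scores, eigvec, explained

rng = np.random.default_rng(10)
X = rng.normal(size=(93, 6))
scores_cov, loadings_cov, expl_cov = pca(X)                         # PCA-COV
scores_cor, loadings_cor, expl_cor = pca(X, on_correlations=True)   # PCA-COR
print(np.cumsum(expl_cov).round(3))    # cumulative goodness-of-fit
```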


5.2. Stability of structure

PCA dimension reduction leads to a (hopefully) low-dimensional space representing most of the variation in the data (say 95%). Such a dimension reduction is only sensible, however, if the corresponding projection of the data (scores) is stable with respect to small changes in the database. Only then can a possible structure be trusted to be real. Thus one tries to determine the number of dimensions in such a way that the 'pointwise regions of variation' generated by superimposing 'resampled' projections are small relative to the overall variation. Resampling refers to drawing different samples from the original data set and applying PCA or any other multivariate method to these samples. One possible realization is the generation of N samples of size N - 1 by leaving out one observation after the other from the original sample (cross-validation).

Having generated repeated projections by resampling, one has to determine the corresponding pointwise regions of variation. Unfortunately, the different projections will very probably correspond to different co-ordinates, i.e. principal components. However, since one is only interested in 'qualitative stability', one may accept that two projections are alike if they only differ by rotation, translation or overall scaling. Indeed, such transformations may be interpreted as just identifying viewpoints from which the projections look the same. Thus, in order to be able to compare the different 'resampled' projections, a so-called Procrustes analysis should be applied to each such projection, i.e. each projection should be transformed individually by that rotation, translation and overall scaling generating the best coincidence with the projection based on the complete original sample [37]. After having superimposed the projections transformed in this way, pointwise regions of variation are estimated based upon a normality assumption for the replicates of the individual points. Thus the 100(1 - α)% regions of variation, 0 < α < 1, are ellipsoids centred at the sample means with size and orientation depending on the sample covariance matrices. Having drawn the ellipsoids in a scatterplot (matrix), one may easily judge stability by comparing their size with the overall variation in the projected data. This kind of stability analysis is part of the OMEGA strategy [4].

After having generated a 'stable' projection, one can look for interesting structure. Possible structures are 'clusters' and 'non-linear functional relations' such as arcs, but never linear relationships because of the orthogonality of the principal components. If the number of dimensions is small enough (one to three), pure inspection using the tools in Sections 2-4 may be adequate for the search for structure. Also, the bivariate building blocks in Section 3 can be used for the identification of non-linear dependences and independences.
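A rough sketch of such a stability check follows: leave out one observation at a time, recompute the first two principal components, align each resampled projection with the full-sample projection by a rotation, translation and overall scaling, and inspect how much the aligned scores spread per observation. The alignment uses SciPy's orthogonal Procrustes solution; this is a simplified stand-in for the procedure of Reference 37, not a reimplementation of it, and the data are random.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def pc_scores(X, ncomp=2):
    """Scores on the first ncomp principal components (PCA-COV)."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    eigvec = eigvec[:, np.argsort(eigval)[::-1][:ncomp]]
    return Xc @ eigvec

rng = np.random.default_rng(11)
X = rng.normal(size=(60, 8))                     # stand-in data set
ref = pc_scores(X)                               # full-sample projection

aligned = np.full((len(X), len(X), 2), np.nan)   # replicate scores per observation
for leave_out in range(len(X)):
    keep = np.delete(np.arange(len(X)), leave_out)
    scores = pc_scores(X[keep])
    # Procrustes: rotate, scale and translate the resampled scores onto the
    # corresponding rows of the reference projection
    A = scores - scores.mean(axis=0)
    B = ref[keep] - ref[keep].mean(axis=0)
    R, s = orthogonal_procrustes(A, B)
    scale = s / (A ** 2).sum()
    aligned[leave_out, keep] = scale * A @ R + ref[keep].mean(axis=0)

# pointwise variation of each observation across the resampled projections
spread = np.nanstd(aligned, axis=0)
print("median pointwise std:", np.median(spread).round(4))
print("overall score std   :", ref.std(axis=0).round(4))
```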

5.3. Interpretation of structure

After having found interesting structure, one normally wants to interpret it. However, since principal components are in general linear combinations of all original variables, such interpretation might not be easy in most cases. One possible way out is to identify for each principal component those original variables which 'explain' most of the variation in the component. This can be realized by a stepwise (forward selection) regression procedure optimizing the goodness-of-fit. A simpler, but probably less efficient, procedure is importance ranking, i.e. the ranking of the original variables by the size of the absolute values of their loadings in the linear combination representing the principal component. In both cases the first, the first two, etc. of these variables are fitted to the principal component until the goodness-of-fit is better than 95% or the fit improvement is minor. In both cases the PCA has to be repeated with the important original variables only, in order to identify the new loadings, because of possibly high correlation between the original variables.


A procedure using importance ranking is part of the OMEGA strategy. A simple graphical check is provided by scatterplots of the original versus the simplified principal components after Procrustes analysis. If such plots are near to diagonal, the simplification should be acceptable.

A purely graphical tool for identifying 'important' original variables is the biplot [38]. This is a special scatterplot (matrix) representing not only the projections (scores) on two principal components (each) by points but also the 2D loadings of the original variables by lines. The longer such a line is in the direction of one principal component, the more important is this variable for that principal component. The relation between the biplot and the above importance ranking should be obvious.
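Importance ranking itself is straightforward to sketch: rank the original variables by the absolute loadings of a component, then regress the component on the first one, two, ... of these variables until the fit exceeds 95%. The code below is a generic NumPy illustration with random data, not the dyestuff analysis.

```python
import numpy as np

def importance_ranking(X, component_scores, loadings, fit_target=0.95):
    """Rank variables by |loading| and add them one by one until the
    least-squares fit of the component reaches the target goodness-of-fit."""
    order = np.argsort(np.abs(loadings))[::-1]
    chosen = []
    for var in order:
        chosen.append(var)
        A = np.column_stack([np.ones(len(X))] + [X[:, v] for v in chosen])
        coef, *_ = np.linalg.lstsq(A, component_scores, rcond=None)
        r2 = np.corrcoef(component_scores, A @ coef)[0, 1] ** 2
        print(f"variables {chosen}: R^2 = {r2:.4f}")
        if r2 >= fit_target:
            break
    return chosen

# stand-in data and a PCA-COV first component
rng = np.random.default_rng(12)
X = rng.normal(size=(93, 6)) * np.array([5.0, 1.0, 0.5, 3.0, 0.2, 0.1])
Xc = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1_loadings = eigvec[:, np.argmax(eigval)]
pc1_scores = Xc @ pc1_loadings
importance_ranking(Xc, pc1_scores, pc1_loadings)
```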

All the above ideas of dimension reduction by PCA will be illustrated by applying PCA-COV to all 29 variables of the dyestuff example in Section 1.2. Since the scales of the observed Table 3. Dimension reduction and loadings ~~

Goodness-of-fit (070) 56 93 96 98 99 100 100

Variable 1 2 3 4 5

6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

PC 1 (scaling factor of loadings 10**0)

TERCUP DNANDI DNBZDI SECMC TERTMC PRIMSC SECSC DNSEC TERTSC SUNKUV SUMUV SUMRED SUMGRD SUNKDY SUMDYE TOTORG SEC/TE LMBDAC STRVI STRREM STRTRA HUEVI HUEVIAL HUEREM HUEREMAL BRIVI BRIVIAL BRIREM BRIREMAL

0.0003

PC2 - 0.0043

- 0.0057

- 0.0024

0.0021 - 0.0174 0.0253 -0*0020 - 0.0041 - 0~0001 -0.0047 0.0104 0.0006 - 0.0234 0.0205 - 0.0078 -0.6829 - 0 * 0996 -0.0025 0.0022 -0.4309 - 0.4323 -0.3859 -0.0007 -0.0024 -0*0001 -0.0008 0.0019 0.0035 0.0005 0~0010

- 0-0067

-0-0047 0.0328 - 0.0033 -0.0007 0 * 0007 -0.0081 - O.OOO7

-0.0158 - 0.0084 0-0245 0.0003 -0.7060 - 0.1499 -0.0017 0.0014 0.4106 0.4180 0.3656 0.0002 - 0.0049 - 0.0002 - 0.0009 0 * 0004 0*0040 0-0008 0.0007


Figure 19. Largest principal components

Table 4. Stability: means and standard deviations of principal components. Means and standard deviations (Std), over the cross-validation resamples, of the PC1 and PC2 loadings of the 29 variables listed in Table 3; all standard deviations are below 0.01.


Figure 20. (a) Stability check by superimposed projections. (b) Variation of one object: 90% ellipse

Since the scales of the observed variables appear to be very different and the data are known to fall into two categories, namely analytical properties and quality measures, such an overall analysis of raw, unscaled data might appear unwise a priori. Note, however, that the analysis is supposed to be exploratory and any kind of structure is of interest, and that experience shows that such an analysis often leads to valuable insights into the data structure.

Table 3 shows the loadings of the original variables in the first two principal components, which explain 93% of the overall variance in the data (goodness-of-fit). Figure 19 shows the corresponding projection (scores). In order to confirm the apparent structure (three parallel clusters), a stability analysis is carried out leading, after Procrustes analysis, to Table 4 and Figure 20(a). As suspected from the standard deviations (Std) of the loadings in Table 4, the variation in the scores is very small, leading to nearly no difference between Figures 19 and 20(a). That there is indeed variation is shown in Figure 20(b), which is a zoomed version of the marked region of Figure 20(a). Note that the variation is so small in Figure 20(a) that even the 90% ellipses are not identifiable.

After having checked stability, the structure in Figure 19 is believed to be three line-shaped parallel clusters.


Figure 21. Structure: clusters

As is partly indicated in Figure 21, they belong to three successive parts of production: the first twelve batches lying bottom left, the next 34 in the next parallel cluster and the rest in the top right ‘line’ (except batch 84, which is the batch marked in Figure 20(a)).

With importance ranking two original variables were identified which explain around 99% of the variation in the data (Table 5). This is confirmed by the biplot in Figure 22, which also illustrates that the shift of the parallel clusters is along the ‘SUMDYE axis’.

Table 5. Simplification by importance ranking

                    PC1                         PC2
    # var.    Variable    R2            Variable    R2
      1       SUMDYE      0.5858        SUMDYE      0.4124
      2       STRREM      0.9917        STRREM      0.9882
      3       STRVI       0.9931        STRVI       0.9894
      4       STRTRA      0.9995        STRTRA      0.9986
      5                                 TOTORG      0.9999

Figure 22. Biplot


Figure 23. (a) Original components and simplifications before Procrustes analysis. (b) after Procrustes analysis

Figure 24. Important variables

Figure 25. PCA example (summary of the steps: analysis on covariances (3 clusters?); cross-validation and Procrustes analysis: very small regions of variation; 3 parallel clusters belonging to succeeding parts of production; simplification: 2 important variables SUMDYE, STRREM; biplot: direction of clusters along SUMDYE; independence of SUMDYE and STRREM)

The scatterplot matrices in Figures 23(a) and 23(b) show (among other things) that original and simplified principal components only coincide after Procrustes analysis (note the deviations from the ‘ideal’ diagonals), but produce ‘qualitatively’ the same projections also before Procrustes analysis.

Finally, the two original variables identified as important are plotted against each other (Figure 24). A quick look leads to the suspicion of independent factors, which is supported by a graphical test of independence (Figure 13). This was reported to the producer, who was somewhat annoyed by this result since he had expected a strong relationship between the sum of dyestuffs in slurry (SUMDYE) and the color strength on polyester (STRREM); but indeed, reconsideration led to the insight that SUMDYE was measured too early, namely before a strength-adjusting dilution was carried out. The example is summarized in Figure 25.

5.5. Summary

PCA and other dimension reduction methods deliver special views of high-dimensional data. After having confirmed their stability under resampling, a structure possibly identified in such a view might be interpreted by means of loading simplification. The essential steps of the procedure are also illustrated in Figure 25.

6. PREDICTION

6.1. Prediction methods

Finding prediction models is the second important goal of multivariate exploratory data analysis. Such models build a bridge from exploratory data analysis to confirmatory data analysis. What one tries to find are relationships optimally predicting a set of responses by a set of predictors. If no theoretically founded model is available that has to be confirmed, one first has to look for a candidate model by exploration.

All that follows applies to prediction models which are linear in the parameters to be estimated but possibly non-linear in the predictor variables. Models may thus include not only the original predictor variables but also non-linear functions of them such as transformations of individual predictor variables or interactions, i.e. products of pairs of predictor variables. For ease of notation the term predictor will be applied to the original predictor variables as well as to transformations included in the model. Note that it is assumed that the set of predictors is known before the analysis. Thus transformations of interest have to be fixed in a pre-step.

Good prediction is sometimes confused with good fit. Indeed, small model residuals for the objects upon which the model is based do not automatically imply small model residuals for


new objects, i.e. for objects upon which the model is not based. If the goal were good fit, then the optimal method would be clear: multivariate multiple regression (MMR) delivers the optimal fit according to the least-squares criterion. For prediction, however, the situation is not that clear. MMR might lead to ‘overfitting’, and other multivariate techniques delivering poorer fits might lead to better predictions.

As competitors to MMR, in this paper the following ‘prediction methods’ will be discussed: canonical correlation analysis (CCA), (multivariate) partial least-squares regression (PLS), principal components regression (PCR) and redundancy analysis (RDA). CCA, PLS and RDA identify linear combinations of the predictors which maximize correlation or covariance with corresponding linear combinations of all responses or minimize (Euclidean) distance to the vector of responses respectively. PCR applies MMR to the outcome of PCA applied to the predictors. MMR corresponds to multiple regressions on all predictors individually for all responses. For more detailed information about these methods see References 39-41. Note that the way PCR is understood here is sometimes criticized, since the choice of the principal components to be included in the model is not based on the responses. Instead, it is proposed to select principal components relevant for prediction by means of the correlations between a response and the principal components (see Reference 42).

All these methods apply to continuous responses. If there is a response classifying the objects into a small number of classes, then (canonical) linear discriminant analysis (LDA) can be used to predict this classification by means of as few as possible linear combinations of the predictors. Non-linear versions of CCA are e.g. OVERALS and CANALS (see Reference 7). These methods are particularly adapted to qualitative responses and predictors. For continuous responses Friedman (Reference 43) proposed MARS (multivariate adaptive regression splines) as a method particularly adapted to models with interactions. Non-linear alternatives to LDA are quadratic discriminant and class-modelling methods (see Reference 44). An example of the performance of such ‘parametric’ alternatives to LDA and some ‘non-parametric’ competitors such as convex hulls is given in Reference 45.
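As a concrete illustration of one of these methods, the following is a minimal sketch of PCR for a single response (hypothetical code, assuming NumPy): PCA on the predictors followed by least-squares regression on the scores, exactly in the sense described above.

    # Hypothetical sketch of principal components regression (single response):
    # PCA on the predictors, then regression of the response on the first k scores.
    import numpy as np

    def pcr_fit_predict(X_train, y_train, X_new, k):
        """Fit PCR with k components on (X_train, y_train) and predict for X_new."""
        mx = X_train.mean(axis=0)
        Xc = X_train - mx
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        V = Vt[:k].T                              # loadings of the first k PCs
        T = Xc @ V                                # scores of the training objects
        A = np.column_stack([np.ones(len(T)), T])
        coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        T_new = (X_new - mx) @ V                  # project new objects onto the same PCs
        return np.column_stack([np.ones(len(T_new)), T_new]) @ coef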

6.2. Prediction rules

The predictive power of the methods can be measured by a comparison of predictions and observed values. One way to get predictions for the complete sample is again cross-validation: leaving out one observation after the other of the original N observations, generating a ‘prediction rule’ by means of the other N - 1 observations, and predicting the left-out observation using this prediction rule. In the case of a classified response the misclassification rate is a measure of predictive power; in the case of continuous responses the measure is the cross-validated R2 (R2cv), i.e. the proportion of the variance of the responses explained by the predictions. For the relationship of R2cv to other measures of ‘predictive ability’ see e.g. Reference 46.

The construction of prediction rules typically starts with a dimension reduction in predictor space leading to new co-ordinates. Then, in the case of continuous responses, the rules may be determined by regressing the responses one-by-one on these new co-ordinates. Note that in this way different portions of the information in the responses are used for the prediction of one response (see Section 6.1).
(i) The methods CCA, PLS and RDA use information about all responses, since all responses are involved in the selection of the corresponding co-ordinates.
(ii) The methods MMR and PCR only use information about the response to be predicted,


since the selection of the co-ordinates used in the regressions is independent of all responses.

For classified responses prediction rules are constructed one-by-one for each response, dividing the predictor space into sections, each corresponding to one and only one of the classes. A method like this for the determination of predictive power can be found in Reference 4.

Note, however, that besides the above prediction-generating cross-validation, in the literature resampling is also proposed for the determination of the number of dimensions in predictor space and of prediction rules. For example, for the number of dimensions cross-validation was proposed (Reference 47) and for prediction rules the jackknife and the bootstrap (Reference 48). This might lead to a nested three-stage resampling scheme as follows. During prediction-generating cross-validation one observation is temporarily left out, creating the actual subsample on which the prediction rule generation is based. For prediction rule generation from this subsample bootstrap subsamples are drawn, for each of these subsamples a prediction rule is generated and the corresponding ‘mean’ prediction rule is used instead of the prediction rule based on the actual cross-validation subsample only. Moreover, during the determination of the prediction rule for each of the bootstrap subsamples a further cross-validation is used to determine the relevant dimension of predictor space, namely that dimension with the maximum predictive power on the analysed bootstrap subsample. In other words, from each subsample of prediction-generating cross-validation bootstrap subsamples are drawn, and for each bootstrap subsample and each predictor space dimension another cross-validation is carried out. Prediction rules are generated for each bootstrap subsample and each predictor space dimension. The best predictor space dimension is selected individually for each bootstrap subsample by comparing the cross-validated predictive power for the different dimensions. The corresponding prediction rules are ‘averaged’ over the bootstrap subsamples in order to generate the prediction rule for the actual subsample of prediction-generating cross-validation. This ‘mean’ prediction rule is then used to generate the prediction for the left-out observation.

Large data sets might well be split into a learning set and a test set. Then, instead of using cross-validation, the prediction rules might be based on the learning set and the predictions might be generated for the test set. In this case the above scheme is only two-stage: the bootstrap subsamples are drawn from the learning set and the determination of predictor space dimension is based on the test set. An example of the learning set / test set method using jackknifing and bootstrapping for an LDA subsectioning of the predictor space can be found in Reference 49. In this case the ‘mean’ prediction rule relates to a subsectioning based on the means of the estimated group means over all bootstrap subsamples and on the corresponding tolerance regions. Since such multistage resampling schemes generally result in an intolerable amount of computer time, they are normally avoided.
In what follows, prediction-generating cross-validation is used, the number of dimensions in predictor space is determined only once based upon the complete sample and fixed by a goodness-of-fit criterion (95% of the total predictor space variance, say), and the prediction rules are not resampled but are determined only once, based on the complete sample and on each subsample in prediction-generating cross-validation respectively.
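As an illustration of prediction-generating cross-validation, the following hypothetical sketch computes R2cv for one continuous response, given any function that builds a prediction rule from a training sample (for example the PCR rule sketched in Section 6.1):

    # Hypothetical sketch of leave-one-out prediction-generating cross-validation
    # and the resulting R2cv for a single continuous response.
    import numpy as np

    def r2_cv(X, y, fit_rule):
        """fit_rule(X, y) must return a function mapping a predictor vector to a prediction."""
        n = len(y)
        preds = np.empty(n)
        for i in range(n):
            keep = np.delete(np.arange(n), i)      # leave out observation i
            rule = fit_rule(X[keep], y[keep])      # prediction rule from the other N-1
            preds[i] = rule(X[i])
        ss_res = np.sum((y - preds) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot               # proportion of variance explained

This is one common way of computing the proportion of response variance explained by the cross-validated predictions; for a classified response the misclassification rate over the left-out observations would be used instead.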

6.3. Dimension reduction in predictor space

Since dimension reduction is an essential part of the above procedure for prediction model construction, all the prediction methods could also be (mis-?)used for the search for structure


in the predictor space. In this respect the prediction methods are competitors to PCA in the predictor space (see Section 5.1). Note that PCA is the dimension reduction part of PCR.

6.4. Variables selection

Up to now the complete set of predictors has been used for determining predictive power. However, there are reasons not to use a larger set of predictors than necessary.
(i) Prediction stability: the variance of the predictions is proportional to the number of predictors (at least for MMR).
(ii) Multicollinearity: if the predictors are nearly collinear, predictions might become unstable.
(iii) Interpretation: it is easier to interpret results from parsimonious models.

The easiest way to solve the resulting problem of variables selection among the predictors is to apply one of the simplification procedures of Section 5.3 to the predictor space, namely stepwise regression or importance ranking. This is demonstrated in the examples in Section 6.6. For PCA (and PCR) importance ranking relates to ranking the original predictors by the absolute value of their loadings (see Section 5.3). For PLS the same ranking procedure can be motivated, whereas for the other prediction methods an approximate importance ranking has to weight such values by appropriate standard deviations (see Reference 4). These approximations, however, cannot really be recommended, since they are only valid if the correlations between the original variables are not ‘high’. An example of using cross-validation to optimally determine the predictors with the methods CCA, RDA and MMR can be found in Reference 50.
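A compact sketch of the stepwise (forward selection) variant is given below (hypothetical code, assuming NumPy; the target y may be a response or a derived component such as the first PLS component):

    # Hypothetical sketch of forward selection: at each step add the predictor that
    # most increases R2, and stop when the improvement becomes minor.
    import numpy as np

    def forward_selection(X, y, min_gain=0.01):
        """Return the indices of the selected predictors, in order of inclusion."""
        remaining = list(range(X.shape[1]))
        selected, best_r2 = [], 0.0

        def r2_of(cols):
            A = np.column_stack([np.ones(len(y)), X[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            return 1.0 - resid.var() / y.var()

        while remaining:
            r2_new, j_best = max((r2_of(selected + [j]), j) for j in remaining)
            if r2_new - best_r2 < min_gain:
                break                              # no predictor improves the fit enough
            selected.append(j_best)
            remaining.remove(j_best)
            best_r2 = r2_new
        return selected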

6.5. Graphical comparison of outcomes

Since one does not know a priori which prediction method is ‘best’, the comparison of the results is essential. Here too, graphical comparison leads to a better-founded choice than just comparing the summary statistic ‘predictive power’. In principle, the results of two methods can be graphically compared in two ways:
(i) parallel views
(ii) superposition and interpolation.

Typical applications are
(i) parallel views of ‘diagonal plots’, i.e. of plots of predictions versus observations
(ii) superposition and interpolation of prediction residuals.

With parallel views the plots corresponding to the different methods are displayed in parallel. With superposition the plots are superimposed; outcomes of different methods are marked differently and are line-connected (‘interpolation’). Obviously, these graphical comparison methods can also be applied to the outcomes of the dimension reduction methods for the comparison of different projections.
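A minimal sketch of the ‘parallel views’ idea (hypothetical code using matplotlib; method names and predictions are supplied by the caller):

    # Hypothetical sketch: one diagonal plot (predictions versus observations) per
    # prediction method, side by side, with the ideal diagonal drawn in each panel.
    import matplotlib.pyplot as plt

    def parallel_diagonal_plots(y_obs, predictions_by_method):
        """predictions_by_method: dict mapping a method name to its predictions."""
        fig, axes = plt.subplots(1, len(predictions_by_method),
                                 sharex=True, sharey=True, squeeze=False)
        axes = axes[0]
        lo, hi = min(y_obs), max(y_obs)
        for ax, (name, y_pred) in zip(axes, predictions_by_method.items()):
            ax.scatter(y_pred, y_obs, s=10)
            ax.plot([lo, hi], [lo, hi])            # ideal diagonal
            ax.set_title(name)
            ax.set_xlabel('prediction')
        axes[0].set_ylabel('observation')
        return fig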

6.6. Examples

All the above ideas for prediction model generation and comparison will be demonstrated by means of some examples using the data introduced in Sections 1.2-1.4. Before application of


the prediction methods, all variables are standardized to standard deviation one. In the reported models, however, the variables are unscaled. Note that in all examples the predictors considered only include the original predictor variables, i.e. no transformations (see Section 6.1).

The first example relates to the dyestuff production data (see Section 1.2) and is an improved version of results in Reference 4. The question of interest to the producer is the following: can the hue differences of dyestuff batches from the standard be predicted from the analytical measurements of the batches and, if so, how? The motivation behind this question is the desire to control the production process in order to produce dyestuffs with the required hue. Batches which fail to come up to standard require costly and time-consuming corrective treatment. Let us thus try to find predictive models for hue differences as derived from measured reflectance spectra for daylight (HUEREM) and artificial light (HUEREMAL). Since these responses are continuous, the competing prediction methods are CCA, MMR, PCR, PLS and RDA. As predictors the 18 analytical variables are used.

Table 6 clearly shows the inadequacy of R2 for selecting the best model for prediction. With this criterion, naturally, the MMR model would have always been chosen, whereas the predictive power (represented by R2cv) of the PLS model is distinctly best for HUEREMAL. Also, the predictive power would have been very much overestimated by using R2 as a measure. Note that PCR is always only slightly worse than PLS but that PLS is distinctly better than the CCA used in Reference 3. Figure 26 shows parallel plots of observations versus 1D predictions of the different prediction methods. Note that these plots might suggest that for HUEREM the variation of the predictions tends to be much smaller than the variation of the original observations and that for HUEREMAL the lower predictive power is mainly caused by some ‘outliers’.

Anyway, the PLS 1D projection PLS1 of the analytical variables was chosen for model building. Importance ranking leads to a simplified first PLS component PLS1S composed of only three original variables (goodness-of-fit > 0.99) and the following models:

HUEREM = -0.0808 + 0.0827 * SUMRED - 0.0835 * SUMGRD + 0.0113 * SUMDYE
    ( = -0.0808 + 0.1515 * PLS1S)

HUEREMAL = -0.2073 + 0.1755 * SUMRED - 0.1773 * SUMGRD + 0.0239 * SUMDYE
    ( = -0.2073 + 0.3217 * PLS1S)

where

SUMRED = sum of all red dyestuff components
SUMGRD = sum of all green dyestuff components
SUMDYE = sum of all colored dyestuff components
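A hypothetical usage sketch of these models (the analytical values below are invented for illustration only and are not data from the example):

    # Hypothetical usage of the models above; SUMRED, SUMGRD, SUMDYE in the
    # unscaled original units of the reported models.
    def predict_hue(sumred, sumgrd, sumdye):
        huerem = -0.0808 + 0.0827 * sumred - 0.0835 * sumgrd + 0.0113 * sumdye
        hueremal = -0.2073 + 0.1755 * sumred - 0.1773 * sumgrd + 0.0239 * sumdye
        return huerem, hueremal

    print(predict_hue(sumred=1.2, sumgrd=0.9, sumdye=60.0))  # illustrative values only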

Table 6. Example 1: predictive power (R2cv) and fit (R2) of PCR, PLS, CCA and RDA with one dimension, PCR and PLS with two and eight dimensions, and MMR, for the responses HUEREM (hue, reflectance spectra, daylight) and HUEREMAL (hue, reflectance spectra, artificial light). The 1D R2cv values are PCR 0.38, PLS 0.41, CCA 0.36, RDA 0.41 for HUEREM and PCR 0.70, PLS 0.72, CCA 0.60, RDA 0.58 for HUEREMAL; MMR gives 0.39 and 0.61 respectively.


Figure 26. Parallel views: original variable (y) versus 1D predictions (x) for hue daylight and hue artificial light. R2cv per method (daylight, artificial light): MMR 0.39, 0.61; PCR 0.38, 0.70; PLS 0.41, 0.72; CCA 0.36, 0.60; RDA 0.41, 0.58.

Note the similarity of the resulting models in the original variables: HUEREMAL is about twice HUEREM. Indeed, similar models could have been hoped for, since hue under daylight and artificial light should be very much dependent. Also, the models have to be similar because of their structure, but there is no formal need for a similarity like this. Since the PLS 1D R2cv values for the simplified model are 0.43 and 0.72 respectively, i.e. not worse than the corresponding R2cv values 0.41 and 0.72 of the non-simplified PLS model, both HUEREM and HUEREMAL appear to be best predictable by means of a simple (linear) combination of the analytical properties which is easily interpretable. Green and red components of the dyestuff have opposite influences on hue differences, green (a bit) more so than red. This simply relates to the fact that hue differences are expressed as ‘greener’ and ‘redder’ relative to the standard and that batches are mainly produced on the ‘greener’ side. That the sum of all dyestuffs is also selected might indicate that other colored components play a role too. Unfortunately, the best R2cv values appear to be so low that the usefulness of the above models may be doubted.


Table 7. Example 2: predictive power (R2cv) and fit (R2) of PCR, PLS, CCA and RDA with one dimension, PCR and PLS with two and eight dimensions, and MMR, for the responses PGS and AdA.

In the first example the PLS and PCR predictions are superior to the CCA, MMR and RDA predictions. In the second example (also an improved version of an example in Reference 4) the result is just the other way round. It relates to the Voltaren analogs data described in Section 1.3. Considering the R2cv values in Table 7, this time PCR and PLS are not competitive. For PGS the 1D RDA projection appears to be best predictive, for AdA the MMR projection and the 1D CCA projection. Note that the models corresponding to the 2D CCA and RDA projections are identical to the MMR model. In this respect the 1D CCA model is of lower dimension. A graphical comparison by superposition and interpolation of the prediction residuals of PLS and CCA for both PGS and AdA shows that large PLS errors are nearly systematically decreased by CCA. To see this, one may superimpose origin-centered circles and count the number of PLS and CCA residuals outside (Figure 27). Moreover, note that the best R2cv values lie around 0.8 for AdA. This may be an acceptable result.

Figure 27. Plot interpolation: residual plot of PLS and CCA 1D predictions


Figure 28. Misclassification before simplification: observed classes *, +; predicted classes 1, 2 (x axis: 1st discriminant component)

In the first example the simplified and non-simplified models lead to about the same predictive power. In the third example (this time adopted from Reference 5) simplification leads to a drastic improvement in predictive power. It relates to the ‘drug in development’ data described in Section 1.4. The prediction method used is LDA, because the response is categorical. Using the non-simplified first discriminant component, i.e. the first linear combination delivered by LDA, the estimated misclassification rate of 27% is definitely not satisfactory (see Figure 28). The application of stepwise regression (forward selection using R as the criterion) leads to the selection of the variables Pi4, POS2, S6 and ES6 (in that order), explaining more than 90% of the variation of the first discriminant component. Surprisingly, applying LDA only to these variables results in a totally correct classification (misclassification rate 0%)! Since the ultimate goal is to maximize biological activity, the corresponding (simplified) first discriminant component has to be maximized:

maximize 4.3 * S6 + 2.8 * ES6 + 2.8 * Pi4 - 5.2 * POS2

where

S6 = (electronic) Hammett constant for position 6
ES6 = (steric) Taft constant for position 6
Pi4 = lipophilicity in position 4
POS2 = indicator for a substitution of H in position 2

Note that such an optimization typically leads to an extrapolation of the model into unknown territory. This sometimes results in disappointing outcomes. In this example, however, the biological activity of the drug could have been distinctly improved with the help of the above optimization instruction. The procedure underlying the examples is summarized in Figure 29.
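A small hypothetical sketch of how the optimization instruction can be used to screen candidate substituent patterns (the coefficients follow the reconstructed formula above; the substituent values are invented for illustration):

    # Hypothetical screening of candidate analogs by the simplified first
    # discriminant component; larger scores indicate higher expected activity.
    def discriminant_score(S6, ES6, Pi4, POS2):
        return 4.3 * S6 + 2.8 * ES6 + 2.8 * Pi4 - 5.2 * POS2

    candidates = {
        'analog A': dict(S6=0.23, ES6=-0.55, Pi4=0.70, POS2=0),   # made-up values
        'analog B': dict(S6=-0.17, ES6=-1.24, Pi4=1.10, POS2=1),  # made-up values
    }
    for name, pars in candidates.items():
        print(name, round(discriminant_score(**pars), 2))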

Figure 29. Prediction examples (summary of the steps: compare prediction methods by cross-validation and predictive power; variables selection by stepwise regression or importance ranking; model check, interpretation and extrapolation)


6.7. Summary

The comparison of prediction methods by means of resampling leads to models for the prediction of responses by predictors. The procedure described above essentially corresponds to models linear in the parameters, but possibly non-linear in the variables. For the essential steps of prediction model identification see also Figure 29.

7. CONCLUSION

This tutorial has focused on univariate, bivariate and multivariate exploratory data analysis for quantitative data using linear methods for dimension reduction and prediction. All the discussed methods are routinely used in chemometric applications at CIBA-GEIGY. Methods specific to qualitative data, non-linear methods and multiblock methods have up to now not been very important in the author's work.

ACKNOWLEDGEMENTS

I would like to thank A. F. M. Smith for his suggestions for structuring the paper. Also, I would like to thank two referees for their comments.

REFERENCES

1. J. W. Tukey, Exploratory Data Analysis, Addison-Wesley, Reading (1977).
2. F. W. Young, D. P. Kent and W. F. Kuhfeld, in Dynamic Graphics for Statistics, ed. by W. S. Cleveland and M. E. McGill, pp. 391-424, Wadsworth, Monterey (1988).
3. C. Weihs and H. Schmidli, Stat. Sci. 5, 175-226 (1990).
4. C. Weihs and H. Schmidli, Mikrochim. Acta, 11, 467-482 (1991).
5. C. Weihs, in Methoden und Werkzeuge für die Exploratorische Datenanalyse in den Biowissenschaften, ed. by E. Enke, J. Golles, R. Haux and H.-D. Wernecke, pp. 111-127, Fischer, Stuttgart (1992).
6. S. J. Senn, 'Lucifer's lexicon: further definitions', mimeo (1988).
7. A. Gifi, Nonlinear Multivariate Analysis, Wiley, New York (1990).
8. P. Moser, A. Sallmann and I. Wiesenberg, J. Med. Chem. 33, 2358-2368 (1990).
9. F. J. Anscombe, Am. Stat. 27, 17-20 (1973).
10. A. Buja, T. Hastie and R. Tibshirani, Ann. Stat. 17, 453-555 (1989).
11. W. S. Cleveland, J. Am. Stat. Assoc. 74, 829-836 (1979).
12. W. S. Cleveland, The Elements of Graphing Data, Wadsworth, Monterey (1985).
13. W. Polasek, Explorative Daten-Analyse, Springer, Berlin (1988).
14. E. Trauwaert, P. Rousseeuw and L. Kaufman, Proc. 16th Conf. of the German Society of Classification, Dortmund, April 1992.
15. A. Buja, D. Asimov, C. Hurley and J. A. McDonald, in Dynamic Graphics for Statistics, ed. by W. S. Cleveland and M. E. McGill, pp. 277-308, Wadsworth, Belmont (1988).
16. W. S. Cleveland and R. McGill, J. R. Stat. Soc., Ser. A, 150, 192-229 (1987).
17. M. A. Fisherkeller, J. H. Friedman and J. W. Tukey, in Dynamic Graphics for Statistics, ed. by W. S. Cleveland and M. E. McGill, pp. 91-109, Wadsworth, Belmont (1988) (appeared originally in 1975).
18. JMP User's Guide, SAS Institute, Cary (1989).
19. A. W. Donoho, D. L. Donoho and M. Gasko, in Dynamic Graphics for Statistics, ed. by W. S. Cleveland and M. E. McGill, pp. 331-351, Wadsworth, Belmont (1988).
20. SAS/INSIGHT User's Guide, Version 6, SAS Institute, Cary (1991).
21. D. F. Swayne, D. Cook and A. Buja, 1991 Proc. of the Section on Statistical Graphics, American Statistical Association, Alexandria (1992), pp. 1-8.
22. ISP User's Guide (for Apollo Domain Version 5.0D), Artemis Systems, Carlisle (1988).


23. S-PLUS User's Manual, Statistical Sciences, Seattle (1988).
24. L. Tierney, LISP-STAT: An Object-oriented Environment for Statistical Computing and Dynamic Graphics, Wiley, New York (1990).
25. G. Marsaglia, Proc. Natl. Acad. Sci. USA, 61, 25-28 (1968).
26. D. Asimov, SIAM J. Sci. Stat. Comput. 6, 128-143 (1985).
27. C. Hurley and A. Buja, SIAM J. Sci. Stat. Comput. 11, 1193-1211 (1990).
28. J. H. Friedman, J. Am. Stat. Assoc. 82, 249-266 (1987).
29. P. J. Huber, Ann. Stat. 13, 435-475 (1985).
30. P. J. Huber, 'Programs for exploratory projection pursuit', Artemis Systems, Carlisle (1989).
31. J. M. Chambers, W. S. Cleveland, B. Kleiner and P. A. Tukey, Graphical Methods for Data Analysis, Wadsworth, Belmont (1983).
32. R. A. Becker and W. S. Cleveland, in Dynamic Graphics for Statistics, ed. by W. S. Cleveland and M. E. McGill, pp. 201-224, Wadsworth, Belmont (1988).
33. R. A. Becker, W. S. Cleveland and G. Weil, in Dynamic Graphics for Statistics, ed. by W. S. Cleveland and M. E. McGill, pp. 247-275, Wadsworth, Belmont (1988).
34. G. A. F. Seber, Multivariate Observations, Wiley, New York (1984).
35. M. F. Devaux, E. M. Qannari and D. J. Gallant, J. Chemometrics, 6, 163-175 (1992).
36. R. Gnanadesikan, Methods for Statistical Data Analysis of Multivariate Observations, Wiley, New York (1977).
37. R. Sibson, J. R. Stat. Soc., Ser. B, 40, 234-238 (1978).
38. K. R. Gabriel, Biometrika, 58, 453-467 (1971).
39. A. Hoskuldsson, J. Chemometrics, 2, 211-228 (1988).
40. Z. V. Lambert, A. R. Wildt and R. M. Durand, Psychol. Bull. 104, 282-289 (1988).
41. M. Stone and R. J. Brooks, J. R. Stat. Soc., Ser. B, 52, 237-269 (1990).
42. I. T. Jolliffe, Appl. Stat. 31, 300-303 (1982).
43. J. H. Friedman, Ann. Stat. 19, 1-141 (1991).
44. M. P. Derde and D. L. Massart, Anal. Chim. Acta, 184, 33-51 (1986).
45. C. Weihs, W. Baumeister and H. Schmidli, J. Chemometrics, 7, 131-142 (1993).
46. G. Cruciani, M. Baroni, S. Clementi, G. Costantino, D. Riganelli and B. Skagerberg, J. Chemometrics, 6, 335-346 (1992).
47. W. J. Krzanowski, Principles of Multivariate Analysis, p. 69, Clarendon, Oxford (1988).
48. B. Efron and G. Gong, Am. Stat. 37, 36-48 (1983).
49. C. Weihs, Proc. 16th Conf. of the German Society of Classification, Dortmund, April 1992; also Technical Report 9204, Mathematical Applications, CIBA-GEIGY, Basel (1992).
50. H. Schmidli, Technical Report 9206, Mathematical Applications, CIBA-GEIGY, Basel (1992).
