Spectral analysis of phylogenetic data

Share Embed


Descrição do Produto

Journal of Classification 10:5-24 (1993)

Spectral Analysis of Phylogenetic Data Michael D. Hen@ Massey University

David Penny Massey University

Abstract: The spectral analysis of sequence and distance data is a new approach to phylogenetic analysis. For two-state character sequences, the character values at a given site split the set of taxa into two subsets, a bipartition of the taxa set. The vector which counts the relative numbers of each of these bipartitions over all sites is called a sequence spectrtmt. Applying a transformation called a Hadamard conjugation, the sequence spectrum is transformed to the conjugate spectrum. This conjugation corrects for unobserved changes in the data, independently from the choice of phylogenetic tree. For any given phylogenetic tree with edge weights (probabilities of state chan~:e), we define a corresponding tree spectrum. The selection of a weighted phylogenetic tree from the given sequence data is made by matching the conjugate speclrum with a tree spectrum. We develop an optimality selection procedure using a least squares best fit, to find the phylogenetic tree whose tree spectrum most closely matches the conjugate spectrum. An inferred sequence spectrum can be derived from the selected tree spectrum using the inverse Hadamard conjugation to allow a comparison with the original sequence spectrum.

The authors wish to thank the three anonymous referees for their helpful comments on the first draft of this manuscript. Authors' Addresses: Michael D. Hendy, Department of Mathematics and Statistics, Massey University, Palmerston North, New Zealand (e-mail: [email protected]). David Penny, Department of Botany and Zoology, Massey University, Palmerston North, New Zealand (e-mail: [email protected]).

6

M . D . Hendy and D. Penny

A possible adaptation for the analysis of four-state character sequences with unequal frequencies is considered. A corresponding spectral analysis for distance data is also introduced. These analyses are illustrated with biological examples for both distance and sequence data. Spectral analysis using the Fast Hadamard transform allows optimal trees to be found for at least 20 taxa and perhaps for up to 30 taxa. The development presented here is self contained, although some mathematical proofs available elsewhere have been omitted. The analysis of sequence data is based on methods reported earlier, but the terminology and the application to distance data are new. Keywords: Phylogenetic trees; Bipartition; Hadamard transform; Hadamard conjugation; Spectrum; Nucleotide sequences; Distance data; Fast Hadamard Transform.

1. Introduction

Many methods of evolutionary tree inference from sequence or distance data follow the approach of: 1.

2.

Evaluating an optimality criterion for each possible tree or (when the number of trees is too large) for a small subset of the set of possible trees, selecting the optimal as the phylogeny; then Deriving lengths on the edges of that tree. These lengths may represent probabilities of character state change or numbers of events and might be correlated with time.

This reconstruction approach has difficulties because of the large number of potential trees to consider and because some of the measures (such as parsimony for sequence data and minimal tree for distance data) underestimate the edge lengths (Jukes and Cantor 1969). The required corrections to the edge lengths are non-linear and not local. Without these corrections, the wrong tree can be chosen (Felsenstein 1978, Hendy and Penny 1989). We develop the concept of spectral analysis as a tool for phylogenetic studies, where edge length corrections can be incorporated before the best fit tree is selected, reversing the order of procedures. In earlier papers (Hendy 1989, Hendy and Penny 1989, Hendy 1991) we introduced the use of a Hadamard matrix to derive an invertible mathematical relationship between a simple Poisson process on an evolutionary tree model of Cavender (1978) and the consequent set of character bipartitions at sites of homologous sequences of two-state characters. In this paper we extend the use of the Hadamard matrix for sequence data analysis with the introduction of a Hadamard conjugation and we develop a corresponding

Spectral Analysis o f Phylogenetic Data

7

application for distance data. The linear transformation, known as the Hadamard transform, is also used in image analysis; see for example Andrews (1970, Ch. 6-7) and in statistics where it is known as the Yates algorithm (Cooper 1968). H o w e v e r for a set o f n taxa, the Hadamard transform uses 4 "-1 operations. As with the Fourier transform, the Hadamard transform has an exponential order improvement, which is known as the Fast Hadamard Transform (FHT). T h e corresponding computation using the F H T requires (n - 1) x 2 "-1 operations. We need to introduce a simple modification to our earlier notation to exploit this.

2. Notation and Definitions A set N o f n elements has 2 n subsets. We will identify the elements o f N by the integers 1,2 . . . . . n, so that N = { 1,2 . . . . . n }. The 2" subsets o f N are:

So = Q ~ , S 1 = { 1 } , S 2 = {2},$3 = {1,2},$4 = {3},$5 = {1,3} . . . . .

$2._ 1 =N.

(A fuller list is given in Table 1.) These subsets are listed recursively, with i ~ Sj r the i-th binary digit o f j counting from the right, is 1. Thus for example S l l = {1,2,4} as 11 = 20 + 21 + 23 = 1011 in binary representation. Let m = 2 "-I , N" = {1,2 . . . . . n - 1} so that S(N') = {So,S1 . . . . . Sm-l } is the set o f subsets o f N ' . A bipartition (or split) o f the set N, is a pair {A,B} o f disjoint subsets whose union is N, that is A ~ B = O and A u B = N. (For completeness we include the pair {O,N} as a bipartition.) The number o f bipartitions o f N is m = 2 "-l . T h e r e is a natural bijection between S(N') and B(N), the set o f bipartitions o f N, where B(N) = {Bo,B l . . . . . Bm-I }, w i t h B i = { S i , N - S i } for Si ~ S(N'). (Thus we always list the subsets o f the bipartition {A,B } so that n c B. The index o f the bipartition is taken as the index o f the subset A which does not contain n.) W e will frequently refer to vectors x = (xo,xl . . . . . Xm-1) o f m real numbers, where the components xi are associated with the i-th bipartition. Such a vector will be called a spectrum. We also need to refer to sets which contain an even number o f elements. For any set S, the order o f S, written as I S I, is the number o f elements in S. An even ordered subset o f N is any subset with an even number o f elements. (This includes the empty set ~ with 0 elements.) The number o f even ordered subsets o f N is also m, where for S i c S(n ") we set:

8

M . D . H e n d y and D. Penny

Ei =

Si u {n} if ISil i s o d d Si otherwise.

This defines a natural bijection between S ( N 3 , the set o f subsets o f N" and E(N) = {Eo,E I . . . . . Em-I }, the set o f even ordered subsets o f N . n In Table 1 we list the elements of Si ~ S(N'), B i c B(N) and Ei ~ E(N), for n = 5. A phylogenetic tree (labelled by N) is a tree where n o f its vertices, v l,vz . . . . . v,, are indexed by N and all its unlabelled vertices are branching vertices with degree at least 3. In any phylogenetic tree labelled by N, there are at most n - 2 branching vertices and 2n - 3 edges. A fully resolved (or binary) tree is a tree where the labelled vertices are each o f degree 1 (pendant), while the branching vertices are each o f degree 3. In this case there are precisely n - 2 branching vertices and 2n - 3 edges. W e use the following scheme to index the edges o f a tree T. The deletion o f an edge e from T will create two subtrees, L (left) and R (right), oriented so that vn is a vertex o f R. Let S(L) and S(R) be the indices o f the indexed vertices o f L and R respectively. {S(L), S(R)} is a bipartition B i o f N, for some i < m. We call B i the edge bipartition induced by e. n ~ S(R) ~ S(L) c N', so for some i < m, S(L) = Si and we label e as ei. Let e(T) be the set o f edges o f T and let B(T) be the set o f edge bipartitions over all edges o f T. In figure 1 we illustrate the edge indices o f a particular phylogenetic tree. We define a weighted phylogenetic tree as a phylogenetic tree T, with a real value wi associated with each edge e i o f T. These values, which we refer to as edge weights or edge lengths, will be restricted to positive values. We extend the set o f edge weight values to a vector w ~ R m with components w; > 0 for each edge ei o f T, setting wj = 0 for the other indices 0 < j 0 for all edges e i of T and qj = 0 for the remaining indices j > 0. These relations derived from the s values are similar to the invariants introduced in recent work of Cavender and Felsenstein (1987) and Sankoff(1990). For the case o f n = 4, combinations of q3, q5 and q6 can be

14

M . D . Hendy and D. Penny

shown to be related to the three invariant functions K1, K2 and K3 o f Cavender and Felsenstein, for example q3 = 0 and q5 = 0 r K3 = 0. However we cannot expect an observed sequence spectrum to fit exactly. We define the vector y = H -1 In (Hs) as the conjugate spectrum (of s), derived by applying the Hadamard-log conjugation to s. For an exact fit 7 = q, the tree spectrum of some tree T. In general however, we need to employ a fitting procedure to find an approximating tree spectrum q approximating 3f. We can apply the Hadamard-exponential conjugation to this inferred tree spectrum q to obtain the inferred sequence spectrum ~, which could be compared with the original sequence spectrum s. In the next section we introduce one possible fitting procedure which finds the closest (Euclidean distance) tree spectrum q to 7. We can generalize the above formulae to the case where the ratio of frequencies of character values X and Y in the sequences is a ' b (with a + b = 1). We assume this ratio is stable, that is assume the vector (a,b) is a fixed point of the process. With o~ = 4ab we find the probability of obtaining different character values at the endpoints of a path of length t is p = l / 2 c z ( 1 - e xt/2a) for some constant ~. We define the functions LNcL(x) = o~ln(1 + x - 1 ) and EXPa(x) = 1 + o~(ex/a - 1) and find the relact tionship between q and ~ becomes 1 m-I

sk = -

1 m-1

ra-1

~., hjk EXPa( ~ hijqi) and m j:o

i=0

qi

=-

m-1

Z hij L N e t ( ~ hjks,).

m j:0

k=0

(Note, when a = b = 89 ot = 1. LNl(X) = In(x), EXPI(x) = e x, so that the two pairs of formulae are consistent.) 5. Closest Tree Selection For a given phylogenetic tree, the spectral set of T, W(T), is the set of all possible tree spectra on T, i.e., all w values where wi > 0, for each edge ei of T. The spectral space, W(N), is the union of the spectral sets over all phylogenetic trees labelled by N. W(N) is a subset of the hyperplane m-I

W = {w I Z wi

=0} CRm.

i=o

A point y of W is in W(N) r 7 / > 0 V i > 0 and for every i, j e N with ~'i and "fj >0, B i and Bj are compatible. Given any point y in W, we need to find the point w e W(N) which best approximates 7 in some way. The closest tree selection procedure selects the weighted tree (T,w), such that w e W(T) where w is the closest point (by Euclidean distance) in W(N) to 7.

Spectral Analysis of Phylogenetic Data

15

For any given point 2( of W and given phylogenetic tree T with e edges, let

1 2(r- e + l

(2(0+

~

2(i).

ei ~ e ( T )

The closest point to 2( in the spectral space of T can be shown to be w = w(T), m-I

where wi = 2(i - 2(r for e i ~ e ( T ) (provided 2(i > 2(r); w0 = - ~ wi, and wi = 0 i=1

otherwise (Hendy 1991). We define the least squares bestfit error of 2( to the spectral space of T to be F2(Z2() = Iw(T)-3,12 =

E

2(/2 + (e + 1)(2(T)2 .

e~ ~ e(T); i > 0

(this is the squared Euclidean distance o f t to W(T).) The closest tree (,w) to e(T); i > 0 2( is defined to be the tree T for which F2(T,2() is minimal. This minimization is easily adapted to a branch and bound algorithm such as TurboTree (Penny and Hendy 1987). e i g.

6. Distance Data We can also derive a spectrum from distance data which we will call the distance spectrum. This is equivalent to the conjugate spectrum from sequence data and similarly can be used to infer a tree spectrum using the closest tree selection procedure. Suppose we have a weighted tree (T,w), with tree spectrum w. (Thus the weights Wk > 0 for each edge ek of T.) The tree distance between the pair of vertices vi and vj is the sum of the weights on each edge of the path connecting them. As we saw in Section 3, the tree distances can be determined using the Hadamard transform with t = - ~ Hw, where for Ek = {i,j }, tk is the distance between vi and vj. The inverse problem is more dill]cult. Given a set of n objects and a set of pairwise distances dij > O, V i, j ~ N, we would like to determine the weighted tree (T,w) that best fits these distances. If the distances dij are additive, that is if they are exactly the tree distances from some weighted tree (T,w), then these are easily determined by existing methods, such as UPGMA (Sneath and Sokal 1973, Ch. 5.5). However for real data, we cannot expect an exact fit and therefore we must employ some "best fit" selection of a weighted tree (T,w). One possible procedure is to estimate the tree spectrum from the set of distances. Often the measured distances between the objects underestimate the actual tree distances we are trying to determine and a nonlinear correction for this, such as that considered by Felsenstein (1987) could be incorporated.

16

M . D . Hen@ and D. Penny

Before we can invert t = - 89Hw, we must interpret the values of tk, where Ek contains four or more elements. For any ordered subset Ek, t, is the sum of the weights wi over all edges of the path set Ilk. When the distances are additive, this sum can be determined iteratively, without reference to T. min Let D(Eo) = O, D({i,j}) = dij and D(Ek) = i # j ~ Ek (dij + D(E, - {i,j})). (This construction picks out the edges in the disjoint union 1-It V Ilk-, where

El = {i,j}, Ek" = Ek - E t . In computing D(Ek), we can fix i as any element of E, and only need to consider each j ~ E, - {i}.) We find tk = D(Ek) is the sum of the edge weights of the edges of Ilk. For non-additive data we must assume that the above procedure gives an approximation t to the tree spectrum. We refer to t as the distance spectrum. We find that is the distances are " a l m o s t " additive, then the distance spectrum will be " c l o s e " to the tree spectrum. We apply the closest tree procedure to find the best fit (T,w) to the distance spectrum t, as described in Section 3. From (T,w) we can calculate the inferred distance vector d = - 89 Hw, from which we can determine the individual inferred d i s t a n c e s dij = tk, where Ek = {i,j }. Although the closest tree fit for the distance spectrum is a least squares best fit, it differs from the least squares algorithm of De Stere (1983) which minimizes E. (dij - dij) 2. In De Soete's formula, the erros in the lengths of l

~

0

z

,<

Figure 4. The weighted phylogenetic tree (T,q) inferred from the distance spectrum of Figure 3.

Spectral Analysis of Phylogenetic Data

19

TABLE 2 Immunological Distances and the corresponding Inferred Distances from the Tree of Figure 4 DOG DOG BEAR 33.15 RACCOON 46.80 MINK 49.13 SEAL 48.95 SEALION 47.35 CAT 97.50 MONKEY 151.40

BEAR RACCOON /vlJ2qK SEAL SEALION 32 25.69 31.93 31.75 30.15 80.29 134.20

48 26 45.59 45.40 43.81 92.93 137.86

51 34 42

50 29 44 44

41.57 39.98 24.13 86.77 89.93 140.68 143.84

48 33 ~ 38 24 88.34 142.25

CAT MONKEY 98 84 92 86 89 90

1~8 136 152 142 142 I42 148

148.13

7. Sequence Data (four-state character sequences) With nucleotide sequences there are four character states, A, C, G and T for DNA and A, C, G and U for RNA. We have considered (Penny et al. 1990) several ways of adapting the two-state character analysis above to handle four characters. Our spectral analysis is based on the relationship between the bipartitions from the sequence data and the edge bipartitions of trees. With four-state characters the sequence partitions have up to four subsets. Unfortunately there does not appear to be a simple mechanism for relating these partitions to the edge bipartitions. One approach we have used is to map four state character sequences to two state character sequences in several ways and then combine the resultant conjugate spectra. For each character state X we consider the bipartitions of the form X/-X, where - X combines the remaining three states. Thus we can separately calculate the conjugate spectra for A/-A, C/-C, G / - G and T(U)/-T(U). In each case we use the modified Hadamard conjugations with o~ = 4x(1 - x ) , where x is the relative frequency of the character X in the sequences over all taxa. Different ways of combining these spectra need to be investigated. One method we have used is to normalize each of the spectra, m-1

so that ]~ Yi = 1, (which is readily achieved by dividing 7 by - 7 0 ) and then i=1

take the average of the four normalized 7 vectors as a combined conjugate spectrum. We then use this combined spectrum for the least squares best fit to infer the closest tree spectrum. In Example 3 we illustrate this process with some sequence data of seven taxa. Example 3: The data are seven DNA sequences selected from Lake (1987). The taxa are human(l), rat(2), frog(3), brine shrimp(4), trypanosome(5), euglena(6) and maize(7). The relative frequencies of the nucleotides A, C, G,

20

M.D. Hendy and D. Penny

la)

~_i ............. I..........1 Figure 5. (a) The sequence spectrum for A / - A derived from the data of example 3; (b) The conjugate spectrum for A / - A ; (c) The average conjugate spectrum for the seven taxa of example 3; and (d) The inferred sequence spectrum from the closest tree (figure 6) derived from these data.

Spectral Analysis of Phylogenetic Data

21

48 .128 16 .272 32 .232

63 .066

7 }.025 .09t .039

1 '.003

41.0248

2 1.011

--"

0"~

"<

3 Figure 6. The weighted phylogenetic tree (T, q) inferred from the average spectrum of Example 3.

22

M.D. Hendy and D. Penny

TABLE 3 2 1 1 Computation of ~2 = - ~Hw and w = - mHr = - ~Hr using the Fast Hadamard Transform.

x

x0

Xl

x2

x3

x4

x5

x6

x7

w:

-25 -18 -10 0 0 16 42 100 -25

7 -32 -30 -32 16 -16 -14 -28 7

5 8 -26 -28 14 26 -10 -20 5

3 2 -34 -24 12 2 -18 -12 3

4 4 10 -20 10 24 58 -16 4

0 4 -2 -28 14 -4 -14 0 0

0 6 -2 -24 12 34 -10 0 0

6 -6 10 -44 22 -10 6 -24 6

Hw: r:

Hr: w:

T are: .2392, .2406, .2353, .2848. In Figure 5 we give the sequence and conjugate spectra for the bipartitions induced by A / - A and the average conjugate spectrum and inferred sequence spectrum derived from the closest tree. The weighted phylogeny of Figure 6 is the closest tree derived from the average conjugate spectrum for these taxa. 8. T h e F a s t H a d a m a r d

Transform

If we used standard matrix multiplication to form the product 1-Ix, where H has m = 2 m-1 rows and columns, we require O(4 n) steps and components of storage. This is prohibitively expensive for all but small (n < 10) values. However there is an exponential order speed up, with a corresponding reduction in storage requirements, called the fast Hadamard transform (FHT). (See, for example, Whelchel and Guinn 1968.) The form of the FHT we use requires exactly (n - 1) x 2n-I additions and multiplications and uses a single array of 2 '~-1 components. To multiply y = H x , where x = (x0,xl . . . . . Xm-l) using exactly (n - 1) x 2 "-I steps we successively replace a pair of elements xi and xj with their sum xi + xj and difference x i - x j, where for i < j, xi ~---xi + x j, xj ~-- xi -xj. In the first stage this replacement is applied to distinct adjacent pairs, in the second stage to distinct pairs 2 apart and then to pairs 4, 8, ...apart, so that at the i-th stage the pairs are 2 i-1 apart. After applying 2 n-1 stages the resultant vector is y. These calculations are illustrated in Table 3 _

2

where we compute p = - 21 Hw and w = - -m- H r using the values of Exampie 1.

Spectral Analysis of Phylogenetic Data

23

References ANDREWS, H. C. (1970), Computer Techniques in Image Processing, New York: Academic Press. CAVENDER, J. A. (1978), "Taxonomy with Confidence," Mathematical Biosciences, 40, 271-280. CAVENDER, J. A., and FELSENSTEIN, J. (1978), "Invariants of Phylogenies: Simple Cases with Discrete States," Journal of Classification, 4, 57-71. COOPER, B. E. (1968), "The Extension of Yams' 2" Algorithm to any COmplete Factorial Experiment," Technometrics, 10, 575-577. DE SOETE, G. (1983), " A Least Squares Algorithm for Fitting Additive Trees to Proximity Data," Psychometrika, 48, 621-626. FARRIS, J. S. (1972), "Estimating Phylogenetic Trees from Distance Matrices," American Naturalist, 106, 645-668. FARRIS, J. S. (1978), "Inferring Phylogenetic Trees from Chromosome Inversion Data," Systematic Zoology, 27, 275-284. FELSENSTEIN, J. (1978), "Cases in which Parsimony or Compatibility Methods will be Positively Misleading," Systematic Zoology, 27, 401-410. FELSENSTEIN, J. (1987), "Estimation of Hominoid Phylogeny from a DNA Hybridization Data Set," Journal of Molecular Evolution, 26, 123-131. HADAMARD, J. (1893), "Resolution d'une question relative aux determinants," Bulletin des Sciences Mathematiques Series 2, 17, 240-246. HEDAYAT, A., and WALLIS, W. D. (1978), "Hadamard Matrices and their Applications," Annuls of Statistics, 6, 1184-1238. HENDY, M. D., and PENNY, D. (1982), "Branch and Bound Algorithms to Determine Minimal Evolutionary Trees," Mathematical Biosciences, 59, 277-290. HENDY, M. D., amd PENNY, D. (1989), " A Framework for the Quantitative Study of Evolutionary Trees," Systematic Zoology, 38, 297-309. HENDY, M. D. (1989), "The Relationship Between Simple Evolutionary Tree Models and Observable Sequence Data," Systematic Zoology, 38, 310-321. HENDY, M. D. (1991), " A Combinatorial Description of the Closest Tree Algorithm for Finding Evolutionary Trees," Discrete Mathematics, 96, 51-58. JUKES, T. H., and CANTOR, C. H. (1969), "Evolution of Protein Molecules," in Mammalian Protein Metabolism, Ed. H. M. Munro, New York: Academic Press, 21-123. LAKE, J. A. (1987), "Prokaryotes and Archaebacteria are not Monophyletic: Rate Invariant Analysis of rRNA Genes Indicates that Eukaryotes and Eocytes From a monophyletic Taxon," Cold Spring Harbor Symposia on Quantitative Biology, 52, 839-846. LAKE, J. A. (1987a), " A Rate-Independent Technique for Analysis of Nucleic Acid Sequences: Evolutionary Parsimony," Molecular Biology and Evolution, 4, 167-191. PENNY, D., and HENDY, M. D. (1987), "TurboTree: A Fast Algorithm for Minimal Trees," Compute r Applications in the Biosciences, 3, 183-188. PENNY, D., HENDY, M. D., ZIMMER, E. A., and HAMBY, R. K. (1990), "Trees from Sequences: Panacea or Pandora's Box?" Australian Journal of Systematic Botany, 3, 21-38. SANKOFF, D. (1990), "Designer Invariants for Large Phylogenies," Molecular Biology and Evolution, 7, 255-269. SARICH, V. M. (1969), "Pinniped Origins and the Rate of Evolution of Carnivore Albumins," Systematic Zoology, 18, 286. SCHROEDER, M. R. (1986), Number Theory in Science and Communication, 2nd ed., Berlin: Springer-Verlag.

24

M . D . Hendy and D. Penny

SNEATH, P. H. A., and SOKAL, R. R. (1973), Nl~,nerical Taxonomy, San Francisco: W. H. Freeman. STEEL, M. A. (1989), Distributions on Bicoloured Evolutionary Trees, Ph.D. thesis, Massey University, Palmerston North. WHELCHEL, J. E., and GUINN, D. F. (1968), "The Fast Fourier-Hadamard Transform and its Use in Signal Representation and Classification," Eascon 1968 Convention Record, 561-573.

Lihat lebih banyak...

Comentários

Copyright © 2017 DADOSPDF Inc.