Progress in Biophysics & Molecular Biology 70 (1998) 175–222
Artificial neural networks for computer-based molecular design
Gisbert Schneider a,*, Paul Wrede b
a F. Hoffmann-La Roche Ltd., Pharmaceuticals Division, Molecular Design and Bioinformatics, CH-4070 Basel, Switzerland
b Freie Universität Berlin, Universitätsklinikum Benjamin Franklin, Institut für Medizinische/Technische Physik und Lasermedizin, AG Molekulare Bioinformatik, Krahmerstraße 6-10, D-12207 Berlin, Germany
Abstract

The theory of artificial neural networks is briefly reviewed, focusing on supervised and unsupervised techniques which have great impact on current chemical applications. An introduction to molecular descriptors and representation schemes is given. In addition, worked examples of recent advances in this field are highlighted and pioneering publications are discussed. Applications of several types of artificial neural networks to compound classification, modelling of structure–activity relationships, biological target identification, and feature extraction from biopolymers are presented and compared to other techniques. Advantages and limitations of neural networks for computer-aided molecular design and sequence analysis are discussed. © 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Combinatorial chemistry; Compound library; Drug design; Genetic algorithm; Molecular diversity; Molecular descriptor; Neural network; Sequence analysis; Structure–activity relationship
Abbreviations: 2D, two-dimensional; 3D, three-dimensional; ANN, artificial neural networks; ART, adaptive resonance theory; BAM, bi-directional associative memory; bp, back-propagation-of-errors; CAMD, computer-aided molecular design; CPN, counter-propagation network; LMS, least-mean-square; PCA, principal component analysis; QSAR, quantitative structure–activity relationship; RBF, radial basis function; ReNDeR, reversible non-linear dimension reduction; SAR, structure–activity relationship; SBB, structural building blocks; SNN, supervised neural networks; UNN, unsupervised neural networks.

* Corresponding author. Tel.: +41-61-6870696; fax: +41-61-6887408; e-mail: [email protected].
1. Introduction

Intelligent algorithms for computer-aided molecular design (CAMD) have become an important part of lead discovery and optimization, biological target identification, protein and nucleic acid design, and assessment of data base diversity (Hunter, 1993; Wrede and Schneider, 1994; Jackson, 1995; van de Waterbeemd, 1995a,b; Blundell, 1996; Böhm et al., 1996; Willett, 1997; Young et al., 1997). Various methodological CAMD approaches help speed up the drug development process either by planned alteration of molecular template structure(s) or by de novo design (Müller, 1995). Most of today's CAMD methods are based on knowledge of three-dimensional molecular structures. In this work we highlight new techniques for molecule design and molecular feature extraction which can be applied when three-dimensional molecular structures are unknown.

A necessary prerequisite for any rational attempt to identify or even design molecules with a desired property or activity is an accurate model of the underlying structure–activity relationship (SAR) (Hansch and Leo, 1995). Such SAR models serve as a guideline in the search for novel or optimized compounds in evolutionary design cycles, which have become possible due to advances in both compound generation and screening technology. It is obvious that the quality of the model determines the success rate of this multi-dimensional design process. Only if a relevant SAR model is used, e.g. a particular pharmacophore hypothesis, can rational molecular design be successful.

How can we develop a good SAR model? It is apparent that no cure-all recipe exists; nevertheless some general rules of thumb can be given. One approach is to consider the task as a pattern recognition problem, where three main aspects must be considered: first, the data used for generation of a SAR hypothesis should be representative of the particular problem; second, the way molecular structures are described for model generation and its level of abstraction must allow for a reasonable solution of the pattern recognition task; and third, the model must permit non-linear relationships to be formulated, since the interdependence between molecular activities and structural entities is generally non-linear. The first point seems trivial, but selection of representative data for hypothesis generation is very difficult and often impossible due to a lack of data. This article mainly focuses on a discussion of the two latter points, namely different levels of data representations and descriptor types, and non-linear feature extraction from a given data set by artificial neural networks (ANN).

Various types of ANN are of considerable value for many fields of research, including chemistry, biology, medicine, and pharmaceutical research. Main tasks performed by these systems are
- feature extraction;
- function estimation and non-linear modelling;
- classification;
- prediction.
For many applications alternative techniques exist (Milne, 1997); ANN provide, however, an often more flexible and elegant approach offering unique solutions to these tasks. The paradigms offered by ANN lie somewhere between purely empirical and ab initio approaches. Neural networks
- `learn' from examples and acquire their own `knowledge' (induction);
- are able to generalize;
- provide flexible non-linear models of input/output relationships;
- are able to cope with noisy data and are fault-tolerant.
Here we review some of the pioneering work in this field, highlighting the development of ANN for (Q)SAR modelling and molecular design, which has dramatically gained momentum during the last several years (Bradley, 1997). A special emphasis will be on ANN for feature extraction from nucleic acid and protein sequences (Section 4.1), which can help in finding new drug targets and enlarging today's target spectrum (Drews and Ryser, 1997).

2. Principles of neural network development

Artificial neural networks have found widespread use for classification tasks and function approximation in many fields of chemistry and bioinformatics (Table 1). For these kinds of data analysis mainly two different types of networks are employed, `supervised' neural networks (SNN) and `unsupervised' neural networks (UNN).

The main applications of SNN are function approximation, classification, pattern recognition and feature extraction, and prediction tasks (Table 1). These networks require a set of molecular compounds with known activities to model structure–activity relationships. In an optimization procedure, which will be described below (Section 2.3), these known `target activities' serve as a reference for SAR modelling. This principle coined the term `supervised' networks. Correspondingly, `unsupervised' networks can be applied to similar tasks even without prior knowledge of molecular activities. These systems can perform automatic feature extraction and data classification; an introduction to UNN development is provided in Section 2.2.

Table 1. Neural network types frequently used in the life sciences (adapted from Sumpter et al., 1994)

Network type/architecture: Main applications

Supervised
  Multilayer feed-forward (bp): non-linear modelling of (Q)SAR, prediction of molecule activity and structure, pattern recognition, classification, signal filtering, noise reduction, feature extraction
  Recurrent networks: sequence and time series analysis
  Encoder networks (ReNDeR): data compression, factor analysis, feature extraction
  Learning vector quantization: auto-associative recall, data compression

Unsupervised
  Kohonen self-organizing map: clustering, data compression, visualization
  Hopfield networks: auto-associative recall, optimization
  Bidirectional associative memory (BAM): pattern storage and recall (hetero-association)
  Adaptive resonance theory (ART) models: clustering, pattern recognition

Hybrid
  Counterpropagation networks: function approximation, prediction, pattern recognition
  Radial basis function (RBF) networks: function approximation, prediction, clustering
  Adaptive fuzzy systems: similar to ART and bp-networks
Hybrid systems have also been developed and successfully applied to various pattern recognition and modelling tasks (Section 2.4). In Table 1 the main network types and characteristic applications in the life sciences are listed. We restrict the following description of neural networks to those most widely applied in the field of molecular design and analysis of protein and nucleic acid sequences. More extensive introductions to the theory of neural computation, including comparisons with statistical methods, are given in various textbooks (e.g., Rumelhart et al., 1986; Hertz et al., 1991; Amari, 1993; Bishop, 1995; Ripley, 1996). There are many books on neural networks covering parts of the subject not discussed here, e.g. the relation of neural network approaches to machine learning (Mitchie et al., 1994; Bratko and Muggleton, 1995; Langley and Simon, 1996; Langley, 1996), fuzzy classifiers (Kosko, 1992), time series analysis (Weigend and Gershenfeld, 1993), and special aspects of pattern recognition (Carpenter and Grossberg, 1991). Ripley provides an in-depth treatment of the relation of neural network techniques to Bayesian statistics and Bayesian training methods (Ripley, 1996). A commented collection of pioneering publications in neurocomputing has been compiled by Anderson and Rosenfeld (1988). Recently, two texts have become available covering both theory and applications of ANN in chemistry, QSAR, and drug design (Zupan and Gasteiger, 1993; Devillers, 1996a).

Frequently data sets must be investigated for homogeneity. A typical example is the investigation of a compound library for clusters of similar molecules (or performing a `diversity' analysis) (Downs and Willett, 1995; Willett, 1997). UNN are able to automatically recognize clusters of data in a given set based on a rational representation of the compounds and a sensitive similarity measure. What is meant by `similarity' and `diversity' must be carefully defined, since the selection of an appropriate measure of similarity is crucial for the clustering results (Good et al., 1993a,b; Dean, 1995; Sen, 1995a,b). The term `unsupervised' stresses the fact that no search template is required for this application, i.e. the network automatically extracts data features, and no knowledge or definition of underlying attributes is required. These systems can be helpful in generating an overview of the distribution of the data sets, and they are, therefore, especially suited for a first examination of a given data set or raw data analysis. In Fig. 1a a distribution revealing two main clusters is shown, which might represent sets of active and inactive molecular compounds. Such inhomogeneities can be easily detected by UNN. Some types of UNN are able to form a comprehensive model of the real data distribution, which can be helpful for extraction and understanding of predominant data features, e.g. molecular structures or properties that are responsible for a certain biological activity.

In contrast to UNN, the structuring of a data set must already be known to apply SNN. These systems are able to approximate arbitrary relationships between data points and their functional or classification values, i.e. they can be used to model all kinds of input–output relationships (Fig. 1b) and classify data by establishing a classificator function (Fig. 1c).
The term `supervised' network indicates that, in contrast to UNN, one or more corresponding functional values must already be known for every data point (these might be experimentally measured molecule activities).
Fig. 1. (a) Data distribution revealing two major clusters; (b) modelling a mathematical function f(x) to establish an input–output relationship by SNN; (c) a classificator (line) formed by a SNN separating two classes of data (shown as circles and squares).
2.1. Building blocks of neural network architecture

Artificial neural networks consist of two elements, (i) formal neurons, and (ii) connections between the neurons. Neurons are arranged in layers, where at least two layers of neurons (an input layer and an output layer) are required for construction of a neural network. In Fig. 2 three network architectures are shown: a two-layered network (Fig. 2a) and a three-layered network (Fig. 2b), each with a single output neuron, and a two-layered network with many output neurons (Fig. 2c). In these three architectures every neuron of one layer is connected to every neuron of the following layer, and no intra-layer connections exist. This property coined the term `fully connected feed-forward networks'. Sometimes they are referred to as `multilayered perceptrons', although the classical perceptron contains only a single neuron (Rosenblatt, 1958; Minsky and Papert, 1969). Fully connected feed-forward networks are by far the most frequently used neural networks for molecular design and bioinformatics applications. Formal neurons transform a numerical input to an output value, and the neuron connections represent numerical weight values. The weights and the neurons' internal variables (termed bias or threshold values) are free variables of the system which must be determined in the so-called `training phase' of network development. Selected feed-forward network types and appropriate training algorithms are discussed in the following. For details on other network architectures, see the literature (Hertz et al., 1991; Bishop, 1995; Ripley, 1996).

2.2. Unsupervised training of neural networks

Fig. 2c shows the architecture of a two-layered, fully connected feed-forward net. This architecture is often used for data analysis by UNN. However, a number of differently structured UNN architectures have been described for specific applications (Ritter et al., 1990). A special type of UNN are self-organizing feature maps or Kohonen-networks, with their output layer neurons arranged as an array (Kohonen, 1982, 1984; Fig. 2d). In UNN both the number of input layer neurons and the number of weights connected to an output neuron are identical to the dimension of the input data (for example, molecular compounds represented by a number of descriptors). The input layer neurons simply distribute the data values to the
Fig. 2. Different architectures of feed-forward networks. Formal neurons are drawn as circles, weights are represented by lines connecting the neurons. The flow of information through the networks is from top to bottom. White circles: fan-out neurons; gray circles: sigmoidal neurons; black circles: linear neurons. (a) Perceptron; (b) three-layered feed-forward network; (c + d) two-layered feed-forward network.
output neurons (`fan-out' units), and the output neurons compute a value of 0 or 1 depending on their input. Output neurons represent possible data clusters or categories. The number of clusters considered is identical to the number of output layer neurons. Only a single output neuron computes the value 1 for a given data point at a time; all other output layer neurons have an output value of 0. In this way the data point under investigation is assigned to a cluster. The decision as to which output neuron fires is made by comparing the data vector with all of the neurons' weight vectors, and the output neuron with the weight vector most similar to the data input is selected. The network weights provide the basis for data classification. Weight vectors must be determined by an optimization procedure during the training phase. The process of `unsupervised' network training is similar to vector quantization (Fig. 3) (Nasrabadi and King, 1988; Kohonen, 1989; Hertz et al., 1991). Initially the weights have arbitrary values (they should, however, not be too different from the data values to facilitate convergence of the training process). Then a cyclic optimization procedure takes place, termed `competitive' learning:
step 1: randomly select a data point, x, from the data set to be classified;

step 2: determine the output neuron with the weight vector closest to the data point (`winner neuron' w*):

\[ |w^{\ast} - x| \le |w_i - x| \quad \text{for all } i; \]

step 3: update the weight values of the winner neuron according to the following prescription:

\[ \Delta w^{\ast}_j = \eta \,(x_j - w^{\ast}_j), \]

where η is a learning step size which usually is a function of time, η(t). Conditions on η(t) needed to ensure convergence are discussed in the literature (Cottrell and Fort, 1986; Ritter and Schulten, 1988);

step 4: go to step 1 or stop training (e.g., if the value of η is below a critical threshold, or if a certain number of cycles has passed).

This `winner-takes-all' strategy forces the weight values of the network to move towards centroids of the data distribution and become a set of prototype vectors (Fig. 3). All data points located within the `receptive field' of an output neuron will be assigned to the same cluster. The receptive fields of a fully trained UNN are comparable to the areas defined by Voronoi tessellation (or Dirichlet tessellation) of the input space (Hertz et al., 1991; Preparata and Shamos, 1985). All data which are closer to the weight vector of one neuron than to any other weight vector belong to its receptive field. The criterion employed to find the winner neuron can either be a distance between a weight vector, w, and a data vector, x. Frequently used are the Euclidean distance:

\[ d = \sqrt{\sum_i (w_i - x_i)^2}, \]
Fig. 3. Principle of vector quantization by unsupervised network training. The network weight vectors (filled arrowheads) move towards centers of the data distribution during training. Location of the weight vectors before (a) and after training (b). Data points are drawn as vectors with open arrowheads. Here two clusters are formed by two neurons; 1 and 2 are two dimensions of the data space.
or the Manhattan distance:

\[ d = \sum_i |w_i - x_i|. \]

The complementary approach is to determine the neuron leading to the maximal formal output, where

\[ \mathrm{output} = \sum_i w_i x_i. \]
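To make the competitive learning cycle concrete, the following minimal sketch (in Python; our own illustration with a made-up two-cluster data set, not code from the cited literature) implements steps 1 to 4 with a Euclidean winner criterion and a learning step size η(t) that decreases with time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two clusters in a two-dimensional descriptor space (cf. Fig. 3)
data = np.vstack([rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2)),
                  rng.normal(loc=(1.0, 1.0), scale=0.1, size=(50, 2))])

n_output = 2                              # number of output neurons (clusters)
weights = data[rng.choice(len(data), n_output, replace=False)].copy()  # init near the data

n_cycles = 2000
for t in range(n_cycles):
    eta = 0.5 * (1.0 - t / n_cycles)      # learning step size eta(t), decreasing with time
    x = data[rng.integers(len(data))]     # step 1: randomly select a data point
    dists = np.linalg.norm(weights - x, axis=1)
    winner = np.argmin(dists)             # step 2: winner neuron = closest weight vector
    weights[winner] += eta * (x - weights[winner])  # step 3: move the winner towards the data point

# After training the weight vectors approximate the cluster centroids, and each
# data point is assigned to the receptive field of its nearest weight vector.
labels = np.argmin(np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2), axis=1)
print("prototype vectors:\n", weights)
print("cluster sizes:", np.bincount(labels))
```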
As with similarity measures in statistical clustering, for example k-nearest neighbor methods, the definition of similarity between data points (or weight vectors, respectively) has a great influence on the outcome of the analysis.

The optimum number of output neurons is not known a priori, and the choice very much depends on the knowledge available about the original data distribution. Sometimes it is reasonable to expect only a limited number of data classes like `active compounds' and `inactive compounds'. There are also simple methods that can be used to automatically adapt the size of the output layer. For example, a new output neuron may be added located within the receptive field of an already existing neuron if data belonging to its receptive field are less similar to the weight vector than a minimum threshold; an output neuron might be removed if it represents only a single data point or all data belonging to its receptive field are more similar to the weight vector than a maximum threshold. Currently much effort is spent on the development of techniques for optimizing UNN architectures.

A useful modification of the `winner-takes-all' algorithm described above is to allow more than just one single output neuron to adapt its weights during a training cycle, as suggested by Kohonen (1982, 1984, 1989). In a Kohonen-network the output layer neurons are arranged in a plane or another low-dimensional geometry. During training not only the winner neuron weights are updated but also those of neurons located close to the winner neuron in the output layer, as defined by a neighborhood function. As a consequence, the topology of the usually high-dimensional input space is conserved in the low-dimensional space spanned by the network output neurons. Analysis of the features extracted by the network and their distribution in the neuron map can help in getting an idea of the structuring of the training data. Simple examples of UNN training by the Kohonen algorithm are shown in Fig. 4. Two-layered networks containing two input neurons and 10 × 10 output neurons arranged in a plane were trained by 100 randomly distributed data points (Fig. 4a) and by ten data points roughly forming the shape of a heart (Fig. 4b). The network architecture is shown in Fig. 2d. In the example the two-dimensional input space was mapped onto a plane spanned by the output neurons. After training the network topology reflects the corresponding data distribution. For visualization of the network development process in Fig. 4 the output neurons were connected by lines, where the co-ordinates of the intersections are given by the corresponding weight vectors.

Besides data clustering, Kohonen-networks are useful tools for data visualization. For example, they were applied to mapping the electrostatic surface potential of molecules onto a plane (Gasteiger and Li, 1994; Barlow, 1995). This can be useful for visual inspection of potential pharmacophores and comparison of potential drug candidates.
Fig. 4. Examples of topology-preservation in Kohonen-networks. (a) Training of a 10 × 10 Kohonen-network with randomly distributed data; (b) training of a 10 × 10 Kohonen-network with only a few data points roughly forming the shape of a heart. Stages of network development during the training process are shown. At the end of UNN training the network topologies reflect the underlying data distributions.
An example of visualizing a model of the chemical space covered by a molecular compound library using Kohonen-mapping is presented in Section 4.2.

Several modifications of the Kohonen algorithm have been developed to improve the original method and overcome some limitations (e.g., Ritter and Schulten, 1988, 1989; Martinetz and Schulten, 1994). Of particular interest are special neighborhood functions in the output layer of the Kohonen-network to increase contrast between neuron clusters (`Mexican hat' function), and the use of a toroidal geometry of the output layer (Zupan and Gasteiger, 1993). The latter can help to avoid erroneous data classifications due to border effects of a planar topology (Li et al., 1993), and is especially useful for mapping a continuous input space, e.g. a molecular electrostatic potential (Zupan and Gasteiger, 1993; Holzgrabe et al., 1996).

A disadvantage of Kohonen-networks can be the comparatively long training time needed, especially if large data sets are used, since every data vector must be presented several times to the network for weight adaptation. Hybrid multi-layered UNN which can be trained extremely fast employing very large data sets have already been developed (Lu et al., 1990a,b; Hertz et al., 1991; Lu and Lerner, 1996). These systems can provide an alternative classification tool to Kohonen-networks if real-time or on-line computation is required, e.g. for control and analysis of high-throughput screening results. Usually they contain more than two layers of neurons, where from layer to layer a more subtle data classification is performed, and only parts of the
network are considered during one training cycle. This reduces the training time needed, since a data vector is not compared to every weight vector as in classical Kohonen-networks. To date, however, no application of these modified modular systems to molecular design has been published. An overview of early Kohonen-network applications in chemistry, together with a treatment of Hopfield networks, has been compiled by Melssen and co-workers (Melssen et al., 1994). A comparison of unsupervised Kohonen-networks with the supervised learning vector quantization approach was performed by Marabini and Carazo (1994), taking the classification of images of biological macromolecules as an example.
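As a sketch of the Kohonen modification discussed above (again our own illustration, not code from the cited studies), the winner-takes-all rule can be extended by a Gaussian neighborhood function over a 10 × 10 output grid, so that neurons close to the winner in the map are updated as well:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((100, 2))                      # 100 random points in 2D, as in Fig. 4a

grid = 10                                        # 10 x 10 output layer (cf. Fig. 2d)
# grid coordinates of each output neuron, used by the neighborhood function
coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
weights = rng.random((grid * grid, 2))           # one 2D weight vector per output neuron

n_cycles = 5000
for t in range(n_cycles):
    frac = t / n_cycles
    eta = 0.5 * (1.0 - frac)                     # decreasing learning step size
    sigma = 3.0 * (1.0 - frac) + 0.5             # shrinking neighborhood radius
    x = data[rng.integers(len(data))]
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Gaussian neighborhood in the output grid: neurons near the winner get large updates
    grid_dist2 = np.sum((coords - coords[winner]) ** 2, axis=1)
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
    weights += eta * h[:, None] * (x - weights)

# The trained map preserves the topology of the input space: neighbouring output
# neurons carry similar weight vectors (cf. the line plots in Fig. 4).
print(weights.reshape(grid, grid, 2)[0, :3])
```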
2.3. Supervised training of neural networks

SNN can be used as function estimators (Cybenko, 1988; Hertz et al., 1991) and classificators (Minsky and Papert, 1969). They follow the principle of convoluting simple non-linear functions for approximation of complicated input–output relationships (Kolmogorov's theorem, Kolmogorov (1957)). Fourier transforms, for example, are based on a similar idea using superpositions of sine and cosine terms. This general concept has been thoroughly discussed by Bishop (1995). Since in SNN mainly sigmoidal functions are employed, the following description of these systems will be restricted to this network type. Examples of networks using other neuron activities have been described, and first applications to molecular design exist (see below). We restrict the discussion of SNN to feed-forward networks because of their dominating role in chemistry and biology. Other network architectures, e.g. recurrent networks for time-series and sequence analysis, are not considered here.

Like UNN, a supervised neural network architecture consists of neurons and connection weights. However, in contrast to UNN, all neurons in a supervised net compute an output value every time a data vector is fed into the network. A scheme of a fully connected three-layered feed-forward network is shown in Fig. 2b. Input layer neurons are often referred to as `fan-out units' because they simply distribute the data vector to the next network layer (the hidden layer) without any calculation being performed. The input layer of SNN is identical to the input layer of UNN (Fig. 2). In the hidden layer, neurons with a sigmoidal activation function (or `transfer function') are used. A sigmoidal neuron computes an output value according to:

\[ \mathrm{Sigm}(\mathrm{input}) = \frac{1}{1 + e^{-\mathrm{input}}}, \qquad \mathrm{input} = \sum_i w_i x_i - \theta. \]
Here w is the weight vector connected to the neuron, x is the neuron's input signal, and θ is the neuron's bias or threshold value. If a single sigmoidal output neuron is used, the overall function represented by the fully connected two-layered feed-forward network shown in Fig. 2a is:
\[ f(x) = \mathrm{Sigm}\Bigl(\sum_i w_i x_i - \theta\Bigr), \]
where x is the input vector (data vector). The network shown in Fig. 2b with sigmoidal hidden units and a sigmoidal output unit represents a more complicated function:

\[ f(x) = \mathrm{Sigm}\Bigl(\sum_j v_j\, \mathrm{Sigm}\Bigl(\sum_i w_{i,j} x_i - \theta_j\Bigr) - \theta^{\mathrm{out}}\Bigr), \]

where w are the input-to-hidden weights, v are the hidden-to-output weights, θ_j are the hidden layer bias values, and θ^out is the output neuron's bias.
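The nested expression translates directly into a few lines of code; the following sketch (our own illustration, with arbitrary, untrained weight values) evaluates the two-layered network of Fig. 2a and the three-layered network of Fig. 2b for a single input vector:

```python
import numpy as np

def sigm(z):
    """Sigmoidal transfer function, Sigm(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -1.0, 0.7])          # input (data) vector

# Two-layered network (Fig. 2a): one sigmoidal output neuron
w = np.array([0.5, -0.3, 0.8])          # input-to-output weights (arbitrary values)
theta = 0.1                             # output neuron bias (threshold)
f_2layer = sigm(w @ x - theta)

# Three-layered network (Fig. 2b): two sigmoidal hidden neurons, one sigmoidal output
W = np.array([[0.5, -0.3, 0.8],         # W[j, i]: input-to-hidden weights, one row per hidden unit
              [-0.6, 0.9, 0.1]])
theta_hidden = np.array([0.1, -0.2])    # hidden layer bias values
v = np.array([1.2, -0.7])               # hidden-to-output weights
theta_out = 0.05                        # output neuron bias
hidden = sigm(W @ x - theta_hidden)
f_3layer = sigm(v @ hidden - theta_out)

print(f_2layer, f_3layer)
```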
The more layers are present in a network, the more complicated overall functions can be represented. At most two hidden layers with non-linear neurons are required to approximate arbitrary continuous functions (Cybenko, 1989; Hornik et al., 1989). However, depending on the application and the accuracy of the fit, the required number of layers and the number of neurons in a layer can vary. In general the number of free variables in the system, i.e. the number of weight values, should be smaller than the number of data points in the training set to avoid an over-determined system. Baum and Haussler (1989) addressed the question of what size of network can be expected to generalize from a given number of training examples. A `generalizing' solution derived from the analysis of a limited set of training data will be valid for any further data point, leading to perfect predictions. This is possible only by the use of data which are representative of the problem. Most solutions found by feature extraction from lifelike data sets will, however, be sub-optimal in this general sense (Section 2.4). Andrea and Kalayeh (1991) proposed a rule of thumb according to which the ratio of training data points available to network weights, r, should be around 2, where

\[ r = \frac{\text{number of training data points}}{\text{number of weights}}. \]
A systematic investigation of ideal ranges of the parameter r in QSAR studies was performed by Manallack and Livingston (1995). They found that an r value in excess of 3 is needed to keep the risk of memorization (over-learning) low. It is not trivial to select an appropriate network architecture for a given task. Several techniques for optimization of SNN architecture have been developed or adapted, including `weight pruning' and genetic algorithms (Hinton, 1986; Sietsma and Dow, 1988; Harp et al., 1990; Lohmann, 1993; Bishop, 1995; Lohmann et al., 1996). Further rules for selecting an appropriate size of the hidden layer in bp-networks have been compiled by Devillers (Devillers, 1996b). Determination of a useful architecture by testing the performance of a SNN with independent test data during and after SNN training is essential. The result gives useful feedback to the researcher. If the test performance is bad, either the network architecture must be altered or the data used for training are inadequate, i.e. the problem might be ill-posed. Furthermore, by monitoring test performance the generalization ability can be estimated during the training process and training can be stopped before over-fitting occurs, i.e. before full convergence on the training data.
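For a fully connected feed-forward architecture the number of weights, and hence r, can be counted directly from the layer sizes; the short sketch below (a hypothetical example with an arbitrary layer layout and training set size) illustrates the rule of thumb:

```python
def count_weights(layer_sizes, with_bias=True):
    """Number of free parameters of a fully connected feed-forward network."""
    n = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        n += n_in * n_out          # connection weights between adjacent layers
        if with_bias:
            n += n_out             # one bias (threshold) value per non-input neuron
    return n

layers = (7, 3, 1)                 # hypothetical: 7 descriptors, 3 hidden neurons, 1 output
n_weights = count_weights(layers)  # 7*3 + 3 + 3*1 + 1 = 28 free variables
n_train = 100                      # assumed number of training compounds
r = n_train / n_weights
print(n_weights, round(r, 2))      # here r is about 3.6, above the recommended minimum
```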
During SNN training the weights and threshold values of the network are determined, which is an optimization procedure. For training, a set of data vectors plus one or more functional values per data point are required. The optimization task is to adjust the free variables of the system in such a way that the network computes the desired output values. The difference between the actual output and the desired output is minimized; a frequently used error function is the sum-of-squares error (N is the number of data points (patterns) in the training set):

\[ E = \sum_{i=1}^{N} \bigl(\mathrm{output}_i^{\mathrm{actual}} - \mathrm{output}_i^{\mathrm{desired}}\bigr)^2. \]
The sum-of-squares error function is a special case of a general error function called the Minkowski-R error (Bishop, 1995):

\[ E = \sum_n \sum_{k=1}^{c} \bigl| y_k(x^n; w) - t_k^n \bigr|^R, \]

where y_k is the actual network output, t_k is the desired (target) output vector, x is the input vector, and w is the weight vector. The summation over k will be omitted if only a single output neuron is used. With R = 2 the Minkowski-R error reduces to the sum-of-squares error function; R = 1 leads to the computation of the conditional median of the data rather than the conditional mean as for R = 2. The use of an R value less than 2 can help reduce the sensitivity to outliers, which are frequently present in a training set, especially if experimental data are used.

Several optimization algorithms can be used, e.g. gradient descent techniques, simulated annealing procedures, or evolutionary algorithms performing some kind of adaptive random search. Most frequently applied are gradient techniques using the `back-propagation of errors' (bp) algorithm for weight update (Rumelhart et al., 1986). Gradient descent follows the deepest descent on the error surface:

\[ \nabla E(w) = \Bigl( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_l} \Bigr). \]

For network training each weight value is incrementally updated according to:

\[ \Delta w_i = -\gamma \frac{\partial E}{\partial w_i}; \quad i = 1, 2, \ldots, l. \]

γ is a constant defining the step size of each iteration along the downhill gradient direction. If the weights of a two-layered network (Fig. 2a) must be updated, we make these weight changes individually for each input pattern x^μ in turn, and obtain the `delta rule' (also termed `Adaline rule', `Widrow–Hoff rule', or `LMS (least-mean-square) rule') (Rumelhart et al., 1986; Widrow and Hoff, 1960):

\[ \Delta w_{i,k} = \gamma \bigl( t_i^{\mu} - y_i^{\mu} \bigr) x_k^{\mu}. \]

Using a continuous differentiable activation function for the hidden units of a multi-layered network, it is an easy step to calculate the update values (`deltas') for hidden unit weights following the bp procedure. A thorough description of this algorithm can be found in many textbooks on neural networks (see, e.g. Bishop, 1995). Several modifications of bp have been described (Hertz et al., 1991). As with all optimization techniques, gradient methods can have problems with premature convergence due to local energy barriers or plateaus of the error surface (McInerny et al., 1989). This will be no problem if the network is small enough to define a comparatively low-dimensional space
lacking striking local optima. However, if large networks are trained, classical bp-based methods might easily get trapped in a local optimum. For training of complicated networks, adaptive evolutionary optimization in particular can be useful (Schneider et al., 1996). Besides considerations concerning network architecture, considerable effort must be spent in selecting an appropriate SNN training technique. A description of maximum likelihood techniques and Bayesian learning is provided by Bishop (1995) and Ripley (1996).
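As an illustration of per-pattern gradient descent (our own minimal sketch with made-up toy data; the extension to hidden layers by full bp is omitted), a single sigmoidal output neuron can be trained on the sum-of-squares error. Note that, compared with the delta rule for linear units given above, the sigmoidal activation contributes an additional derivative factor y(1 − y):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training set: 4 patterns with 3 descriptors each and one target value in [0, 1]
X = rng.random((4, 3))
t = np.array([0.0, 1.0, 1.0, 0.0])

w = rng.normal(scale=0.1, size=3)        # weights (free variables)
theta = 0.0                              # bias / threshold
gamma = 0.5                              # step size along the downhill gradient direction

for epoch in range(2000):
    for mu in range(len(X)):             # update for each input pattern x^mu in turn
        y = sigm(w @ X[mu] - theta)
        delta = (t[mu] - y) * y * (1.0 - y)   # error times derivative of the sigmoid
        w += gamma * delta * X[mu]
        theta -= gamma * delta           # bias enters the activation with a minus sign

y_pred = sigm(X @ w - theta)
print(np.round(y_pred, 2), "targets:", t)
```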
2.4. Testing the generalization ability of neural network models

To assess the usefulness and applicability of trained SNN and UNN, the systems must be tested, just as any other prediction tool used in (Q)SAR. This can be done in several ways, and a variety of statistical techniques serve this purpose (for a review of rule extraction and generalization, see Denker et al. (1987)). Cross-validation and bootstrapping procedures are well suited; see Kubinyi (1993a) for details on significance and validity of QSAR equations. One simple possibility is to divide the data available into only two sets, a training set and a test set. The training data are randomly divided into subsets for cross-validation of the training results and to optimize network architecture. The test set contains independent data which are presented to a fully trained network to evaluate its prediction or recognition ability. A slightly different method is to perform a complete leave-one-out procedure, in which as many networks are trained as there are data points. One network is trained at a time using all but one data point for training. The performance of the networks on the left-out compound can be used to assess the accuracy of test data prediction. Statistical measures like correlation coefficients, information indices, and various error functions are useful for this purpose (Ash, 1965; Kubinyi, 1993a; Ripley, 1996). A cross-validated r², for example, is frequently used in SAR studies. It provides an estimate of model predictivity by comparing the predictive sum of squares against the sum of squared deviations of each desired network output, y_obs, from its mean, ⟨y_obs⟩ (Good, 1995):

\[ \text{cross-validated } r^2 = 1 - \frac{\sum (y_{\mathrm{obs}} - y_{\mathrm{pred}})^2}{\sum (\langle y_{\mathrm{obs}} \rangle - y_{\mathrm{obs}})^2}. \]

This technique allows different models containing different numbers of components (e.g., numbers of network layers, neurons and weights) to be evaluated. For estimating the accuracy of a two-state prediction, the correlation coefficient according to Matthews (1975) is frequently employed:

\[ cc = \frac{PN - OU}{\sqrt{(N+U)(N+O)(P+U)(P+O)}}, \]

where P is the number of correctly predicted `positive' examples (state 1), N is the number of correctly predicted `negative' examples (state 2), O is the number of false-positive, and U is the number of false-negative predictions. The value of cc lies in [−1, 1], where cc = 1 indicates perfect prediction.
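Both figures of merit are straightforward to compute once predictions have been collected; the sketch below (our own illustration with made-up numbers) follows the two formulas of this section:

```python
import numpy as np

def cross_validated_r2(y_obs, y_pred):
    """1 - predictive sum of squares / sum of squared deviations from the mean."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    press = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((np.mean(y_obs) - y_obs) ** 2)
    return 1.0 - press / ss_tot

def matthews_cc(P, N, O, U):
    """Matthews (1975) correlation coefficient for a two-state prediction.

    P: true positives, N: true negatives, O: false positives, U: false negatives.
    """
    return (P * N - O * U) / np.sqrt((N + U) * (N + O) * (P + U) * (P + O))

# Made-up leave-one-out predictions for six compounds (arbitrary numbers)
y_obs  = [5.1, 6.3, 4.8, 7.0, 5.9, 6.5]
y_pred = [5.0, 6.0, 5.2, 6.6, 6.1, 6.2]
print("cross-validated r2:", round(cross_validated_r2(y_obs, y_pred), 3))

# Made-up counts of a two-state prediction
print("cc:", round(matthews_cc(P=40, N=45, O=5, U=10), 3))
```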
2.5. Special network architectures and hybrid models

Multivariate display methods are of considerable importance for data visualization in molecular design and SAR, as well as for reduction of data dimensions (Devillers, 1995). The aim is to reveal the patterns hidden in complex data (Hudson et al., 1989; Domine et al., 1996). A number of different methods, which can be categorized as being linear or non-linear, have been described. Popular techniques in the field of (Q)SAR are principal component analysis (PCA), which is a linear method, multidimensional scaling (MDS) (Shepard, 1962a,b; Kruskal, 1964), and Sammon mapping (Sammon, 1969; Devillers, 1995; Agrafiotis, 1997), a non-linear technique. Of considerable interest is a neural network approach termed `encoder' networks or ReNDeR (reversible non-linear dimension reduction) (Livingstone, 1996). The architecture of an encoder network is illustrated in Fig. 5. The network is symmetrical around a central parameter layer. The number of input and output neurons is defined by the dimension of the data vectors, and the task of the system is to reproduce the input data at the output layer (auto-association) via an internal representation which is of lower complexity than the original data. In Fig. 5 the parameter layer consists of only two neurons (although this layer can have an arbitrary number of neurons) for a reduction of the data to only two dimensions (factors). Once the network weights are optimized by conventional supervised training techniques, the input data can be described by the output values of the neurons in the parameter layer. Two or three neurons are especially useful for graphical display.
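A minimal sketch of the encoder idea is given below (our own illustration with made-up descriptor data, not code from the cited work); for simplicity it uses the linear variant without intermediate hidden layers, trained by gradient descent on the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: 50 'compounds' described by 7 partly correlated descriptors
X = rng.random((50, 7))
X = X - X.mean(axis=0)                      # center the descriptors

n_in, n_param = X.shape[1], 2               # 7 inputs, 2-neuron parameter layer
W_enc = rng.normal(scale=0.3, size=(n_in, n_param))   # input -> parameter layer
W_dec = rng.normal(scale=0.3, size=(n_param, n_in))   # parameter layer -> output

gamma = 0.1                                 # gradient descent step size
for epoch in range(5000):
    Z = X @ W_enc                           # low-dimensional internal representation
    err = Z @ W_dec - X                     # reconstruction error (auto-association)
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= gamma * grad_dec
    W_enc -= gamma * grad_enc

Z = X @ W_enc                               # coordinates for a planar display of the data
print("mean squared reconstruction error:", round(float(np.mean((X - Z @ W_dec) ** 2)), 4))
print("first three data points in the parameter plane:\n", np.round(Z[:3], 3))
```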
Fig. 5. Architecture of an encoder network (ReNDeR) mapping a seven-dimensional input to two dimensions. White circles: fan-out neurons; gray circles: sigmoidal neurons; black circles: linear neurons. After network training the data projection onto a plane can be displayed by plotting the output of the central neurons.
If no intermediate hidden layers are used and the neurons are linear, the factors found by the encoder network are equivalent to principal components (Hertz et al., 1991). The presence of hidden layers containing non-linear neuron activities enables the system to perform non-linear mappings of the data, quite similar to Kohonen mapping (Salt et al., 1992). A unique attraction of encoder networks is, however, the possibility to (re)construct data of the original dimension directly from low-dimensional projections. In a two-dimensional map derived from high-dimensional QSAR data, for example, any position is directly linked to the corresponding original space. This might be useful for compound selection with a desired property (or `activity') by inspection of a low-dimensional graphical display. Until now, however, this design technique is still in its infancy. Applications, advantages and drawbacks of the method have been critically discussed (Reibnegger et al., 1993; Livingstone, 1996).

An example of data visualization and classification using principal component analysis, Sammon mapping, a Kohonen-network, and an encoder network is shown in Fig. 6 to discuss some of the specific properties of each method. PCA was performed using the commercially available software package TSAR V3.1 (Oxford Molecular, Oxford); all other software tools were developed in-house at Roche (Schneider, unpublished). For the example application, the 20 naturally occurring amino acids (cystine was treated as cysteine) were described by the respective values of seven partly highly correlated physicochemical properties (Schneider and
Fig. 6. Mapping of amino acid residues by different methods. Seven-dimensional descriptions of amino acid residues were projected to planar display by principal component analysis (PCA, upper left), Sammon mapping (upper right), a non-linear encoder network (lower left), and a 4 × 4 Kohonen-network (lower right). The locations of the individual amino acids in the maps are indicated by their single-letter code.
Wrede, 1993a): hydrophobicity (Engelman et al., 1986), volume (Harpaz et al., 1994), bulkiness (Jones, 1975), surface area (Chothia, 1975), hydrophilicity (Hopp and Woods, 1981), and refractivity and polarity (Jones, 1975). The task was to generate two-dimensional projections of the seven-dimensional data. In our example the four techniques led to comparable results, i.e. several groups (clusters) of amino acid residues have become visible. PCA reveals some clear clusters by linear projection onto the plane spanned by the first two principal components, pc1 and pc2 (Fig. 6, upper left). The principal components can be regarded as linear combinations of features with high variance. This method is fast, and the relative explanation power of the features (data variance covered) can be expressed by the corresponding eigenvalues (for review, see Jackson, 1991). However, PCA is limited by its linearity: projections from high-dimensional space to low-dimensional space can produce misleading relationships between individual data points, especially if high-dimensional data are analyzed. This effect has been studied in detail by Devillers (1995). In contrast to PCA, Sammon mapping aims at retaining the original relative orientations of the data points in low-dimensional space (Fig. 6, upper right). The non-linear projection generated might, therefore, more accurately display the relative distances between the data. However, Sammon mapping is very time-consuming if many data points are used, and the `factors' extracted have no statistical explanation power. Furthermore, the maps generated result from an optimization procedure and are thus not fully reproducible. The same objections hold for the encoder network (Fig. 6, lower left). Nevertheless, there is a potential application of encoder networks to molecular design, as mentioned above. The Kohonen-network led to a reproducible, non-linear projection of the data (Fig. 6, lower right). In addition, an amino acid classification scheme is generated automatically. The major drawbacks of this technique are the loss of information about the original distances between the data points, and the lack of statistical meaning of the network axes. Despite the obvious limitations of the individual methods, combinations of different statistical approaches and neural networks, especially PCA and Kohonen-networks, complement each other well and can be very useful for molecular diversity analysis, compound clustering, and visualization (Section 4). A similar conclusion was drawn by Kocjančič and Zupan (1997), who compared PCA, a Kohonen-network, and an encoder-type network taking the classification of olive oils as an example.

Two types of hybrid networks shall briefly be mentioned here: radial basis function (RBF) networks (Moody and Darken, 1989; Niranjan and Fallside, 1990) and counter-propagation networks (CPN) (Hecht-Nielsen, 1987, 1988; Huang and Lippmann, 1988). RBF networks provide a framework combining both unsupervised and supervised network training. A special advantage of these networks is the short training time needed compared to multi-layered SNN methods (Hertz et al., 1991; Bishop, 1995). Their basic principle is to use combinations of localized functions (e.g. Gaussians) to approximate a given input/output relationship, where each basis function acts like a hidden unit. The idea is to pave the part of the input space where the input data are located with their receptive fields.
The application of RBF networks to molecular design and (Q)SAR studies has been limited so far, although they belong to the most widely used neural networks in other areas. It appears likely and reasonable that these systems will soon be adapted for SAR modelling and molecular design. Recently a comparison of multi-layered feed-forward networks with RBF networks for the task of peptide classification has been published (Schneider et al., 1995a).
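The division of labor between the unsupervised and the supervised part of RBF training can be sketched in a few lines (our own illustration on a made-up one-dimensional toy problem, not code from the cited studies); here the basis function centers are simply placed on a regular grid, although k-means or competitive learning could be used instead, and the hidden-to-output weights are then fitted by linear least squares:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression task: approximate a non-linear function of one descriptor
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = np.sin(x).ravel() + 0.05 * rng.normal(size=len(x))

# 'Unsupervised' part: choose the centers of the localized basis functions
centers = np.linspace(-3, 3, 8).reshape(-1, 1)
width = 1.0

def design_matrix(x, centers, width):
    """Gaussian basis functions, one localized 'hidden unit' per center."""
    d2 = (x - centers.T) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

# 'Supervised' part: fit the hidden-to-output weights by linear least squares
Phi = design_matrix(x, centers, width)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_fit = Phi @ w
print("mean squared error of the RBF fit:", round(float(np.mean((y - y_fit) ** 2)), 4))
```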
Counter-propagation networks (Hecht-Nielsen, 1987; Carpenter and Grossberg, 1991) have found several applications in the various fields of (Q)SAR (Peterson, 1992, 1995; Zupan and Gasteiger, 1993; Novič et al., 1997; Vračko, 1997) and sequence analysis (Wu, 1996). Their architecture consists of two layers, a Kohonen layer influenced by the data vectors and an output layer (Grossberg layer) influenced by target values (`correct' or `expected' answers). The weights connecting the input and the hidden layer are determined by a standard competitive learning rule in a similar manner to Kohonen-network training. As a result, the hidden units divide up the input space in a Voronoi tessellation. A set of targets must, however, be known to train the output layer using the conventional delta rule. The greatest appeal of CPN is their speed; they can be trained 10–100 times faster than bp-networks, yielding comparable results. An introduction to CPN systems for chemical applications can be found in Zupan and Gasteiger (1993).

Adaptive resonance theory (ART) based artificial neural networks were originally designed to model real-time pattern recognition in the brain (Grossberg, 1976a,b; Carpenter and Grossberg, 1991). Recently they were adapted for QSAR modelling (Wienke et al., 1996). Most ART networks belong to the group of UNN. Their principle is to categorize data by checking whether an input pattern fits into (resonates with) an existing network structure or not. In the resonance case weight adaptation takes place. If, however, the input pattern belongs to a hitherto unseen data class, a new weight vector will be initialized, and the network grows. ART systems can help to solve the `stability–plasticity dilemma' of most ANN because they are stable enough to preserve old features but are flexible enough to change their structure and incorporate new information. It has been argued that the original ART networks are somewhat tricky to adjust and are very sensitive to noisy data (Hertz et al., 1991); recent developments and combinations with fuzzy systems may help to overcome these problems (Kosko, 1992; Zupan and Gasteiger, 1993; Wienke et al., 1996; Wienke and Buydens, 1996).

A further neural network technique, termed `holographic neural network' (Sutherland, 1990), has recently been introduced to the analysis of chemical data (Burden, 1998). The method provides a discriminant technique which seems to be especially suited for classification of overlapping data sets. An attractive property of this system is that it gives a definitive model, unlike a bp-network, which may lead to a different model each time it is trained. However, a rigorous comparison to bp-networks and other network types remains to be done. In a number of simplifying test cases the holographic network technique was comparable, sometimes even superior, to conventional statistical discriminant methods (Burden, 1998). Vračko (1997) describes a `spectrum-like' representation of geometric and electronic molecular attributes in a counter-propagation network approach which follows a concept related to a `holographic' representation of molecules.
3. Molecular descriptors and data pre-processing

An important concept of three-dimensional molecular similarity was coined by Carbo and co-workers, who proposed a similarity index originally based on electron density distributions (Carbo et al., 1980). Many other similarity measures have been invented based on very different 2D and 3D descriptors. An idea of neural networks is to automatically find a
similarity function for a given data representation rather than using a pre-defined similarity index (an elegant combination of established similarity measures and ANN has been developed recently (So and Karplus, 1997a,b); cf. Section 4.2). However, the data used for neural network training (or development of any other SAR model) must be represented in a way that allows underlying trends or features to be recognized. In other words, molecules that behave similarly in a certain context or have a comparable biological activity should have a similar representation in terms of molecular descriptors, because molecular design involves recognition of patterns and subsequent attempts to classify the patterns. To obey the `principle of strong causality' (Rechenberg, 1973; Holland, 1975), a prerequisite for any rational optimization, similar molecular activity should correspond to similar molecular descriptors or representations. However, several cases are known where small changes in the molecular structure of a drug resulted in drastic functional changes, sometimes even in a complete loss of activity (for review, see Böhm et al., 1996). It is evident that the relative importance of individual molecular features varies depending on the particular molecular activity studied. ANN are able to recognize and extract those features which are responsible for activity and classify compounds accordingly. Nevertheless, it is essential to focus on the appropriate level of abstraction for descriptor type selection. This means, for example, that a coarse-grain pharmacophore description might be well suited for a rough classification of data base entries, whereas a more detailed description containing, e.g., information about specific binding modes of potential enzyme inhibitors will be required for quantitative SAR modelling. It is desirable to develop low-dimensional descriptors which contain exactly the information needed to answer a particular question. To a certain extent the need for this view of molecules in terms of patterns stems from shortcomings of current ab initio methods, mainly the long computer time needed for calculation of molecular properties. Pattern recognition is a heuristic approach to derive structure–activity relationships. Keeping this in mind, a complete molecular design pattern may be defined by molecular syntax (the `logical' structure, topology index, molecular building blocks) and/or semantically with respect to a specific context-dependent meaning (e.g., experimental values or physicochemical properties). For different representation schemes, a more detailed discussion, and alternative views, see the literature (e.g., Ugi et al., 1990; Balaban, 1997; Klein and Babic, 1997; Randic, 1997; So and Karplus, 1997a). The relative importance of any one of the descriptors may vary fundamentally for different molecular `activities' studied. For example, the positive charge of a lysine residue located in the active site of an enzyme can be important for the formation of an ionic interaction with a negatively charged inhibitor, but located within the hydrophobic core of the protein its lipophilic character might be crucial. Understanding this context dependency of individual molecular descriptors is a key to realistic SAR models. Due to their architecture and inherently parallel way of information processing and feature extraction, ANN are well suited for this purpose. Three major descriptor types containing information about
- overall (`bulk') molecular properties (e.g. lipophilicity, shape, overall charge),
- molecule architecture (2D or 3D structural information), and
- site-specific properties (e.g. fragment lipophilicity, point charges)
are frequently employed for representations of various types of molecules. Ever since the pioneering work of Hansch et al. (1962, 1963) a vast number of different molecular descriptors has been invented to establish useful QSAR equations for small organic compounds. Comprehensive introductions to the principles of such descriptors for QSAR can be found in the literature (Kubinyi, 1993a,b; Dean, 1995; van de Waterbeemd, 1995a,b; Hansch and Leo, 1995; Böhm et al., 1996). The potential of multivariate pattern recognition techniques in connection with potent molecular descriptors has been highlighted (Hyde and Livingstone, 1988). An extensive compilation of molecular descriptors for electronic, steric, and hydrophobic interactions is available from a publication of Jurs et al. (1995). For small molecules, fast 3D structure generators based on 2D representations are at hand, which allow 3D descriptors to be derived if no experimental spatial structures are available (Sadowski and Gasteiger, 1993, 1994). In particular the following 2D and 3D descriptors are often employed in (Q)SAR studies, including neural network approaches: electronic, hydrophobic and steric substituent constants; partition coefficient (Log P, Log D); topological and connectivity indices; substructure-based descriptors; molecular shape indices; electronic whole-molecule descriptors; reactivity indices; geometric descriptors (volume, surface area, charged partial surface area). Several hundred different descriptors have been invented to cover the respective context of a SAR, and there seems to be a set of `basic invariants' of molecular graphs which can be calculated directly from the molecular structure of an organic compound, as claimed by Baskin et al. (1993, 1995, 1997).

Table 2. First five principal components (PC) derived by PCA from a collection of 143 amino acid properties (for PCA the original data of Tomii and Kanehisa (1996) was used); columns scaled to unit variance

Amino acid residue   PC 1    PC 2    PC 3    PC 4    PC 5
Ala                   0.11    0.32    2.09    0.54    0.51
Cys                   0.38   -0.78   -0.33   -1.54    2.03
Asp                  -1.38    0.46   -0.07    0.09    1.33
Glu                  -0.74    1.76    1.01    0.84    1.32
Phe                   1.36   -0.27   -0.49    0.04    0.14
Gly                  -1.30    2.09    1.09   -1.32   -0.35
His                   0.00    0.77   -0.74   -0.87    0.36
Ile                   1.54   -0.67    0.37    0.48   -0.81
Lys                  -0.67    1.68   -0.09    0.13   -1.69
Leu                   1.33    0.06    1.35    0.81   -0.41
Met                   1.27    0.66    0.33   -0.13    1.16
Asn                  -1.20   -0.10   -0.39   -1.16    0.21
Pro                  -1.28   -1.62   -1.09    3.34    0.36
Gln                  -0.45    1.03   -0.09    0.11   -0.19
Arg                  -0.47    1.30   -0.92   -0.37   -1.80
Ser                  -0.97   -0.74    0.52   -0.54   -0.47
Thr                  -0.36   -0.60    0.31   -0.27   -0.84
Val                   1.23   -0.70    0.90    0.00   -1.02
Trp                   1.19    0.16   -1.89    0.21    0.93
Tyr                   0.43   -0.62   -1.87   -0.38   -0.74
In Table 2 a list of five descriptor scales for amino acid residues is given, which were derived by PCA from 143 property scales collected by Tomii and Kanehisa (1996). These uncorrelated (orthogonal) scales can be used for representation of amino acid residues in SAR studies and neural network training. PC1 is highly correlated with hydrophobicity scales, PC2 with helix propensity, PC3 with volume and amino acid frequency, and PC4 with polarity. PC5 does not show significant correlation to any of the 143 original properties. These results are in accordance with an earlier work based on a collection of 180 property scales (Kidera et al., 1985). Recently a set of amino acid descriptors for QSAR modelling has been published which is based on ab initio quantum mechanical calculations (Norinder and Svensson, 1998). It may well be that these descriptors complement the information contained in more empirically derived descriptors.

A challenging idea is to directly derive a context-dependent weighting scheme from formal representations of molecules, rather than pre-calculating or selecting a set of physicochemical descriptors (for a thorough discussion, see Randic, 1997). This is a field where neural networks excel, and several such neural network applications have been described for solving SAR and drug design problems (Aoyama et al., 1990; Aoyama and Ichikawa, 1991, 1993; Gasteiger and Zupan, 1993; Barlow, 1995; Baskin et al., 1993, 1997). Qian and Sejnowski (1988) published the methodological principle, taking the prediction of protein secondary structure from the amino acid sequence as an example. Their idea was to represent each amino acid residue by a 20-digit unitary number, i.e. a number consisting of 19 zeros and a single 1 (also termed `sparse encoding' or `distributed encoding' (Hirst and Sternberg, 1992; Presnell and Cohen, 1993; Schneider and Wrede, 1993b)). A SNN trained on secondary structure prediction develops input weights for each digit. After training, these weights express context-dependent `properties' of the individual residues. In other words, the network input weight vector introduces a residue ordering: similar weights correspond to similar residues. Following this approach, ANN introduce a context-sensitive similarity criterion. A related concept was followed by Brunak and co-workers (Tolstrup et al., 1994). They trained a network to classify a formal representation of the 61 nucleotide triplets of the genetic code into 20 amino acid categories. The resulting internal network representation of the amino acids strongly correlates with the GES scale of transfer free energies (Engelman et al., 1986). Sometimes graphical display of weight vectors by Hinton diagrams (Qian and Sejnowski, 1988) can help in understanding the features extracted. The numerical weight values indicate contributions of an input variable (if input weights are analyzed) or individual features (if hidden weights are analyzed) to the SAR model (Schneider et al., 1993, 1995b, 1998). A method for estimating the relative importance of individual input variables of encoder networks has also been described (Kocjančič and Zupan, 1997). The unitary coding scheme leads to an orthogonal representation of the entities described (residues, functional groups, or atoms). An advantage is that ANN can automatically derive an ordering of the entities based on the training data. However, a clear disadvantage is the fact that this data description often is high-dimensional.
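The unitary (`sparse') coding scheme can be written down in a few lines; the sketch below (our own illustration, with a hypothetical five-residue sequence window) produces the binary vector that would be fed to the network input layer and makes the high dimensionality of this representation apparent:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 one-letter codes

def sparse_encode(sequence):
    """Unitary ('sparse') encoding: one 20-digit binary number per residue,
    with a single 1 marking the residue type and 19 zeros elsewhere."""
    code = np.zeros((len(sequence), 20), dtype=int)
    for pos, residue in enumerate(sequence.upper()):
        code[pos, AMINO_ACIDS.index(residue)] = 1
    return code.ravel()                   # concatenated into one network input vector

window = "GASKL"                          # hypothetical 5-residue sequence window
x = sparse_encode(window)
print(len(x))                             # 5 x 20 = 100-dimensional input vector
```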
This in turn demands a large number of training data, which might be available for protein or nucleic acid sequence analysis, but in most cases only a few in vivo or in vitro pharmacological data are at hand. Correlation functions are particularly attractive for the representation of both small molecules and polymer sequences because they lead to a compact description containing relevant
Fig. 7. Principle of the auto-correlation method for 2D representations of molecules. Step 1: construction of the molecular graph. Step 2: calculation of the shortest distance between all atoms (expressed as number of bonds). Step 3: selection of a property for correlation; here the property `nitrogen' was used. This led to entries in the correlation matrix for the atom pairs (1/7) and (7/1) (auto-correlation of nitrogens over four bonds). For calculation of the final auto-correlation vector, see the text.
structural information (Jones, 1975; Moreau and Broto, 1980; Wagener et al., 1995). For each molecule the resulting correlation vectors have identical length, and the vector elements contain equivalent information. This is a necessary prerequisite for feature extraction by ANN because the input layer size of a network is usually fixed. Fig. 7 shows the principle of the atom-based auto-correlation method for the 2D case. First the molecular graph neglecting hydrogens is determined (step 1). Then, a table is calculated giving the inter-atomic distances expressed as numbers of bonds between pairs of atoms, d(i,j), thereby providing the neighborhood relations between atoms (step 2). In step 3, an atomic property P is selected (here, the property `nitrogen' was used, and for nitrogen atoms `1' is added to the table), and the auto-correlation vector for the property selected, AV = (C_0, C_1, C_2, ..., C_N), is calculated, where

C_n = \sum_{i,j\,:\,d(i,j)=n} P(i)\,P(j).
In our example, an auto-correlation of the property `nitrogen' over four bonds is observed, leading to AV = (2,0,0,0,2,0,0,0). Following this strategy, different properties can be combined to form a composite auto-correlation vector. For example, the definition of hydrogen-bond donors and acceptors, charged groups, and lipophilic centers can provide a useful level of abstraction for the representation of molecular compounds, especially for inhibitor design (Fig. 8). An advantage of 2D over 3D correlation vectors is that the 2D structure of a molecule is always available. 2D correlation vectors can be used for rapid clustering of large sets of data (Section 4.2). However, the lack of 3D information often prevents reasonable SAR models. Gasteiger and co-workers introduced topological auto-correlation vectors and self-organizing networks for the identification of dopamine and benzodiazepine agonists in large data sets (Bauknecht et al., 1996).
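The calculation just described is easy to put into code. The sketch below follows the same three steps; the seven-atom graph is a hypothetical branched chain (not the molecule of Fig. 7), constructed so that the two nitrogens lie four bonds apart and the auto-correlation vector AV = (2,0,0,0,2,0,0,0) of the text is reproduced.

```python
from collections import deque

def bond_distances(adjacency):
    """Step 2: all-pairs shortest distances (in bonds) by breadth-first search."""
    dist = {}
    for start in adjacency:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen[neighbour] = seen[atom] + 1
                    queue.append(neighbour)
        dist[start] = seen
    return dist

def autocorrelation(adjacency, prop, max_bonds):
    """Step 3: C_n = sum of P(i)*P(j) over all ordered atom pairs with d(i,j) = n."""
    dist = bond_distances(adjacency)
    av = [0.0] * (max_bonds + 1)
    for i in adjacency:
        for j in adjacency:
            n = dist[i][j]
            if n <= max_bonds:
                av[n] += prop[i] * prop[j]
    return av

# Step 1: hypothetical molecular graph (hydrogens neglected); atoms 1 and 7 are
# nitrogens lying four bonds apart.
graph = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3, 6, 7], 5: [3], 6: [4], 7: [4]}
nitrogen = {atom: 1.0 if atom in (1, 7) else 0.0 for atom in graph}
print(autocorrelation(graph, nitrogen, 7))   # [2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
```

Composite descriptors can be obtained in the same way by concatenating such vectors for several properties (hydrogen-bond donors, acceptors, charges, lipophilic centers).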
Fig. 8. Location of functional groups in the 2D structure of the thrombin inhibitor PPACK. Lip: lipophilic; HA: hydrogen-bond acceptor; HD: hydrogen-bond donor; +: positive charge.
Recently an application of auto-correlation vectors as structural descriptors for modelling the antimalarial activity of molecules has been described (Calas et al., 1997). The pioneering work of Jones (1975) introduced correlation methods for protein sequence analysis. A useful grouping of amino acid residues has been conceived by Taylor (1986) (Fig. 9). This classification scheme has been shown to be very useful for feature extraction from amino acid sequences (King and Sternberg, 1990; Schneider and Wrede, 1993c), and correlations between these residue classes provide valuable sequence information. An example is shown in Fig. 10. Here, the N-terminal parts of eukaryotic protein precursor sequences were represented by the auto-correlation of the ten basic Taylor groups (Fig. 9). Correlations over up to ten bonds were considered, leading to 10 × 10 = 100-dimensional auto-correlation vectors.
Fig. 9. Venn diagram of amino acid residues according to Taylor (1986).
Fig. 10. Kohonen maps of N-terminal protein sequences. Neurons are drawn as squares. A Kohonen-network with 10 × 10 output neurons arranged in the plane was trained by cross-correlation vectors of amino acid residue classes derived from N-terminal parts of protein precursor sequences. The location of the individual sequences in the final map is shown separately for cytoplasmic, secreted, mitochondrial, and nuclear proteins. Neuron shading gives the fraction of the respective sequence class clustered (white: no sequence found in the neuron; black: only sequences belonging to the sequence class under investigation were grouped together).
A 10 × 10 Kohonen-network was used to classify these data. The clustering results are unambiguous (Fig. 10): N-terminal portions of protein precursor sequences contain information about the protein targeting route, and this information can be expressed by the special coding scheme selected. The Kohonen-network was able to map the high-dimensional descriptor space onto a plane. The corresponding sequence space seems to be divided into three major regions: cytoplasmic, secretory, and mitochondrial (proteins from other compartments were not included in the analysis; previous studies using self-organizing feature maps for automatic sequence classification revealed relationships between mitochondrial and chloroplast targeting signals (Schneider et al., 1997)). Nuclear proteins were clustered together with cytoplasmic sequences. Indeed, the targeting signals of nuclear proteins are not located at the N-terminus;
rather, internal or C-terminal parts seem to contain the respective information (Silver and Goodson, 1989).
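For readers who wish to reproduce this kind of clustering, the following sketch shows a minimal Kohonen (self-organizing map) training loop; grid size, learning-rate and neighbourhood schedules are illustrative assumptions and not the settings of the studies discussed above.

```python
import numpy as np

def train_kohonen(data, grid=(10, 10), epochs=30, lr0=0.5, radius0=5.0, seed=0):
    """Minimal self-organizing map: project high-dimensional vectors onto a 2D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = 1.0 - step / total_steps
            lr, radius = lr0 * frac, max(radius0 * frac, 0.5)
            # Best-matching unit: the neuron whose weight vector is closest to x.
            bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (rows, cols))
            # A Gaussian neighbourhood on the grid pulls nearby neurons towards x.
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            h = np.exp(-grid_dist ** 2 / (2.0 * radius ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def best_matching_neuron(weights, x):
    """Grid position of the neuron closest to descriptor vector x."""
    d = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

# e.g. clustering 500 hypothetical 100-dimensional correlation vectors:
# weights = train_kohonen(np.random.rand(500, 100))
```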
4. Applications of neural networks to protein and nucleic acid sequence analysis, compound classification and molecular design

In this chapter we summarize recent advances in protein and nucleic acid sequence analysis and present current concepts along with some worked examples for compound clustering and the design of peptides and small organic molecules. Several pioneering ANN applications have already been discussed in the previous chapters. For a comparison of ANN with other techniques and concepts of artificial intelligence in sequence analysis, see the literature (Hunter, 1993; Doolittle, 1996).

4.1. Feature extraction and classification of biopolymers

Identification of potential drug targets and the localization of structural and functional features in biopolymers are of great importance at the onset of a drug development project. Molecular biology has entered the phase of sequencing the complete genomes of many different organisms. The genomes of three archaebacteria (Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus), ten eubacteria (Haemophilus influenzae, Mycoplasma genitalium, Synechocystis sp., Mycoplasma pneumoniae, Helicobacter pylori, Escherichia coli, Bacillus subtilis, Borrelia burgdorferi, Aquifex aeolicus, Treponema pallidum), and the yeast Saccharomyces cerevisiae have been deciphered. Knowing the DNA sequence of a genome, that is, several million base pairs, makes automatic identification and classification tools absolutely necessary. New insight into the high complexity of gene arrangements as well as of the gene structures themselves can be obtained from both rigorous comparisons of individual nucleotide sequences and whole genomes (Tatusov et al., 1997). Gene families which are present in bacteria only are potential targets of new antibiotics. However, the biological function of derived protein primary structures is still unknown in up to 30% of the cases (Strachan and Read, 1996). The aim is to find decisive feature(s) in a biopolymer sequence as an indicator of biological function. It is well known that in most cases the appropriate SAR reflects a non-linear correlation, and ANN have proven their value for automatic non-linear feature extraction from biopolymer sequences. In the following, several recent advances in ANN applications to the identification of promoter sites, splice sites and ribosomal binding sites, the prediction of protein structure, the identification of membrane proteins and targeting signals, and the recognition of glycosylation sites are described. Several years ago extensive reviews appeared summarizing earlier applications of ANN in molecular biology (Hirst and Sternberg, 1992; Presnell and Cohen, 1993). The steady increase of publications ever since reflects the usefulness of ANN approaches to many different classification and modelling tasks in molecular biology and biochemistry.
4.1.1. Identification of promoter sites and polymerase binding regions

Almost from the beginning of molecular biology in the early sixties, transcription signals, in particular promoter regions, have been characterised. Sequencing studies of many promoter regions in E. coli and other gram-negative eubacteria revealed characteristic patterns around positions −10 and −35 upstream of the transcription start point (Rosenberg and Court, 1979; Lisser and Margalit, 1993). In E. coli the DNA-dependent RNA polymerase consists of four subunits (also designated as E) and the initiation factor σ, a polypeptide of 70 kDa molecular weight. The σ factor together with the apoenzyme E recognizes promoter sequences. Identification of promoter regions was a convincing model application for neural networks. A recent work by Pedersen and Engelbrecht (1995) used experimentally verified promoter site sequences for the training of SNN. Promoter regions were scanned using a sequence window encompassing seven nucleotides. The authors observed maximal correlation for position 0, as expected, and in addition for positions −10, −22, −33 and −44. Positions −22 and −44 were not identified in earlier work. Obviously RNA polymerase binds to more regions of the DNA than previously suggested. The sequence pattern found by the ANN, a repeated distance of 11 nucleotides, implies that the core RNA polymerase binds to one side of the DNA helix only. Another important signal is the termination of transcription. The factor-independent terminator sequence consists of a G/C-rich dyad symmetry followed by a stretch of 4–8 adjacent thymidines. Again SNN were applied to pattern recognition in these regions. Two different coding schemes were used, unitary coding and a physicochemical property (the electron–ion interaction potential, EIIP). EIIP coding leads to a one-dimensional description per nucleotide. A three-layered network and bp training led to convincing results, with 98.1% correct predictions for unitary coding and 95.6% for EIIP coding in an independent test set (Nair et al., 1994). A bp-network system was trained on the identification of E. coli promoter regions of all spacing classes (Mahadevan and Ghosh, 1994), where a three-module approach was employed: the first module recognizes consensus boxes, the second aligns the promoters to a length of 65 bases, and the third module investigates the entire sequence of 65 bases under consideration of the interactions between the bases of the promoters. Training was performed with 106 promoter sites, and the test was done with 126 examples. These neural network systems were rather successful in promoter identification, showing a prediction accuracy of 90.2% for non-promoter regions and 98% for promoter regions. In particular, mutated promoter regions in plasmids could be identified by this modular neural network system. In eukaryotes there are three different polymerases, Pol I–III, each of which has a distinct function (Fickett and Hatzigeorgiou, 1997). Identification of promoter regions in anonymous DNA sequences is an important task for gene analysis programs. Several studies of promoter sequences revealed species-specific sequences (Fickett and Hatzigeorgiou, 1997). There seems to be a non-linear relation between a sequence and its polymerase binding ability. Again, artificial neural networks turned out to reliably approximate this sequence–function relation (Cai and Chen, 1995; Kraus et al., 1996).
Several multi-layered, feed-forward networks were used to characterize promoter regions, polyadenylation sites, and mutants responsible for transcription regulation (Nair et al., 1995; Larsen et al., 1995; Matis et al., 1996). Special care was taken in constituting proper data sets for training and testing the ANN. Sequence regions of different size were analyzed to determine the local correlation coefficient. The input layer window was kept
fixed at 71 nucleotides. Regions encompassing the positions (−250; 250), (−150; 150), (−100; 100) and (−60; 60) were selected, and the number of hidden layer units was varied from 0 to 100. Negative sequence examples were taken from the promoter region in order to come close to the natural conditions of polymerase recognition (Larsen et al., 1995). For the identification of single features in promoter regions asymmetric window sizes were used, and the maximum test correlation coefficient was determined. Summarizing the results, it was concluded that the most important regions are the TATA-box (−29; −24), the Cap signal (−1; 0) and a region around (8; 10) (Larsen et al., 1995). These regions were also identified in a Shannon plot, a quantification of the information content along the sequence (Schneider and Stephens, 1990), except for the (8; 10) region. In addition, some information seems to be present around (−19; −17). Taking all results together, the concerted interplay of the Cap signal and the TATA-box is necessary for the recognition of the transcription initiation site by the polymerase. DNA–protein interactions have also been investigated by ANN. Using a bp-network, characteristic features of zinc finger proteins were extracted. Several different physicochemical descriptors as well as secondary structure propensities were used for data representation. A prediction accuracy of 97% was obtained for TFIIIA-type zinc finger DNA binding motifs, and 96% was reached for the steroid receptor-type zinc fingers (Nakata, 1995).
4.1.2. Splice site identification

In eukaryotes, gene transcription leads to pre-mRNA which is processed within the cell nucleus. Since the protein coding regions (exons) are interrupted by intron sequences, the identification of borders between introns and exons is a major task for gene analysis by bioinformatic methods. To date, no perfect prediction tool for exon/intron borders, the splice sites, in long anonymous DNA is available. Nevertheless, several neural network approaches to the identification of splice sites have been described (Brunak et al., 1991; Uberbacher and Mural, 1991; Hirst and Sternberg, 1992). One of the earliest studies was performed by Brunak and co-workers (Brunak et al., 1990), who used a multi-layer feed-forward network for the analysis of human gene sequences. The goal was to verify the assigned splice sites. Trained neural networks detected several mistakes in the published splice sites; in some genes, errors in data bank entries were found by this neural network analysis (Brunak et al., 1990). After correction the reading frame fitted well. For donor site identification the best correlation coefficient was obtained with 40 hidden units and a window size of 15 nucleotides. The Matthews correlation coefficient was cc = 0.41. Clearly this result reflects some shortcomings of data selection or network architecture, or premature convergence of the training process. Farber and co-workers described an identification system for splice sites based on different codon frequencies in exons and introns (Farber et al., 1992). Since a significant mutual information exists for adjacent codons in exons but not in introns, these differences could already be perceived by a simple perceptron. For training, back-propagation of errors was used. The prediction accuracy depended on the length of the exon regions. Exons containing 60-codon ORFs gave an accuracy of 99.4% for the best neural net scheme, which was a perceptron. The neural net turned out to be superior to conventional Bayesian prediction schemes based on 90 codons.
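Because the Matthews correlation coefficient (cc) is the figure of merit quoted throughout this and the following subsections, a small helper for computing it from a 2 × 2 confusion matrix may be useful; the counts in the usage line are invented purely for illustration.

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from true/false positive and negative counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# e.g. 80 sites found, 20 missed, 940 non-sites rejected, 60 false alarms:
print(round(matthews_cc(tp=80, tn=940, fp=60, fn=20), 2))   # 0.64
```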
`GeneParser', a combination of an ANN with dynamic programming, was developed for the identification of coding regions in genomic DNA, reaching a correlation coefficient of 0.89 for an independent test set of human genes (Snyder and Stormo, 1993, 1995). The program considers all subintervals in a sequence for content statistics. Some obvious characteristics imply non-overlapping introns and some statistically significant sequence properties of the region near the intron–exon boundary. Based on these data a dynamic programming algorithm is applied to find the combination of introns and exons that maximizes the likelihood function. Several sub-optimal solutions can be obtained, each of which is the optimum solution containing a given intron–exon junction. For a special (G,C)-rich gene a correlation coefficient of 0.94 was obtained. Another example of intron and exon identification in junk DNA used a simple bp-neural network (Granjeon and Tarroux, 1995). Junk DNA provided negative examples. The correlation coefficients obtained for the discrimination of introns and exons were 0.50 and 0.64, respectively. A recent study using simple perceptrons for the direct identification of donor splice sites revealed the importance of sequence encoding schemes and training algorithms (Malik et al., 1996). Of the algorithms tested, the evolution strategy (Rechenberg, 1973) led to the highest prediction accuracy for donor splice sites, yielding a relative trans-information of Trel = 0.723 and a Matthews correlation coefficient of cc = 0.799. A binary encoding scheme was superior to physicochemical property descriptors. The weights of the trained perceptron correspond to certain properties of the donor splice site signal and led to the design of an idealized sequence: C(C/T)(C/A)AG*GTAAGT (* indicates the splice site). The conclusion that neural networks are very reliable tools for splice site recognition is supported by Ogura and co-workers, who used a flexible neural network with different architectures (Ogura et al., 1997). Their system was able to identify splice site mutations in the factor IX gene which may cause hereditary disease. GRAIL II is a neural network system for the recognition of protein coding regions, i.e. exons, in human DNA (Xu et al., 1994). The identification of exon regions results from the generation of a candidate pool containing all possible exons, translation start sites, acceptor–donor sites and translation stops in all reading frames. Some heuristic rules are applied to eliminate a vast number of candidates. The centerpiece of the system consists of three neural networks for the exact identification of starting exon, internal exon and terminal exon candidates. The special advantages of GRAIL II are the acceptance of variable-length sequence windows and the fact that the success of exon prediction is nearly independent of exon length. This program located 93% of all exons independent of their size with an overall false-positive rate of 12%. Hebsgaard and co-workers combined ANN and a rule-based system to predict intron splice sites in DNA sequences of the plant Arabidopsis thaliana (Hebsgaard et al., 1996). The authors combined local and global feature extraction techniques and arrived at a prediction system which is claimed to be significantly better than other splice-site prediction methods. A main advantage of this system is the small number of false-positive predictions.
4.1.3. Ribosomal binding sites

Identification of the ribosomal binding site and of the initiation sites for translation in mRNA is important for characterizing anonymous genes. Ribosome binding depends on
several parameters, namely certain sequence patterns around the initiation codons, cap structures, the distance between the start codon and the Shine–Dalgarno (SD) sequence, and several sequences far up- or downstream of the ribosome binding region (Pedersen and Nielsen, 1997). The existence of several not strictly locally encoded features makes it difficult to identify such sites from a local sequence window only. In particular, secondary structures have a great influence on the strength of ribosome binding (Bisant and Maizel, 1995; DeSmit and van Duin, 1994). Besides early studies on the use of a perceptron algorithm for the recognition of translation start sites (Stormo et al., 1982), only a few studies have recently been performed. Perceptrons and a supervised multi-layer network trained on a larger data set were applied (Bisant and Maizel, 1995). The best prediction result was obtained for an input window size of 101 nucleotides and 9 hidden neurons, yielding 75% correct predictions with only 0.08% false positives.
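Most of the site-identification studies discussed in this section follow the same basic scheme: slide a fixed-size window along the sequence, encode it, and let a trained network score the central position. The generic sketch below uses unitary nucleotide coding and an abstract classifier placeholder; neither the encoding nor the toy scoring rule corresponds to the exact set-up of any of the cited works.

```python
NUCLEOTIDES = "ACGT"

def encode_nucleotide_window(window_seq):
    """Unitary (one-hot) encoding of a nucleotide window."""
    vec = []
    for base in window_seq:
        bits = [0, 0, 0, 0]
        bits[NUCLEOTIDES.index(base)] = 1
        vec.extend(bits)
    return vec

def scan_sequence(sequence, window, classifier, threshold=0.5):
    """Slide a fixed window along the sequence; report positions the classifier accepts."""
    half = window // 2
    hits = []
    for centre in range(half, len(sequence) - half):
        encoded = encode_nucleotide_window(sequence[centre - half:centre + half + 1])
        if classifier(encoded) >= threshold:
            hits.append(centre)
    return hits

# 'classifier' stands in for any trained network mapping the encoded window to [0, 1];
# here a trivial stand-in scores the G+C fraction of the window.
gc_fraction = lambda v: (sum(v[1::4]) + sum(v[2::4])) / (len(v) // 4)
print(scan_sequence("ATGCGCGGGCATATATATGGGCCC", 7, gc_fraction, threshold=0.6))
```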
4.1.4. Prediction and classification of protein structure

Surprisingly, artificial neural networks were first applied to the prediction of RNA structure and function at the beginning of the eighties, and studies on amino acid sequences were performed only several years later. The first applications focused on secondary structure prediction from the amino acid sequence (Qian and Sejnowski, 1988). With the beginning of the nineties an increasing number of ANN studies appeared concerning the structure and function of proteins (Brunak and Engelbrecht, 1996). Modelling sequence–function/structure relationships for amino acid sequences is a very complex task with many parameters to be considered and optimized. Ever since the pioneering work of Qian and Sejnowski many approaches have been developed for protein secondary structure prediction. A significant contribution was made by Rost and Sander, who introduced the multiple alignment of training sequences for a better overall accuracy and a balanced training (Rost and Sander, 1993a,b). A conventional three-layer bp-network and a data set of sequence profiles instead of single sequences led to a higher prediction accuracy compared to the original work of Qian and Sejnowski (1988). Since there is an unequal distribution of the three secondary structural elements in globular proteins, the training set was corrected accordingly. This resulted in a better prediction for β-strands. A further advance was the application of the network system to all-α, mixed-αβ and all-β proteins. Two networks were coupled: the first deals with the sequence-to-structure relation and the second with the structure-to-structure relation. The second network was trained to recognize the structural context of single residue states without any reference to the sequence. The output string of the first network feeds the input of the second network. Such a network coupling did not lead to a higher overall accuracy, but it improved the reproduction of helix and strand lengths by comparison to known average sizes of secondary structural elements. The majority vote over the concurring outputs of 12 networks, called the `jury', gives a further 2% increase in prediction accuracy. The final prediction accuracy is 70% for helix, 72% for loop and 64% for strand. Further details can be found in Rost (1996). On the way to improving the prediction accuracy for the three-dimensional fold of proteins, algorithms for the identification of super-secondary structures can be helpful. A recent work on the prediction of super-secondary structures using a three-layer network revealed correlation coefficients in the range of 0.39 to 0.57, depending on the predicted motif (the results
correspond to equivalent correctness ratios of 68% to 80.6%, with the correctness ratio = m/n, where n is the occurrence of motif A and m is the number of predicted motif A; Sun et al., 1997). There are many ideas of how to improve the prediction accuracy for protein secondary structure. One approach is to enlarge the data set as much as possible in order to catch all possible different structures. By using 318 non-homologous proteins, 67% accuracy was reached without any input information derived from multiple sequence alignments (Chandonia and Karplus, 1996). By introducing sequence profiles the performance was increased to an overall value of 72.9%. The prediction of the three-dimensional structure of a protein solely from its amino acid sequence will be of great importance for the identification of potential drug targets and structure-based molecular design. The complete genome of the yeast Saccharomyces cerevisiae revealed about 30% open reading frames for proteins with a very low similarity to known sequences (Oliver et al., 1992; Mewes et al., 1997). In the case of the human genome this ratio is even less favorable (Strachan and Read, 1996). Therefore, prediction tools are very important, although partly inaccurate (Burset and Guigo, 1996). Most of the current tertiary structure prediction methods are based on very successful sequence alignment methods (Jones, 1997). A correct prediction is mainly based on a correct alignment, and sequence–structure homology thresholds can be used as a guideline (Sander and Schneider, 1991; Abagyan and Batalov, 1997). Nevertheless, neural network systems have recently been applied to the prediction of folding classes (Bohr et al., 1993; Metfessel et al., 1993; Dubchak et al., 1993, 1995). Wu and co-workers developed a neural network approach for rapid protein family identification which is claimed to be more robust than conventional motif and profile searches and hidden Markov models (Wu et al., 1996). The diversity of proteins originates to a large extent from non-regular coil and loop structures connecting different numbers of regular secondary structure elements. Many attempts have been made to classify all kinds of loops, and it became clear that several motifs recur (Leszczynski and Rose, 1986). In order to characterize local protein structures, an auto-associative network has been developed to discover intrinsic features of protein structures (Fetrow et al., 1997). The trained network was applied to a larger database of proteins, leading to structure classification and subsequent amino acid frequency analysis, as well as to the identification of common patterns of structural building blocks (SBB). Six SBB categories were found; two of them belong to α-helical or β-strand secondary structures, while the other four are consistently found at the N- and C-termini of helices or strands, including regions identified as `loops' or `random coil'. This approach is very versatile since the classification of more detailed structural features of amino acid sequences can be deduced by the neural network (Fetrow et al., 1997). Another approach for an unbiased classification of structural building blocks of proteins was performed with self-organizing networks (Schuchhardt et al., 1996). Sequence windows encompassing eight residues were described by a sequence of the two dihedral angles φ and ψ. The structural information was derived from the PDB-Select data set (Hobohm et al., 1992).
During unsupervised network training a two-dimensional feature map containing 10 × 10 neurons was developed. Different local structural motifs were found by the network, including typical helical and β-strand elements and all kinds of loop structures. The Kohonen map reflects a topological projection of the high-dimensional input space onto a two-dimensional
map of structural elements. In one corner of the map the helical and helical-like elements were located, while in the diagonally opposite corner β-strand-like examples were found. A comparison of `trajectories' in the Kohonen map enables protein classification independent of sequence length (Schuchhardt et al., 1996). An earlier approach towards the classification of protein structures into families by an unsupervised neural network gave a 15 × 15 neuron map. The dipeptide composition of amino acid sequences served as molecular descriptor. Proteins belonging to the same family were actually clustered at the same or an adjacent neuron (Ferran et al., 1994). A self-organizing network system for pattern recognition in protein sequences, which is based on the identification of common ungapped residue stretches using a similarity matrix for sequence comparison, was developed by Reich and co-workers (Hanke et al., 1996). Similarity was expressed by converting sequences (domains, aligned sequences, segments of secondary structure) into a characteristic signal matrix (Hanke and Reich, 1996). Some understanding of the dynamic properties of a neural network system motivated an approach to simulating the dynamic properties of a protein (Liebovitch and Zochowski, 1997). Neural networks and proteins have several properties in common, such as many degrees of freedom and conflicting constraints on energy minimization, especially for energy functions with several local minima. As an example system, the two-state conformations of ion channel proteins were simulated by analyzing the dynamics of a Hopfield network with 100 nodes and two memories corresponding to the two states of an open and closed flow of ions. The distribution of the first passage times was used to characterize the dynamics of the network. A synchronous weight update led to a Hopfield network which was more consistent with the expected physical properties than an asynchronous update. As a result of the neural network approach, the authors conclude that a synchronous update of the position and velocity of one atom at a time is more meaningful in molecular dynamics simulations of proteins (Liebovitch and Zochowski, 1997).

4.1.5. Prediction of membrane-spanning sequences

Prediction of integral membrane protein topology from an anonymous amino acid sequence represents a special problem in protein structure prediction. In particular, the recognition of ion transporters is a difficult task since their membrane-spanning regions are very hydrophilic and cannot be detected by conventional hydrophobicity analysis (Kyte and Doolittle, 1982). Here artificial neural networks have proven to provide a useful alternative approach, although still about 5% of cytoplasmic proteins are erroneously predicted to have transmembrane segments (Rost, 1996). For a rapid search for single-spanning transmembrane regions a low-order neural network filter has been designed (Huang et al., 1996). The input window encompasses 50 amino acid residues, and the ideal number of hidden neurons was six. The performance of the optimized network resulted in a 98% probability of true-positive detection and 6% of false-positives. With one exception, all single-spanning membrane proteins were found. In addition, transmembrane regions of the immunoglobulin G-binding protein precursor were identified; even melittin was successfully analyzed, which raised problems for a helix-specific algorithm (Rost et al., 1995). Surprisingly, three out of five β-sheet membrane proteins could be identified by this filter system (Huang et al., 1996).
A commercially available neural network (BrainMaker) has been optimized for the identification of multispanning transmembrane regions (Dombi and Lawrence, 1994). Increasing the size of the training sets from 214 to 1751 sequences led to a non-linear increase of the learning success from 75 to 98%. The training sets consisted of nearly equal amounts of membrane-spanning and non-membrane-spanning sequences. Characteristic amino acid distributions inside and outside of the membrane region were derived from an analysis of the weight factors. As a general scheme, transmembrane sequences revealed a polar region near the inner lipid interface and a hydrophobic sequence spanning the length of the lipid bilayer. The network was tested for the prediction of transmembrane regions of the structurally known archaebacterial membrane protein bacteriorhodopsin. Since the network was not trained by the overlapping window technique, the amount of overpredictions (false-positives) was high, although the seven transmembrane helices were correctly assigned. A high prediction accuracy for the identification of transmembrane regions has been obtained with artificial neural networks in combination with sequence alignment procedures (Rost et al., 1995). Bp-network training was performed with experimentally verified transmembrane and non-transmembrane sequence data. The input window length was 13 amino acid residues, and the network output layer contained two neurons, one for each state of the central amino acid residue, namely inside or outside of the membrane helix. The information stemming from multiple sequence alignments (profiles) significantly contributed to the overall prediction accuracy. All transmembrane regions were identified correctly. From a rigorous cross-validation test on 69 proteins with experimentally determined locations of transmembrane segments, the accuracy per residue was estimated at 95%. From a negative control set less than 5% of the sequences were predicted as transmembrane helix. The tools were applied to the identification of transmembrane regions in 29 open reading frames of yeast chromosome VIII. At least two transmembrane helices were indicated for 59 proteins, corresponding to about 25% of all proteins on yeast chromosome VIII, and about 20% showed multiple transmembrane segments (Rost et al., 1995). In a further approach towards the identification of transmembrane regions, special care was taken in the preparation of training data (Fariselli and Casadio, 1996). Single non-homologous membrane proteins were used. The prediction accuracy for transmembrane helices was 78%, and topology prediction yielded 70%. Two neural networks were coupled, where the output of network 1 provided the input for network 2. Porins were not identified by the network since their transmembrane regions are in β-sheet conformation (Fariselli and Casadio, 1996). A more recent development is the program `TransMem' (Aloy et al., 1997), which yields a prediction accuracy in the range of 99.9% for the best and 71.7% for the worst of the proteins tested. Identification of transmembrane helices alone is not sufficient for the characterization of membrane proteins. Additional topology information is necessary to predict protein function. An improvement of the method developed by Rost and co-workers (Rost et al., 1995) included the combination of several different filter systems as well as the inclusion of expert knowledge (Rost et al., 1996).
The observation that positively charged amino acids prefer to be located on the cytoplasmic side, the `positive-inside rule', provided valuable information (von Heijne and Gavel, 1988; Sipos and von Heijne, 1993). In a dynamic-programming-like procedure the output values of the network were combined, leading to a slight improvement of the prediction of
transmembrane regions and their topology. A comparison of the prediction accuracy for eukaryotic, prokaryotic and viral transmembrane proteins and their topology revealed a slightly better result for eukaryotic membrane proteins than for the others. A special ANN for the detection of membrane-spanning regions in amino acid sequences has been developed in our group (Lohmann et al., 1994, 1996). The idea was to predict membrane/non-membrane borders in the amino acid sequences rather than complete membrane-spanning segments, and to optimize both network architecture and residue encoding by an evolutionary algorithm. In contrast to most of the other network applications, a set of physicochemical properties provided possible residue descriptors. The evolutionary algorithm applied performs a simultaneous optimization of both the architecture and the connection weights of the neural filter system. It consists of the following steps (a toy sketch of this scheme is given after the list):

1. Generation of an initial set of network structures.
2. Assignment of separate populations.
3. Optimization of network weights in each population during isolation time.
4. Selection among the network architectures on the level of populations.
5. Variation of the best architectures and replacement of the worst.
6. Start of the next cycle at step 3 with the partly new set of architectures.
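The following toy sketch illustrates this six-step scheme with a single masked linear unit standing in for the full filter network; population size, isolation time, mutation rates and the synthetic data are arbitrary illustrative choices, not the settings used by Lohmann et al.

```python
import random

def fitness(weights, mask, data):
    """Fraction of labelled windows separated correctly by one masked linear unit."""
    correct = 0
    for x, label in data:
        s = sum(w * xi for w, xi, m in zip(weights, x, mask) if m)
        correct += int((s > 0) == label)
    return correct / len(data)

def evolve(data, n_inputs, n_populations=4, isolation=20, cycles=10):
    # Steps 1 and 2: initial architectures (connection masks), one per population.
    pops = [{"mask": [random.random() < 0.7 for _ in range(n_inputs)],
             "weights": [random.uniform(-1.0, 1.0) for _ in range(n_inputs)]}
            for _ in range(n_populations)]
    for _ in range(cycles):
        # Step 3: optimize the weights within each population during the isolation time.
        for p in pops:
            for _ in range(isolation):
                trial = [w + random.gauss(0.0, 0.1) for w in p["weights"]]
                if fitness(trial, p["mask"], data) >= fitness(p["weights"], p["mask"], data):
                    p["weights"] = trial
        # Step 4: selection among the architectures on the level of populations.
        pops.sort(key=lambda p: fitness(p["weights"], p["mask"], data), reverse=True)
        # Step 5: vary the best architecture and replace the worst one.
        best = pops[0]
        mutant = [m if random.random() > 0.1 else not m for m in best["mask"]]
        pops[-1] = {"mask": mutant, "weights": list(best["weights"])}
        # Step 6: the next cycle continues with the partly new set of architectures.
    return pops[0]

# Synthetic example: 40 seven-dimensional 'windows'; the label depends on two inputs
# only, so a useful architecture should keep (at least) those two connections.
random.seed(1)
data = [(x, x[0] + x[3] > 0.0)
        for x in ([random.uniform(-1.0, 1.0) for _ in range(7)] for _ in range(40))]
best = evolve(data, n_inputs=7)
```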
In Fig. 11 the architecture of an optimized network is shown. Seven amino acid properties were used as network input. It turned out that for feature extraction from membrane/non-membrane borders, electronic properties (polarity, hydrophobicity) seem to be more important than steric properties (bulkiness, surface area). This result substantiates earlier findings (von
Fig. 11. Architecture of a topology-optimized network for prediction of membrane-spanning parts of amino acid sequences. The number of hidden layer units and the connectivity of input and hidden units were subjected to evolutionary optimization. Sequence windows encompassing 13 residues were encoded by up to seven physicochemical properties (1: hydrophobicity (Engelman et al., 1986); 2: volume (Zamyatnin, 1972); 3: surface area (Chothia, 1975); 4: hydrophilicity (Hopp and Woods, 1981); 5: bulkiness; 6: refractivity; 7: polarity (Jones, 1975)). After network optimization, property 5 (bulkiness) remained unconnected.
Heijne and Gavel, 1988; von Heijne, 1989). The overall prediction accuracy for a test set of membrane proteins was 92%.
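As a counterpart to the unitary coding shown earlier, the property-based residue encoding used here can be sketched as follows; the numerical values in the table are placeholders, not the published scales cited in Fig. 11.

```python
# Hypothetical residue property table (three placeholder properties per residue);
# a real application would use the seven published scales listed in Fig. 11.
PROPERTY_TABLE = {
    "A": (0.3, 0.2, 0.1), "L": (0.9, 0.6, 0.0), "K": (-0.8, 0.7, 0.9),
    "S": (-0.1, 0.2, 0.5), "F": (0.8, 0.8, 0.1), "D": (-0.9, 0.3, 1.0),
    # ... the remaining residues would be filled in from the chosen scales
}

def encode_window_properties(window_seq, table=PROPERTY_TABLE):
    """Encode a sequence window by per-residue physicochemical properties.

    A 13-residue window with seven properties per residue gives a 91-dimensional
    input vector; only three placeholder properties are used here for brevity.
    """
    vec = []
    for residue in window_seq:
        vec.extend(table[residue])
    return vec

print(encode_window_properties("ALKS"))   # 4 residues x 3 properties = 12 values
```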
4.1.6. Identification of protein targeting sequences

The intracellular transport of newly synthesized proteins is regulated by well-defined targeting signals. Targeting signals of secreted proteins, membrane proteins, and most of the nuclear-encoded organelle proteins are often encoded in N-terminal extensions of the protein (Pugsley, 1989; von Heijne, 1994a; Claros et al., 1997). After or during protein translocation across a compartment membrane, the targeting signal is removed by specific signal peptidase activity, releasing the mature protein (von Heijne, 1994b). Although sequence identity between targeting sequences and their cleavage sites is very low, it is possible to extract relevant features by neural network systems (von Heijne, 1983, 1985; Schneider and Wrede, 1993a; Nielsen et al., 1997). Prediction systems for both targeting signals and signal peptidase recognition sites have been developed. A first attempt towards the identification of cleavage site sequences yielded a prediction accuracy of 97% in an independent test set (Schneider and Wrede, 1992, 1993a, 1994a). Amino acid sequences were encoded by selected physicochemical properties, and the network was trained by an evolutionary algorithm. The filter system was successfully used as a fitness function for the de novo design of idealized signal peptidase target sites (Schneider and Wrede, 1994a; Schneider et al., 1995a; Wrede et al., 1998). A recent successful study using neural networks for the identification of secretory signal peptides and signal anchors has been performed by the group of von Heijne and co-workers (Nielsen et al., 1997). These authors used bp-networks lacking hidden units or containing one layer of 2 to 10 hidden units. Sequence data were presented to the network in conventionally encoded moving windows (Qian and Sejnowski, 1988). The main emphasis was put on the collection of a proper training set for the prediction of cleavage sites against the background of all other sequences and the classification of amino acids as belonging to the signal peptide or not. For the first task the `C-score' and for the second the `S-score' were calculated by the networks. An elegant combination of the S- and C-scores led to the `Y-score', which proved to provide useful predictions for sequences of gram-negative bacteria. For eukaryotic and gram-positive bacterial sequences the prediction accuracy was lower. Application of the prediction system to the whole genome of Haemophilus influenzae yielded 14% potentially secreted proteins (Nielsen et al., 1997). Recently an analysis using Kohonen-networks for clustering mitochondrial targeting signals has been performed (Schneider et al., 1998). Sequences were encoded by physicochemical properties. It turned out that there are three major classes of import signals which can be characterised by the occurrence of arginine in one of the positions −10, −3 or −2 relative to the signal peptidase processing site, thereby substantiating previous statistical findings (Gavel and von Heijne, 1990). These three cleavage site classes do not seem to be species-specific. Prediction of mitochondrial signal peptidase cleavage site positions by supervised networks failed due to a large amount of over-prediction; it is concluded that additional non-local sequence information is required to recognize the experimentally observed cleavage site among the set of candidates. Classification of N-terminal sequence parts by unsupervised neural networks might help to decipher the relevant features (Schneider and Broger, 1998).
4.1.7. Peptide–protein binding

Many reactions in the living cell are carried out via ligand–protein interactions. In particular, the immune system is based on the recognition of many different peptides with the aim of distinguishing self from non-self molecules. Prediction of peptide binding to MHC I without knowing the three-dimensional ligand structure is a typical task in drug design. A recent work by Gulukota et al. (1997) describes a comparison of two different methods, neural networks and a polynomial method, for predicting the specificity and sensitivity of peptide binding to HLA-A.2.1. It turned out that neural networks were superior to the applied statistical method. Several performance measures were determined: specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV). Sensitivity and PPV increased with the readiness to accept weak binders. For specificity and NPV the polynomial method led to slightly better values than the network approach. To test whether the peptide training set was representative of the problem, the ANN were trained with increasing numbers of training data. There was no increase in specificity compared to a small data set, but sensitivity increased significantly.
4.1.8. Identification of glycosylation sites

A large number of eukaryotic proteins contain glycosylated serine or threonine residues. Glycosylation patterns are important for cell–cell recognition in tissues and for cell development (Fukuda, 1991; Hart, 1992; Muramatsu, 1993). The enzyme reaction mechanism catalyzing the covalent linkage between the OH-group of the amino acid and the sugar moiety N-acetylgalactosamine (GalNAc) is still unknown. Identification of the glycosylation sites of a protein can be done chemically by Edman degradation, which is very time-consuming. Since many protein primary structures are derived from DNA sequences, prediction tools for the identification of O-glycosylation sites can be useful. There are only two reports applying artificial neural networks to GalNAc transferase reaction sites (Hansen et al., 1995; Cai et al., 1997). A report by Brunak and co-workers (Hansen et al., 1995) gives a detailed description of the statistical analysis and development of several artificial neural network types as O-glycosylation site predictors. Special care was taken to assemble a proper training data set based on experimentally verified O-glycosylation sites and low sequence identities. Several supervised neural networks with a single hidden layer containing 0–15 hidden units were trained by conventional bp. The sequence windows were symmetric, ranging from 3 to 49 residues. Amino acid sequences were represented as bit strings. The acceptor specificity of the GalNAc-transferase prefers threonine, serine and proline upstream of position −4 and downstream of position +4. This pattern was not found in non-glycosylated examples. Cysteines and tryptophans adjacent to the glycosylation sites, i.e. in positions −2 to +2 or −1 to +1, were not found. Most glycosylation sites were observed in the N-terminal part of the proteins. The prediction accuracy for glycosylated threonine is higher than for glycosylated serine. For a `threonine net' the optimal window size was 17 residues, yielding a Matthews correlation coefficient of cc = 0.42 in a low-similarity test set and cc = 0.88 in a high-similarity test set. For the `serine net' correlation coefficients of cc = 0.33 and cc = 0.7 were found, respectively. Compared to other statistical analyses such as weight matrices, the neural networks showed the best prediction performance.
4.2. Neural networks as fitness functions in drug design

Besides target identification and prediction of protein structure, the predominant current applications of ANN in the drug development process are: (i) automatic clustering of compound libraries by self-organizing networks (Kohonen and related approaches, ART systems), and (ii) SAR modelling employing supervised techniques (bp-networks, RBF networks). The first successful attempts at self-determining compound classification according to molecular activity by Kohonen-networks were performed by Gasteiger and co-workers (Zupan and Gasteiger, 1993; Gasteiger and Li, 1994; Gasteiger et al., 1994). They used the electrostatic surface potentials of molecules as the basis for classification. This approach has been further developed since then (Bauknecht et al., 1996; Barlow, 1995; Holzgrabe et al., 1996; Anzali et al., 1997; Polanski, 1997). Independent of the choice of molecular representation, the basic idea of using UNN is to generate one or more clusters containing the `active' compounds, and other clusters containing the rest of the molecules in the initial data set. This approach was also used to estimate the diversity of compound libraries (Sadowski et al., 1995). The clusters generated by the UNN are characterised by the respective features extracted, i.e. the associated weight vectors. The weight vectors can be used as feature templates for further in silico data base screening. This can drastically reduce the screening effort needed to obtain lead compounds. For example, one might start off with the construction of a small compound library by a simple combinatorial approach. This limited set of molecules is tested for a desired activity, e.g. enzyme inhibition (Ki or IC50 measurements), and in parallel the data set is clustered by a self-organizing network. If the majority of the inhibitory compounds form a single cluster or a group of neighboring clusters in the neuron map, the associated weight vectors serve as a template for data base screening. The resulting hits are those data base entries that are classified by the network as belonging to one of the `active' clusters containing the already known inhibitors. This example demonstrates one way in which high-throughput screening and ANN can complement each other. We followed this approach for the identification of potential new thrombin inhibitors in a large data base: to construct an initial small data set, 409 four-component Ugi reactions were performed in a combinatorial approach, and the products were tested for their ability to inhibit thrombin enzymatic activity (Weber et al., 1995). For feature extraction by a Kohonen-network each compound was described by the cross-correlation of hydrogen-bond donors, hydrogen-bond acceptors, lipophilic and charged groups (cf. Section 3). The resulting clustering in the output layer of the network is shown in Fig. 12. White neurons contain only compounds with Ki > 10 μM (defined as `inactive'); the darker a neuron is drawn, the more Ugi products with Ki ≤ 10 μM (defined as `active' inhibitors) are grouped together. Neuron (6/3) contains the largest fraction of inhibitors, and the weights of neuron (6/3) were used to search the MedChem data base (version 1997, as distributed by Daylight Chemical Information Systems, Irvine), containing over 33,000 compounds, for new potential inhibitors.
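In outline, such a weight-template screen amounts to assigning every data base compound to its best-matching neuron and keeping those that fall into the `active' cluster. The sketch below assumes a trained Kohonen map of the kind outlined earlier and hypothetical descriptor vectors; it is an illustration of the idea, not the in-house implementation referred to here.

```python
import numpy as np

def screen_database(weights, active_neurons, descriptors):
    """Keep data base entries whose best-matching neuron lies in an 'active' cluster.

    weights         -- (rows, cols, dim) array of a trained Kohonen map
    active_neurons  -- set of (row, col) grid positions enriched in known actives
    descriptors     -- (n_compounds, dim) correlation vectors of the data base entries
    """
    hits = []
    for idx, x in enumerate(descriptors):
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = tuple(int(v) for v in np.unravel_index(np.argmin(dists), dists.shape))
        if bmu in active_neurons:
            hits.append(idx)
    return hits

# e.g. with a 7 x 7 map and neuron (6, 3) as the 'active' cluster:
# hits = screen_database(weights, {(6, 3)}, database_descriptors)
```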
Many known thrombin inhibitors were assigned to neuron (6/3); the highest ranking known inhibitor was PPACK, which binds irreversibly to the enzyme (Kettner and Shaw, 1981; for a discussion of thrombin active site features, see Grootenhuis and Karplus, 1996). Network development and the computer-based data base search took only a few minutes using Roche in-house software. An
Fig. 12. A planar 7 × 7 Kohonen map of thrombin inhibitors. Neurons are drawn as squares. Four-component Ugi products were clustered by a Kohonen-network. The localization of compounds with Ki