A dynamic architecture for artificial neural networks



Neurocomputing 63 (2005) 397–413 www.elsevier.com/locate/neucom

M. Ghiassi (a), H. Saidane (b)

(a) Operations & Management Information Systems, Santa Clara University, Santa Clara, CA 95053-0382, USA
(b) Data Mining Consultant, 12441 Shropshire Ln., San Diego, CA 92128, USA

Received 7 March 2003; received in revised form 10 February 2004; accepted 24 March 2004. Communicated by T. Heskes. Available online 20 August 2004.

Abstract

Artificial neural networks (ANN) have been shown to be an effective, general-purpose approach for pattern recognition, classification, clustering, and prediction. Traditional research in this area uses a network with a sequential, iterative learning process based on the feed-forward, back-propagation algorithm. In this paper, we introduce a model that uses a different architecture than the traditional neural network to capture and forecast nonlinear processes. This approach utilizes the entire observed data set simultaneously and collectively to estimate the parameters of the model. To assess the effectiveness of this method, we have applied it to a marketing data set and a standard benchmark from the ANN literature (Wolf's sunspot activity data set). The results show that this approach performs well when compared with traditional models and established research.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Artificial neural networks; Forecasting; Feed-forward back-propagation networks

1. Introduction

Artificial neural networks (ANN) have been shown to be an effective, general-purpose approach for pattern recognition, classification, clustering, and prediction.



Researchers have applied ANN methods to a wide variety of fields, ranging from science to business and engineering. The most commonly used approach to ANN learning is the "feed-forward back-propagation" (FFBP) algorithm, first introduced in [17,18]. An ANN architecture is generally described as an arrangement of interconnected nodes organized into three groups. The first group consists of the input nodes that accept external information into the model. The second group forms the "hidden layers" of the network. Each hidden layer is a set of nodes that receive their input from the previous layer, perform some processing (combination and transfer), and feed an output to the nodes in the next layer. The third group represents the system output; node(s) in this set process the transferred values from the last hidden layer (Fig. 1). Most ANN models use a multilayer network with one or more hidden layers.

The parameters of the model, such as the choice of input nodes, the number of hidden layers, the number of hidden nodes in each hidden layer, and the form of the transfer functions, are problem dependent and often require trial and error to find the best model for a particular application. The choice of the optimization algorithm and guidelines for selecting parameter values that yield a better model have been studied and reported upon extensively in the literature [1,7,11,13,15,24]. Studies by Chester [4] and Zhang [27] suggest that the ideal number of hidden layers in FFBP architectures is often one or two. However, there are fewer guidelines on the optimal number of hidden nodes for each hidden layer, and researchers often need to experiment with varying numbers of hidden nodes in order to define this parameter. Some have suggested n/2 [14], n [21], 2n [23], or 2n+1 [8] nodes for each hidden layer, where n is the number of input nodes. Methods for examining and reducing the number of input and hidden nodes have also been studied [5,16].

ANN models are capable of dealing with linear, near-linear, or nonlinear functional relationships [12], although they have been shown to be most successful with nonlinear processes [25]. One major application of ANN is in the area of forecasting [19]. Many researchers have applied ANNs to both linear and nonlinear forecasting problems [22,9]. Zhang [25] provides an excellent review of the application of ANN to forecasting.

Fig. 1. A traditional neural network model with two hidden layers.
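For readers who want the conventional baseline in concrete form, the following minimal NumPy sketch builds the kind of single-hidden-layer feed-forward network described above, sizing the hidden layer with the 2n+1 rule of thumb. The layer sizes, sigmoid transfer, and random weights are illustrative assumptions on our part, not a specification from this paper, and the back-propagation weight updates are omitted.

```python
import numpy as np

def sigmoid(z):
    # Conventional logistic transfer function used in FFBP networks.
    return 1.0 / (1.0 + np.exp(-z))

def ffbp_forward(X, rng=np.random.default_rng(0)):
    """Forward pass of a single-hidden-layer network with 2n+1 hidden nodes.

    X is an (observations x n inputs) matrix; the weights are random
    placeholders that back-propagation would normally adjust record by record.
    """
    n = X.shape[1]
    h = 2 * n + 1                      # one rule-of-thumb hidden-layer size
    W1 = rng.normal(size=(n, h))       # input -> hidden weights
    b1 = np.zeros(h)
    W2 = rng.normal(size=(h, 1))       # hidden -> output weights
    b2 = np.zeros(1)
    hidden = sigmoid(X @ W1 + b1)      # combination + transfer at the hidden layer
    return hidden @ W2 + b2            # single output node

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(5, 4))  # 5 records, 4 inputs
    print(ffbp_forward(X).shape)       # (5, 1)
```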

ARTICLE IN PRESS M. Ghiassi, H. Saidane / Neurocomputing 63 (2005) 397–413

399

In this paper, we present a new approach to ANN that differs significantly from traditional FFBP models. In Section 2, we present the architecture of this model and discuss its properties. The model's learning algorithm and training process are discussed in Sections 3 and 4. Section 5 presents model properties and Section 6 discusses computational results. Conclusions are presented in Section 7.

2. Model architecture

We present a dynamic architecture for artificial neural networks (DAN2) that is designed to cope with the major challenges facing neural network modeling in capturing and forecasting nonlinear processes. The general philosophy of this model is based on the principle of learning and accumulating knowledge at each layer, propagating and adjusting this knowledge forward to the next layer, and repeating these steps until the desired network performance criteria are reached. We therefore classify DAN2 as a purely feed-forward model.

As in classical neural networks, the DAN2 architecture comprises an input layer, one or more hidden layers, and one output layer. The input layer accepts external data into the model. Once the input nodes have been identified, all observations are used simultaneously to train the network. To guard against the problem of differing dimensional metrics, we scale the data using a normalization approach. The next modeling decision is the choice of the number of hidden layers and hidden nodes. In this model, unlike classical neural nets, the number of hidden layers is not fixed a priori; layers are sequentially and dynamically generated until a desired level of performance accuracy is reached. Additionally, the proposed approach uses a fixed number of hidden nodes (four) in each hidden layer. This structure is not arbitrary, but justified by the estimation approach. At each hidden layer, the network is trained using all observations in the training set simultaneously, so as to minimize the training SSE or MSE.

As depicted in Fig. 2, each hidden layer consists of four nodes. The first node is merely a constant (e.g. 1) input node, referred to as the C node. The second node is a function that encapsulates the "current accumulated knowledge element" acquired during the previous training step, and is referred to as a "CAKE" node. The third and fourth nodes represent the current residual (remaining) nonlinear component of the process, captured through a transfer function using a weighted and normalized sum of the input variables. Such nodes represent the "current residual nonlinear element" and will be referred to as the "CURNOLE" nodes. A CAKE node receives its input from all nodes in the previous layer. The CURNOLE nodes, on the other hand, receive input from only the previous layer's CURNOLE nodes. Finally, the output layer receives its input from all nodes in the last hidden layer.

The training process begins with a special layer in which the CAKE node captures the linear component of the input data. Thus, its input (content) is a linear combination (weighted sum) of the input variables and a constant input node. These weights are easily obtainable through classical linear regression. If the desired level of


Fig. 2. The DAN2 network architecture.

accuracy is reached, we can conclude that the relationship is linear and the training process stops. For nonlinear relations, additional hidden layers are required. At each subsequent layer, the input to the CAKE node is a weighted sum (linear combination) of the previous layer's CAKE, CURNOLE, and C nodes. For each hidden layer, the parameters in the CURNOLE nodes and their respective weights are designed to capture as much of the remaining (nonlinear) part of the process as possible. Throughout training, the CAKE nodes carry an adequate portion of the learning achieved in previous layers forward. This process ensures that the performance or knowledge gained so far is adjusted and improved but not lost. The CURNOLE node parameters are determined through either a comprehensive search or a nonlinear optimization process. The weights associated with the CURNOLE nodes are easily obtained through linear regression using three independent variables (one CAKE and two CURNOLE).

The resulting proposed architecture is illustrated in Fig. 2. In this network, the "I" node represents the input, the "C" nodes are the constant nodes, the "G_k" and "H_k" nodes represent CURNOLE nodes, and the "F_k" nodes are the CAKE nodes. The final CAKE node represents the dependent variable, or the output. As in classical neural nets, in this architecture nodes perform both the combination and transfer functions.

After capturing the linear component of the input data in the first CAKE node, the algorithm transforms the input data set to model the nonlinearity of the process in subsequent iterations. DAN2 uses a vector projection approach to perform this data transformation. The transformation process defines a reference vector R = {r_j, j = 1, 2, ..., m} and projects each observation record onto this vector to normalize the data, as discussed in Section 3. This normalization defines an angle, α_i, between record i and the reference vector R. DAN2 uses the set of α_i's to train the network, and updates their values at each iteration. In Section 3,


we show that this normalization can be represented by the trigonometric function cos(μ_k α_i + θ). At each layer of the architecture, we revise (μ_k α_i + θ) and measure the impact of this change on the output value. The modification of the angle (μ_k α_i + θ) is equivalent to rotating and shifting the reference vector, thus changing the impact of the projected input vectors and their contribution to the output for that iteration.

Using trigonometric transfer functions to capture a nonlinear process is analogous to using a "generalized" Fourier series for function estimation, which has been shown to be an effective kernel for approximating nonlinear functions [2]. The generalized form of the Fourier series uses many terms (similar to harmonics). The number of terms in the series depends on the complexity and the degree of nonlinearity of the process. Similar to the Fourier series, in every layer DAN2 introduces a cos(μ α_i + θ) function to capture the remaining nonlinearity. In this function, α_i represents the transformed and normalized input variable, μ represents a constant multiplier, and θ represents a constant shift. This formulation uses two (nonlinear) parameters, μ and θ. The latter can be replaced through the use of two trigonometric functions of the form

   A cos(μ α_i) + B sin(μ α_i).

We use this functional form as the transfer function in our model (the underlying trigonometric identity is written out at the end of this section). The two CURNOLE nodes in Fig. 2 represent this formulation, and are used to reduce the number of nonlinear parameters from two to one, making the estimation of the resulting nonlinear formulation easier. At any given layer, if the cos(μ α_i + θ) terms captured in previous layers do not adequately express the nonlinear behavior of the process, a new layer with an additional set of nodes, including a new cos(μ α_i + θ) term, is automatically generated. This process is analogous to how the Fourier series adds new terms to improve function approximation. Thus, the number of layers in the DAN2 architecture is dynamically defined and depends upon the complexity of the underlying process and the desired level of accuracy.

This architecture differs from the traditional multi-layer ANN model in a number of ways. The first difference is in the utilization of the input records. Traditionally, input records are processed one at a time. DAN2 uses the entire set of records simultaneously and repeatedly at every layer. This global view of the process provides a training environment that ensures monotonically increasing learning.

The second difference is in the choice of the transfer function. DAN2 uses the trigonometric cosine function instead of the traditional sigmoid function to capture the nonlinearity of the process. Estimation of the nonlinear component is partitioned into successive layers. At each layer the transfer function introduces only one nonlinear parameter, which is optimally or experimentally determined. This partitioning reduces the solution complexity for each layer, while maintaining the knowledge acquired in previous layers.

The third difference is in the structure and number of hidden layers. In traditional models, the number of hidden layers and hidden nodes is experimentally determined. In our model, the number of hidden nodes is fixed at four. The model determines the


number of hidden layers dynamically to achieve the specified performance criteria. Model structure selection is a major challenge for ANN researchers. Traditional methods require the modeler to define and determine the right input variables, the correct number of layers, and the optimal number of hidden nodes in each layer. Our model reduces this challenge to only one decision: the selection of input variables. Once inputs are selected, DAN2 minimizes a measure of network performance (such as SSE or MSE) by introducing hidden layers dynamically. Final model selection also uses the stopping rules discussed in Section 4, which are applied to avoid model under- or over-fitting.

Finally, network connectivity in traditional ANN uses many-to-many relationships among interior nodes. This connectivity requires many arcs, which can result in a complex architecture. The DAN2 architecture reduces this complexity by allowing many-to-one relationships only. Since each layer uses four nodes and has fewer arcs, the computational requirements at each layer are reduced. In the next section, we present the learning algorithm that is employed in this architecture.
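Before turning to the algorithm, it may help to write out the standard trigonometric identity behind the CURNOLE pair referred to above. This annotation is ours; the amplitude C is shown explicitly:

```latex
C\cos(\mu\,\alpha_i + \theta)
  \;=\; \underbrace{C\cos\theta}_{A}\,\cos(\mu\,\alpha_i)
  \;+\; \underbrace{(-C\sin\theta)}_{B}\,\sin(\mu\,\alpha_i),
\qquad
C=\sqrt{A^{2}+B^{2}},\quad \theta=\operatorname{atan2}(-B,\,A).
```

Only μ remains as a nonlinear parameter; A and B (the weights of the two CURNOLE nodes) enter linearly and can therefore be estimated by ordinary least squares.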

3. The dynamic learning algorithm

Define the input matrix X = {X_i, i = 1, 2, ..., n} as n independent records of m attributes, and let X_i = {x_ij, j = 1, 2, ..., m}. Also define a "reference" vector R = {r_j, j = 1, 2, ..., m} as a vector of constant values. The algorithm steps are as follows:

1. The initial linear layer:

   F_0(X_i) = a_0 + Σ_j b_0j x_ij.                                          (1)

2. Subsequent hidden layers' CAKE node (at iteration k):

   F_k(X_i) = a_k + b_k F_{k-1}(X_i) + c_k G_k(X_i) + d_k H_k(X_i).         (2)

3. The CURNOLE nodes' input and transfer function at iteration k (k = 1, 2, ..., K, where K is the maximum number of sequential iterations, i.e. of hidden layers) are defined as follows:
   a. Specify a random set of m constants representing the "reference" vector R (default: r_j = 1 for all j = 1, 2, ..., m).
   b. For each input record X_i, compute the scalar product (R · X_i) = Σ_j r_j x_ij.
   c. Compute the lengths (norms) of the vector R and of each record vector X_i: ||R|| = sqrt(Σ_j r_j^2) and ||X_i|| = sqrt(Σ_j x_ij^2).
   d. Normalize (R · X_i) to compute (R · X_i)_N = (R · X_i)/(||R|| ||X_i||). Recall that (R · X_i) = ||R|| ||X_i|| cos{angle(R, X_i)}; thus cos{angle(R, X_i)} = (R · X_i)/(||R|| ||X_i||) = (R · X_i)_N.
   e. Compute angle(R, X_i) = arccos{(R · X_i)_N} = α_i for i = 1, 2, ..., n.


   f. Compute the transferred nonlinear components of the signal as G_k(X_i) = cos(μ_k α_i) and H_k(X_i) = sin(μ_k α_i), where μ_k is a constant multiplier for iteration k.
   g. Replacing G_k(X_i) and H_k(X_i) in Eq. (2) results in

      F_k(X_i) = a_k + b_k F_{k-1}(X_i) + c_k cos(μ_k α_i) + d_k sin(μ_k α_i).   (3)
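A minimal NumPy sketch of steps 3(a)–(g) follows. The function names, the degree-based angle convention, and the default reference vector of ones are our assumptions for illustration; they are not prescribed by the paper.

```python
import numpy as np

def dan2_angles(X, R=None):
    """Steps 3(a)-(e): angle alpha_i (in degrees) between each record X_i and R."""
    X = np.asarray(X, dtype=float)
    R = np.ones(X.shape[1]) if R is None else np.asarray(R, dtype=float)
    cos_a = (X @ R) / (np.linalg.norm(X, axis=1) * np.linalg.norm(R))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def curnole(alpha, mu):
    """Step 3(f): transferred nonlinear components G_k and H_k."""
    a = np.radians(mu * alpha)          # mu_k * alpha_i, converted for cos/sin
    return np.cos(a), np.sin(a)

def cake(F_prev, G, H, coeffs):
    """Eq. (3): F_k = a_k + b_k F_{k-1} + c_k G_k + d_k H_k."""
    a_k, b_k, c_k, d_k = coeffs
    return a_k + b_k * F_prev + c_k * G + d_k * H
```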

4. The training process

The training process begins with a linear regression phase to extract the linear component of the process, if any. To compute the nonlinear component of the process, the algorithm uses the formulation in Eq. (3) at each hidden layer. This formulation is linear in the parameter set A_k, where A_k = {a_k, b_k, c_k, d_k}, and nonlinear in the parameter μ_k. This suggests a training strategy composed of two stages: a linear stage and a nonlinear stage. In the linear stage, simple ordinary least squares (OLS) is used to estimate the parameter set A_k for a given value of μ_k. The nonlinear stage requires a search in the nonlinear space of μ_k (0 ≤ μ_k ≤ μ_max).

The product (μ_k α_i) defines an angle between the observation record X_i and the reference vector R. The angle and the corresponding trigonometric functions perform the combination and transfer, and are thereby used to measure a generalized distance between each observation vector X_i and the reference vector R. The objective at each layer is to minimize the SSE as represented by these distances. Therefore at each layer, similar to OLS, the vector R is rotated and shifted so as to minimize the resulting total error. These rotations and shifts are represented in the manipulation of μ_k. The search for μ_k values alters the product (μ_k α_i) and changes the contribution of the input vector X_i to the output through the G_k(X_i) and H_k(X_i) components. Once the best value of μ_k for layer k is determined, the stopping rules are examined. If the stopping criteria are met, training terminates; otherwise, a new layer is introduced and the process repeats.

As in classical neural net models, the objective of training is to find the parameter values that capture an adequate level of learning using a measure of model fit, such as SSE, MSE, or R². Similar to traditional methods, we use one data set for training and validation (the in-sample data set) and a different data set for testing (the out-of-sample data set). The training is performed one layer at a time, using all training observations simultaneously. If the performance of the model is deemed satisfactory at the current layer, the training stops. As previously mentioned, this layer-centered training is decomposed into a linear and a nonlinear stage to increase the efficiency of solving highly nonlinear equations. The entire training data set is collectively used in both the linear and nonlinear stages. The linear stage uses OLS. For the nonlinear stage, we propose four different solution strategies.

The first strategy uses a comprehensive grid line search over the range of μ_k to find the value that results in the best fit. This is a simple but exhaustive search in the μ_k space that could be computationally intensive if the μ_k range is wide and the grid granularity is small. However, exhaustive searches have a higher chance of finding the global optimum.
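A sketch of this first (grid-search) strategy, reusing the helper functions above: for each candidate μ_k the linear stage is an OLS fit of {a_k, b_k, c_k, d_k}, and the μ_k with the smallest training SSE is retained. The grid range and granularity shown are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_layer_grid(F_prev, alpha, y, mu_grid):
    """One DAN2 hidden layer: grid search over mu_k, OLS for (a_k, b_k, c_k, d_k)."""
    best = None
    for mu in mu_grid:
        a = np.radians(mu * alpha)
        # Design matrix of Eq. (3): constant, previous CAKE output, two CURNOLE terms.
        D = np.column_stack([np.ones_like(alpha), F_prev, np.cos(a), np.sin(a)])
        coeffs, *_ = np.linalg.lstsq(D, y, rcond=None)        # linear (OLS) stage
        sse = float(np.sum((D @ coeffs - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, mu, coeffs)
    return best                                               # (SSE_k, mu_k, A_k)

# Example call with an illustrative grid; an upper bound mu_max is derived in Section 5:
# sse_k, mu_k, A_k = fit_layer_grid(F_prev, alpha, y, np.linspace(0.01, 20.0, 500))
```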


The second strategy uses a derivative-based optimization approach for minimizing SSE with respect to the A_k and μ_k variables. This approach starts with a given value of μ_k and uses OLS to estimate the parameters in A_k. Next, these estimated values are used to minimize SSE with respect to μ_k. This step requires finding the zeros of the nonlinear trigonometric equation (4) in μ_k. At each iteration k, we define

   SSE_k = Σ_i [F_k(X_i) − F̂(X_i)]²,

where F_k(X_i) is defined by Eq. (3) and the F̂(X_i) are the observed values. To obtain the minimum SSE_k we set the derivative with respect to μ_k to zero (dropping the constant factor of 2, which does not affect the zeros):

   d(SSE_k)/dμ_k = d/dμ_k Σ_i [a_k + b_k F_{k−1}(X_i) + c_k cos(μ_k α_i) + d_k sin(μ_k α_i) − F̂(X_i)]²

      = Σ_i [(−c_k α_i sin(μ_k α_i) + d_k α_i cos(μ_k α_i))
             × (a_k + b_k F_{k−1}(X_i) + c_k cos(μ_k α_i) + d_k sin(μ_k α_i) − F̂(X_i))]

      = Σ_i [(F̂(X_i) − a_k − b_k F_{k−1}(X_i)) (c_k α_i sin(μ_k α_i) − d_k α_i cos(μ_k α_i))
             − c_k² α_i sin(μ_k α_i) cos(μ_k α_i) + d_k² α_i sin(μ_k α_i) cos(μ_k α_i)
             + c_k d_k α_i cos²(μ_k α_i) − c_k d_k α_i sin²(μ_k α_i)].

Using sin(2A) = 2 sin(A) cos(A) and cos(2A) = cos²(A) − sin²(A), we get

   d(SSE_k)/dμ_k = Σ_i [(F̂(X_i) − a_k − b_k F_{k−1}(X_i)) (c_k α_i sin(μ_k α_i) − d_k α_i cos(μ_k α_i))
             + (d_k² − c_k²) α_i sin(2 μ_k α_i)/2 + c_k d_k α_i cos(2 μ_k α_i)] = 0.          (4)

We use the bisection method to solve this equation and find candidate values of μ_k. We then use the second derivative to identify the candidates that are minima and, from this set, choose the μ_k that yields the smallest SSE_k. The process alternates between the linear and nonlinear steps until the desired level of accuracy or fit is achieved.

The third strategy uses a non-derivative-based optimization method for solving the nonlinear stage in the single variable μ_k. Efficient one-dimensional optimization methods, such as the golden-section search, the DSC-Powell method and others [10], can be used to obtain the best value of μ_k. Brent [3] presents a one-dimensional optimization method that performs well on the class of Fourier kernels similar to Eq. (3). The solution found by this approach, however, is not guaranteed to be the global optimum.

Finally, a more direct approach is a one-stage, simultaneous estimation of the A_k and μ_k parameters. This can be done using unconstrained optimization algorithms such as the gradient or conjugate gradient methods, the Fletcher–Reeves method, or Newton methods [10]. The nonlinear least-squares method can also be used to estimate these parameters.
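As an illustration of the third (non-derivative) strategy, the sketch below wraps the same OLS inner step in SciPy's bounded scalar minimizer (a Brent-style method). This is our own sketch of one possible implementation; as noted above, it returns a local, not necessarily global, optimum.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def layer_sse(mu, F_prev, alpha, y):
    """SSE of Eq. (3) after the OLS (linear) stage, viewed as a function of mu alone."""
    a = np.radians(mu * alpha)
    D = np.column_stack([np.ones_like(alpha), F_prev, np.cos(a), np.sin(a)])
    coeffs, *_ = np.linalg.lstsq(D, y, rcond=None)
    return float(np.sum((D @ coeffs - y) ** 2))

def fit_layer_bounded(F_prev, alpha, y, mu_max):
    """Nonlinear stage via one-dimensional bounded minimization over (0, mu_max)."""
    res = minimize_scalar(layer_sse, bounds=(1e-6, mu_max), method="bounded",
                          args=(F_prev, alpha, y))
    return res.x, res.fun          # (mu_k at a local minimum, corresponding SSE_k)
```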


4.1. DAN2 model selection and stopping criteria

An important challenge in any ANN modeling approach is the selection of a suitable model for process representation. Modelers examine the process and its data, and use theoretical information to hypothesize one or more models to capture the behavior of the underlying process. When established theoretical relationships exist, they serve to guide the model selection process. However, when no validated theoretical relationship is present, modelers often hypothesize a series of ANN models with different numbers of hidden layers and/or hidden nodes and use the in-sample data for model selection and validation. Traditionally, during the model selection process the in-sample data is partitioned into two sets: the training and the validation data sets. The validation data set is used to ensure the model is properly trained, while preventing under- and over-fitting to the training set. Model performance criteria are used to calibrate and compare the various hypothesized models using both the training and validation data sets in order to select the best model.

Since DAN2 uses a fixed number of hidden nodes (four) and sequentially, dynamically adds new layers until a user-specified accuracy measure (such as SSE or MSE) or a maximum number of iterations is attained, the model selection process is reduced to the selection of inputs. The challenge in DAN2, as in other methods, is to adequately train the model while avoiding under-/over-fitting. Addressing this challenge thus requires the development of additional stopping criteria that take these concerns into account.

If the model training stops too early, the network is said to be "under-trained" or "under-fitted." An under-trained model often has high SSE values for either or both of the training and validation data sets. Under-training often occurs when there are insufficient data for model fitting. DAN2 uses ∇₁ = (SSE_k − SSE_{k+1})/SSE_k ≤ ε₁ to assess the existence or absence of under-training in our models.

Over-training or over-fitting is a more common problem in neural net modeling. A neural net model is considered over-fitted (over-trained) when the network fits the in-sample data well but produces poor out-of-sample results. To avoid over-fitting, we divide the available in-sample data into the training and validation data sets. At each iteration k (k > 1), we compute MSE values for both the training (MSE_T) and validation (MSE_V) sets. We use ∇₂ = |MSE_T − MSE_V|/MSE_T ≤ ε₂ to guard against over-fitting.

The model is considered fully trained when the user-specified accuracy criteria and the over-fitting constraint are both satisfied. When a user specifies a relatively small accuracy measure, it may be possible to reach the over-fitting criterion ε₂ before reaching the desired level of accuracy. The modeler then needs to reexamine and revise the desired level of accuracy in order to avoid over-fitting. For a class of problems, such as electrical load forecasting, a desired level of accuracy is often pre-specified. Achieving such targets, for training and/or forecasting, can also be used as a stopping rule. The accuracy levels ε₁ and ε₂ are problem dependent and are determined experimentally. Finally, DAN2 offers the total number of iterations as a stopping rule for cases that do not follow the previous scenarios.
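One possible reading of these stopping rules in code; the threshold values, the argument names, and the exact way the three rules are combined are our assumptions, since the paper leaves ε₁ and ε₂ problem dependent.

```python
def dan2_should_stop(sse_prev, sse_curr, mse_train, mse_valid,
                     k, k_max=50, eps1=1e-3, eps2=0.10):
    """Stopping check after training layer k (illustrative thresholds).

    - relative improvement (SSE_k - SSE_{k+1}) / SSE_k <= eps1: little left to learn
    - |MSE_T - MSE_V| / MSE_T <= eps2: training and validation errors still agree
    - k >= k_max: hard cap on the number of dynamically added layers
    """
    improvement = (sse_prev - sse_curr) / sse_prev
    overfit_gap = abs(mse_train - mse_valid) / mse_train
    return (improvement <= eps1 and overfit_gap <= eps2) or k >= k_max
```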


5. Iterative improvement and properties

The algorithm employed in this formulation uses ordinary least squares (OLS) and an optional optimization (bisection or Newton) method to reduce the residual error in every iteration. The method begins with the linear formulation (k = 0) and uses OLS to estimate the parameters. If the desired accuracy is reached, the algorithm stops. Otherwise, the output value is used for the next iteration (k = 1) and DAN2 assumes the presence of a nonlinear relationship as described by Eq. (3). Since the relationship is nonlinear and a closed-form solution is often hard to obtain, we use a two-stage iterative approach to estimate the parameters of Eq. (3). For k = 1, we set μ_1 = 1 and use the OLS method to estimate the remaining parameters. We then use these parameters and one of the second-stage optimization strategies defined earlier to find the μ_1 that minimizes the total training error. For the next iteration (k = 2), we set μ_2 = μ_1 and use the OLS method to update the parameters of Eq. (3), then repeat the second stage of the process to find μ_2. This process continues until one of the stopping rules (desired accuracy or maximum number of iterations) is reached.

Model property 1: Improvement and non-deterioration of learning through μ_k. We can show that the algorithm used to estimate the best parameter values for Eq. (3) results in a reduction of the residual error in every iteration. Let us define S_k(A_k, μ_k) = Σ_i (F_k(X_i) − F̂(X_i))², where A_k = {a_k, b_k, c_k, d_k} for iteration k. In stage 1, we fix μ_k at an initial value μ_k^(0) and use OLS to minimize S_k with respect to the parameters in A_k; call this set of parameters A_k^(1). Clearly, for this value μ_k = μ_k^(0) we have S_k(A_k^(1), μ_k^(0)) ≤ S_k(A_k, μ_k^(0)) for any A_k. In stage 2, we minimize S_k(A_k^(1), μ_k) with respect to μ_k to obtain μ_k^(1) (using the strategies described in Section 4). This is a single-variable nonlinear minimization problem with many cos(μ_k α_i) and sin(μ_k α_i) terms; if there are multiple solutions, the value of μ_k that yields the smallest S_k is selected. Once again, S_k(A_k^(1), μ_k^(1)) ≤ S_k(A_k^(1), μ_k^(0)) ≤ S_k(A_k, μ_k^(0)), meaning that after every iteration of the algorithm the error is reduced (or kept the same).

Model property 2: Bounding the value of μ_k. This parameter is a scalar value, and together with α_i it defines the angle between the reference vector and each of the observation vectors. Without loss of generality, we assume that the resulting angles lie between 0° and 360°. We can therefore write 0 < μ_k α_i < 360°. To allow the smallest angle to reach its full value, we define the upper bound for μ_k as

   μ_max = max{360°/α_i for i = 1, 2, ..., n}.

Bounding 0 < μ_k < μ_max allows us to use the bisection method to solve the resulting nonlinear relation in stage 2 more effectively. If the comprehensive grid line search approach is used, the search in the μ_k domain can benefit from this bound as well.

Model property 3: Number of parameters in the model. In each iteration, five parameters need to be estimated (a_k, b_k, c_k, d_k, and μ_k) to fit the model. These parameters are then used to compute an improved output for that layer. The new output value and the transferred input values, α_i, are then used as the input for the next layer. At each iteration (layer), the algorithm computes these parameters to


obtain a better fit (lower SSE). The process repeats until one of the stopping criteria is reached.

Model property 4: Unlimited capacity for problem dimension. The proposed model is insensitive to the number of input variables. Unlike classical neural net models, which can be hampered by a large number of input variables, the DAN2 model beyond the linear stage compresses any number of variables into just one: the angle α.
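Model property 2's bound is straightforward to compute; a small helper, consistent with the degree-based angle convention used in the earlier sketches (our convention, not the paper's):

```python
import numpy as np

def mu_max(alpha):
    """Upper bound on mu_k from Model property 2: max over records of 360 / alpha_i."""
    alpha = np.asarray(alpha, dtype=float)
    return float(np.max(360.0 / alpha))      # assumes every alpha_i > 0
```

The returned value can be passed as the search bound to the grid-search or bounded-minimization sketches given earlier.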

6. Model performance and evaluation

We have used an implementation of the algorithm presented in Section 3, employing the comprehensive grid line search strategy discussed in Section 4, to evaluate the performance of this method. We now present the results of applying this method to two data sets, including the well-known Wolf's sunspot activity data set. We first present the sunspot results and then provide a comparative analysis with a traditional ANN system that uses the FFBP method.

6.1. Example 1: Wolf's sunspot activity data set

The sunspot series is an annual record of visible spots on the face of the sun from the year 1700 onward. The data set is nonlinear and non-Gaussian, and has traditionally been used to measure the effectiveness of nonlinear statistical models. In particular, the period 1700–1920 (221 points) has become a standard set used by many researchers [5,6,26] to evaluate the performance of various forecasting methods, including linear and nonlinear regressions and neural network models.

Cottrell et al. [5] report the results of applying several regression and traditional neural network models to the sunspot data. Their analysis used the period 1700–1920 for training and 1921–1955 (35 points) for forecasting. They report on two regression models (models A and B, Table 1) and five neural network models (models C–G, Table 1).

Table 1. Sunspot MSE values for traditional models

Model   MSE training   MSE test
(A)     206            214
(B)     195            N/A
(C)     112            N/A
(D)     123            129
(E)     125            N/A
(F)     170            N/A
(G)     135            N/A


The first four neural network models (models C, D, E, and F in Table 1) have four input units (the previous four years' values are used as lagged inputs), one output unit, and a single hidden layer with 3–5 hidden nodes. The last neural network model (model G, Table 1) uses lag years 1, 2, 9, and 11 as the input vector, one output unit, and one hidden layer with only 2 hidden nodes. These models and their corresponding training set MSEs are given in Table 1. The MSE values range from 112 (neural net model C) to 206 (an autoregression model). The residual variance for the test set (1921–1955) is 129 for the neural net model (D) and 214 for the autoregression model (A); residual test values for the other models are not reported.

Zhang [26] has also used this data set to evaluate the effectiveness of ANN, ARIMA, and hybrid models. He used the data for the period 1700–1920 for training and 1921–1987 (67 points) for testing, and two metrics, MSE and MAD, to measure the performance of each of the three models. These results are presented in Table 2.

The sunspot series has been extensively studied by many researchers, and models that effectively describe the behavior of the data have been proposed. Traditionally, sunspot researchers have reported their results for in-sample (training) and out-of-sample (testing) data only. In the studies cited here, no additional data set is used to further assess and validate model selection. We follow the same partitioning of the data and report training and testing results. This partitioning allows us to directly compare DAN2 results against existing studies.

We present three cases of sunspot data. In the first case, we use the data from 1700–1920 for training, using the previous four years as the lagged input variables. We then applied this trained model to forecast one step ahead for the next 35 and 67 years. These results are presented in Table 2 (DAN2-M1) and depicted in Fig. 3, with R² = 0.911 for the training set. The low training MSE of 105 and MAD of 7.5 reflect an excellent fit, visible in Fig. 3. Fig. 4 presents the forecasted values for 35 and 67 points ahead. We observed the variance of the data for the period 1700–1920 to be 1173, whereas the variance for the next 35 and 67 years increases by 43% (to 1673) and 109% (to 2453), respectively. This may explain the higher forecasting MSE values in all models. We also observed that the data has a pronounced peak that recurs roughly every 80 years. In the second case, using model DAN2-M1, we changed the training set data to the period 1700–1791 so as to include this peak only once. With this change, the MSE for the training set was lowered to 95 with R² = 0.930, and the MSE for the next 35 points was reduced to 98.

Table 2. Sunspot MSE and MAD values for DAN2 and other models (entries are MSE with MAD in parentheses)

                   DAN2-M1      DAN2-M2      SPSS         ARIMA        ANN          Hybrid
Training           105 (7.5)    95 (7.4)     175 (10.1)   N/A          N/A          N/A
35 points ahead    197 (10.2)   146 (9.6)    167 (10.1)   217 (11.3)   205 (10.2)   187 (10.8)
67 points ahead    286 (11.9)   266 (12.3)   359 (14.1)   306 (13.0)   351 (13.5)   280 (12.8)


Fig. 3. Sunspot activity plots (training). Two panels (DAN2 Model 1 and DAN2 Model 2) plot sunspots against year, comparing the actual series with the DAN2-M1 and DAN2-M2 fitted values.

The third case uses an alternate model with inputs consisting of the lagged years 1, 2, 9, and 11. The results of this model, using the training period 1700–1920, are also presented in Table 2 (DAN2-M2). The training set MSE value of 95 is lower than that of any other model, and is more favorable (by 30%) than the value reported by Cottrell for the ANN model that uses the same inputs (model G in Table 1; MSE = 135). This is by far the best-performing model. The forecast MSE values for 35 and 67 points ahead are also the lowest reported, and the associated MAD values are noticeably


Fig. 4. Sunspot activity plots (forecasting). Two panels (DAN2 Model 1 and DAN2 Model 2) plot sunspots against year from 1921 onward, comparing the actual series with the DAN2-M1 and DAN2-M2 forecasts.

better than those of all other models in Table 2. The training and forecast values of this model are depicted in Figs. 3 and 4, with a training set R² = 0.920.

We have also compared the performance of our model with that of a commercial ANN product. We used the Clementine data-mining software from SPSS [20]. The Clementine software offers various options for constructing the neural net architecture, including an approach that uses a pruning algorithm. We experimented with various configurations and found the "multiple network topologies" option to produce the best results for this problem. The results of applying this option to the same sunspot sets, with lagged input years 1, 2, 9, and 11, are also presented in


Table 3. The marketing example MSE and MAD values for DAN2 and Clementine (entries are MSE with MAD in parentheses)

                   SPSS MSE (MAD)   DAN2 MSE (MAD)
Training           7.87 (2.29)      5.58 (1.92)
5 points ahead     6.79 (2.22)      3.88 (1.85)
11 points ahead    5.54 (1.96)      3.82 (1.61)
22 points ahead    8.28 (2.25)      6.37 (1.93)

Table 2, with a training set R² = 0.850. This direct comparison shows that DAN2 outperforms the SPSS implementation. Our experiments were conducted with an implementation of this algorithm that uses the comprehensive grid line search technique described in Section 4. We expect to further improve these results once the optimization techniques described in Section 4 are fully implemented.

6.2. Example 2: The Clementine marketing data set

For this example we selected a marketing campaign data set from the Clementine library to compare the performance of our model with Clementine. The data set examines the relationship between promotional activities and the overall revenue generated from the sale of promoted products. The independent variables are the price of an item, total revenue from sales of the product, and the amount spent on various promotional campaigns for the item. The dependent variable measures the percentage increase in total revenue. The data set consists of 200 records.

Again, in order to directly compare DAN2's performance against Clementine, we use exactly the same model and the same set of data records for the training and forecasting periods. We used the first 178 records for training and the remaining records for forecasting (testing). The results of running the Clementine and DAN2 models are presented in Table 3. As the results indicate, the DAN2 model improves the MSE values for both the in-sample data (by 48%) and the out-of-sample data (45%, 31%, and 23% for 5, 11, and 22 points ahead, respectively). The DAN2 model MAD values are also more favorable.

7. Conclusions

We have presented a new approach to neural network modeling that offers several advantages over existing feed-forward, back-propagation models. The basic architecture of the model uses the entire in-sample data set collectively, both for the initial estimation of the parameters of the model and for updating them. This simultaneous use of the input values enables DAN2 to capture data patterns more effectively. A major distinction between this model and the traditional ANN is the way hidden layers and hidden nodes are introduced and utilized. In DAN2, the


number of hidden nodes per layer is fixed at four. The number of iterations, and therefore the number of hidden layers, is a function of the desired level of accuracy or fit, which is user-defined. The model provides scalability, since the initial step (the linear regression stage) is the only point in the process where the number of variables and observations influences the number of estimated parameters. Thereafter, the model continuously and dynamically estimates and updates only five parameters, making the model extremely scalable. Finally, the model uses a closed-form set of equations that are obtained at the end of each step of the training process. This partially removes the "black box" notion that has been associated with neural network models.

The results presented here demonstrate the effectiveness of the model compared to traditional models, when applied to an established benchmark data set (Wolf's sunspot activity data) as well as in a direct comparison with a commercial implementation of the FFBP method. The full strength of the model will require implementation of the optimization techniques introduced in Section 4. Once these techniques are investigated, we believe even more favorable results will emerge.

Acknowledgements

The authors wish to thank David K. Zimbra for his contribution to this paper.

References

[1] R. Battiti, First and second order methods for learning: between steepest descent and Newton's method, Neural Comput. 4 (2) (1992) 141–166.
[2] P. Bloomfield, Fourier Analysis of Time Series: An Introduction, Wiley, New York, 1976.
[3] R.P. Brent, Algorithms for Minimization without Derivatives, Prentice-Hall, Englewood Cliffs, NJ, 1973.
[4] D.L. Chester, Why two hidden layers are better than one?, in: Proceedings of the International Joint Conference on Neural Networks, 1990, pp. 1265–1268.
[5] M. Cottrell, B. Girard, Y. Girard, M. Mangeas, C. Muller, Neural modeling for time series: a statistical stepwise method for weight elimination, IEEE Trans. Neural Networks 6 (6) (1995) 1355–1364.
[6] C. DeGroot, D. Wurtz, Analysis of univariate time series with connectionist nets: a case study of two classical examples, Neurocomputing 3 (1991) 177–192.
[7] S. Fahlman, Faster-learning variations of back-propagation: an empirical study, in: D. Touretzky, G. Hinton, T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School, 1989, pp. 38–51.
[8] R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, Menlo Park, CA, 1990.
[9] T. Hill, L. Marquez, M. O'Connor, W. Remus, Artificial neural networks for forecasting and decision making, Int. J. Forecasting 10 (1994) 5–15.
[10] D.M. Himmelblau, Applied Nonlinear Programming, McGraw-Hill, New York, 1972.
[11] J.J. Hopfield, Neural networks and physical systems with emergent collective computational properties, Proc. Natl. Acad. Sci. USA 79 (1982) 2554–2558.
[12] K. Hornik, Approximation capabilities of multilayer feed-forward networks, Neural Networks 4 (1991) 251–257.


[13] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks 1 (4) (1988) 295–308.
[14] S. Kang, An investigation of the use of feed-forward neural networks for forecasting, Ph.D. Thesis, Kent State University, 1991.
[15] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybern. 43 (1982) 59–69.
[16] R. Reed, Pruning algorithms—a survey, IEEE Trans. Neural Networks 4 (5) (1993) 740–747.
[17] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representation by back-propagating errors, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, vol. 1, MIT Press, Cambridge, MA, 1986.
[18] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323 (6188) (1986) 533–536.
[19] R. Sharda, Neural networks for the MS/OR analyst: an application bibliography, Interfaces 24 (2) (1994) 116–130.
[20] SPSS—Clementine Data Mining System, SPSS, Inc., 1998.
[21] Z. Tang, P.A. Fishwick, Feed-forward neural nets as models for time series forecasting, ORSA J. Comput. 5 (4) (1993) 374–385.
[22] A.S. Weigend, N.A. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1993.
[23] F.S. Wong, Time series forecasting using back-propagation neural networks, Neurocomputing 2 (1991) 147–259.
[24] X.H. Yu, G.A. Chen, S.X. Cheng, Dynamic learning rate optimization of the back-propagation algorithm, IEEE Trans. Neural Networks 6 (3) (1993) 669–677.
[25] G. Zhang, B.E. Patuwo, M.Y. Hu, Forecasting with artificial neural networks: the state of the art, Int. J. Forecasting 14 (1998) 35–62.
[26] G.P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing 50 (2003) 159–175.
[27] X. Zhang, Time series analysis and prediction by neural networks, Optim. Meth. Software 4 (1994) 151–170.

Manoochehr Ghiassi is currently an associate professor and chair of the department of Operations & Management Information Systems at Santa Clara University, Santa Clara, CA. He received a B.S. from Tehran University, Iran, and an M.S. in Economics from Southern Illinois University at Carbondale. He also received an M.S. in Computer Science and a Ph.D. in Industrial Engineering, both from the University of Illinois at Urbana-Champaign. His current research interests include artificial neural networks, software engineering, software testing, and simulation modeling. He is a member of IEEE and ACM.

Hassine Saidane is currently an independent data-mining consultant, researcher, and adjunct faculty member at National University in San Diego. He has over 20 years of business experience in forecasting and quantitative data analysis for decision support at AT&T, Lucent Technologies, and NCR's Data-Mining Lab. He holds a B.S. in Physics from the University of Tunis, Tunisia, an M.S. in Managerial Accounting, and a Ph.D. in Industrial Engineering from the University of Illinois at Urbana-Champaign. His current research interests are in machine learning, forecasting, and data-mining algorithms.
