EAI732: Intelligent Systems Assignment 2: Neural Networks

May 31, 2017 | Author: Wihan Booyse | Category: Artificial Intelligence, Artificial Neural Networks

University of Pretoria
Faculty of Engineering, Built Environment and Information Technology
Department of Electrical, Electronic and Computer Engineering

EAI732: Intelligent Systems Technical Report Assignment 2: Neural Networks

Student: Wihan Booyse
Student Number: 12129233

April 14, 2016

DECLARATION OF ORIGINALITY
UNIVERSITY OF PRETORIA

The University of Pretoria places great emphasis upon integrity and ethical conduct in the preparation of all written work submitted for academic evaluation. While academic staff teach you about referencing techniques and how to avoid plagiarism, you too have a responsibility in this regard. If you are at any stage uncertain as to what is required, you should speak to your lecturer before any written work is submitted.

You are guilty of plagiarism if you copy something from another author's work (e.g. a book, an article or a website) without acknowledging the source and pass it off as your own. In effect you are stealing something that belongs to someone else. This is not only the case when you copy work word-for-word (verbatim), but also when you submit someone else's work in a slightly altered form (paraphrase) or use a line of argument without acknowledging it. You are not allowed to use work previously produced by another student. You are also not allowed to let anybody copy your work with the intention of passing it off as his/her work.

Students who commit plagiarism will not be given any credit for plagiarised work. The matter may also be referred to the Disciplinary Committee (Students) for a ruling. Plagiarism is regarded as a serious contravention of the University's rules and can lead to expulsion from the University.

The declaration which follows must accompany all written work submitted while you are a student of the University of Pretoria. No written work will be accepted unless the declaration has been completed and attached.

Full names of student:

Wihan Booyse

Student number: 12129233

Topic of work: EAI 732 Assignment 2

Declaration

1. I understand what plagiarism is and am aware of the University's policy in this regard.
2. I declare that this assignment report is my own original work. Where other people's work has been used (either from a printed source, Internet or any other source), this has been properly acknowledged and referenced in accordance with departmental requirements.
3. I have not used work previously produced by another student or any other person to hand in as my own.
4. I have not allowed, and will not allow, anyone to copy my work with the intention of passing it off as his or her own work.

SIGNATURE: Wihan Booyse

DATE: April 14, 2016

Abstract

The following report is an investigation into the use of neural networks to solve classification problems. The primary focus of the report is a quantification of the efficacy and efficiency of various training algorithms on the benchmark problems of the Proben1 dataset. To this end a basic feedforward neural network was constructed, and 9 unique training algorithms were implemented from first principles and tested on 8 different datasets using 24 different neural network architectures. The performance of each training algorithm on the test problems was evaluated, and it was concluded that the Quasi-Newton algorithm significantly outperforms all other algorithms in terms of computational expense and solution accuracy. In addition, the results of the neural network correlate very well with the published Proben1 results.


Contents

1 Introduction
2 Deviation From Assignment
3 Objectives
4 Proben1 Dataset
5 Experimental Procedure
6 Basic Feedforward Neural Network
   6.1 Network Architecture
   6.2 Neural Network Intuition
   6.3 Neural Network Training
   6.4 Back Propagation
   6.5 Transformation from Backpropagation to Gradient
7 Paradigm of Gradient based Algorithms
8 Resilient Propagation (RPROP) Algorithm
   8.1 RPROP Step Direction
   8.2 RPROP Step Size
   8.3 RPROP Algorithm
9 Line Search
10 Steepest Descent and Gradient Descent
   10.1 Steepest Descent and Gradient Descent Step Direction
   10.2 Gradient Descent Step Size
   10.3 Steepest Descent Step Size
   10.4 Problems Regarding Steepest Descent and Gradient Descent
   10.5 Gradient Descent Heuristics
11 Conjugate Gradient
   11.1 Problems Regarding Conjugate Gradient
   11.2 Second Order Properties of Conjugate Gradient
12 Quasi-Newton Method
   12.1 Introduction
   12.2 Advantages of Quasi-Newton
   12.3 Disadvantages and Modifications of Quasi-Newton
13 Genetic Algorithm Differential Evolution
   13.1 Genetic Algorithm Overview
   13.2 Discussion of Implemented Genetic Algorithm
   13.3 Comparison Between Genetic Algorithm and Differential Evolution
   13.4 Genetic Algorithm Implementation
      13.4.1 Initial Population
      13.4.2 Assign Fitnesses
      13.4.3 Reproduction
      13.4.4 Mutation
      13.4.5 Natural Selection
   13.5 Genetic Algorithm Summary
   13.6 Benefits of Genetic Algorithm
   13.7 Drawbacks of Genetic Algorithm
14 Particle Swarm Optimisation
   14.1 Overview
   14.2 Particle Swarm Implementation
   14.3 Swarm Parameter Discussion
   14.4 Benefits of Particle Swarm
   14.5 Drawbacks of Particle Swarm
15 Hybrid Methods
   15.1 Reasoning
   15.2 Implementation
16 Stopping Criteria
   16.1 Introduction
   16.2 PQ Intuition
   16.3 PQ Criteria
17 Experimental Results
   17.1 Cancer
      17.1.1 Cancer 1 4+2 Architecture
      17.1.2 Cancer 2 8+4 Architecture
      17.1.3 Cancer 3 4+4 Architecture
   17.2 Card
      17.2.1 Card 1 4+4 Architecture
      17.2.2 Card 2 4+0 Architecture
      17.2.3 Card 3 16+8 Architecture
   17.3 Diabetes
      17.3.1 Diabetes 1 2+2 Architecture
      17.3.2 Diabetes 2 16+8 Architecture
      17.3.3 Diabetes 3 4+4 Architecture
   17.4 Glass
      17.4.1 Glass 1 8+0 Architecture
      17.4.2 Glass 2 32+0 Architecture
      17.4.3 Glass 3 16+8 Architecture
   17.5 Heart
      17.5.1 Heart 1 8+0 Architecture
      17.5.2 Heart 2 4+0 Architecture
      17.5.3 Heart 3 16+8 Architecture
   17.6 Heartc
      17.6.1 Heartc 1 4+2
      17.6.2 Heartc 2 8+8
      17.6.3 Heartc 3 24+0
   17.7 Horse
      17.7.1 Horse 1 4+0
      17.7.2 Horse 2 4+4
      17.7.3 Horse 3 8+4
   17.8 Soybean
      17.8.1 Soybean 1 16+8
      17.8.2 Soybean 2 32+0
      17.8.3 Soybean 3 16+0
18 Critical Analysis of Training Algorithms
   18.1 Overview
   18.2 Average Runs
   18.3 Normalised Least Squares Inconsistency
   18.4 Further Work
19 Algorithm Training Characteristics
   19.1 Gradient Descent
   19.2 Steepest Descent
   19.3 RPROP
   19.4 Conjugate Gradient
   19.5 Quasi-Newton
   19.6 PSO
   19.7 Genetic Algorithm
   19.8 Discussion
20 Conclusion

List of Figures

1 Gradient Descent Convergence
2 Steepest Descent Convergence
3 Steepest Descent Convergence
4 Gradient Descent Convergence on Cancer 1 Dataset
5 Steepest Descent Convergence on Cancer 1 Dataset
6 RPROP Convergence on Cancer 1 Dataset
7 Conjugate Gradient Convergence on Cancer 1 Dataset
8 Quasi-Newton Convergence on Cancer 1 Dataset
9 Particle Swarm Convergence on Cancer 1 Dataset
10 Genetic Algorithm Convergence on Cancer 1 Dataset

List of Tables

1 Comparison between local and global strategies
2 Cancer 1 Training Algorithm Performance
3 Cancer 2 Training Algorithm Performance
4 Cancer 3 Training Algorithm Performance
5 Card 1 Training Algorithm Performance
6 Card 2 Training Algorithm Performance
7 Card 3 Training Algorithm Performance
8 Diabetes 1 Training Algorithm Performance
9 Diabetes 2 Training Algorithm Performance
10 Diabetes 3 Training Algorithm Performance
11 Glass 1 Training Algorithm Performance
12 Glass 2 Training Algorithm Performance
13 Glass 3 Training Algorithm Performance
14 Heart 1 Training Algorithm Performance
15 Heart 2 Training Algorithm Performance
16 Heart 3 Training Algorithm Performance
17 Heartc 1 Training Algorithm Performance
18 Heartc 2 Training Algorithm Performance
19 Heartc 3 Training Algorithm Performance
20 Horse 1 Training Algorithm Performance
21 Horse 2 Training Algorithm Performance
22 Horse 3 Training Algorithm Performance
23 Soybean 1 Training Algorithm Performance
24 Soybean 2 Training Algorithm Performance
25 Soybean 3 Training Algorithm Performance
26 Average Normalised Least Squares Performance of Training Algorithms on Test Set
27 Average Classification Error Performance of Training Algorithms on Test Set

List of Algorithms

1 Basic Feedforward Neural Network
2 RPROP Algorithm
3 Armijo Line Search Algorithm
4 Genetic Algorithm

1 Introduction

The following report is an investigation into the performance of a feedforward artificial neural network and the effect of various training algorithms on the network. To this end one neural network will be created, and the performance of this network will be analysed when trained using 9 different training algorithms. The training algorithms which will be used are:

1. Gradient Descent
2. Resilient Propagation (RPROP)
3. Steepest Descent
4. Conjugate Gradient
5. Quasi-Newton using the Davidon-Fletcher-Powell (DFP) inverse Hessian update
6. Genetic Algorithm - Differential Evolution
7. Particle Swarm Optimisation
8. Genetic Algorithm coupled with Quasi-Newton (GAQN)
9. Particle Swarm Optimisation coupled with Quasi-Newton (PSOQN)

2 Deviation From Assignment

As this report places a heavy emphasis on a comparative analysis between the training algorithms, it is necessary to deviate from the specified assignment due to the computational time required to train the 9 different neural networks. The section investigating the use of a neural network on the MNIST dataset will therefore be omitted, as MNIST is very computationally expensive owing to the required number of hidden nodes, which has a direct impact on the training requirements. In addition, there is significant literature available which demonstrates the power of an ANN on the MNIST dataset [5].

3 Objectives

The objective of the following investigation is to determine the performance of artificial neural networks when trained using global and local optimisation algorithms. In particular, the primary parameters of interest are the accuracy of the predictions and the number of function evaluations required for training.


4 Proben1 Dataset

In order to compare neural networks it is necessary to make use of a well documented database which includes benchmarks for neural network performance. One such database is the Proben1 dataset [7]. The Proben1 dataset consists of 15 real datasets (benchmarking problems), all of which have been solved by the authors using the Resilient Propagation (RPROP) algorithm, and its documentation contains a comprehensive explanation of all 15 benchmarking problems. Proben1 is an exceptionally powerful tool for the evaluation of neural networks, as it is possible to compare the results of the network being evaluated directly against the published Proben1 results. The Proben1 dataset has a number of key advantages which make it exceptionally useful for the evaluation of neural networks:

1. There is a systematic explanation of the entire process used to solve the datasets.
2. The error metrics used are explained in detail.
3. All benchmark problems are real datasets, not synthetic datasets, which are inherently biased.
4. The benchmark problems cover a wide range of problems, including both classification and regression.
5. The expected performance of a neural network with a specified topology is well documented and may thus be used as a control group for the comparison of results.
6. The documented Proben1 results also serve as a validation measure to ensure that the neural network is performing as expected.

It should also be noted that it is not necessary to explain what each dataset contains and represents, as neural networks essentially do not care what the data they are given means, which is itself a testament to their prediction capabilities.

5 Experimental Procedure

The neural networks will be evaluated using the Proben1 dataset. In order to measure the effect of training algorithms on the performance of neural networks, one generic feedforward neural network will be developed. For the neural network to be satisfactory for the task at hand, it must satisfy the following specifications:

1. Able to handle all basic feedforward topologies
2. Able to perform both classification and regression
3. Compatible with all 9 training algorithms
4. Compatible with any training algorithm with slight modification, dependent on the training algorithm's requirements
5. Able to generate results similar to those of the Proben1 dataset when using RPROP

Once a neural network with these characteristics exists, the Proben1 datasets will be solved using the 9 different training algorithms. This process is outlined as follows:

1. All of the classification datasets in Proben1, except for mushroom and gene, will be solved using each training algorithm 20 times.
2. The topology of the neural networks will be consistent with the topologies specified in Table 5 of the Proben1 publication [7].
3. For classification problems, the accuracy comparison metrics used are the classification error and the Proben1 Normalised Least Squares Error [7].
4. Unfortunately the regression problems are not presented, due to an error in their error function when they were submitted to the cluster.
5. All algorithms except for the Differential Evolution Genetic Algorithm will use exactly the same stopping conditions.
6. The training algorithms will be compared using two metrics: the number of function evaluations and the accuracy comparison metric.

The solutions produced by the various neural networks will then be critically compared to the expected Proben1 results, with particular emphasis on the accuracy of the solution, the consistency of each training algorithm and the computational expense required to train the network.

6 Basic Feedforward Neural Network

In order to evaluate the performance of different training algorithms, it is imperative that a single generic feedforward neural network is created and that the training algorithms are simply 'plugged into' it.

6.1 Network Architecture

The neural network that has been created is a basic feedforward network, with gradient information obtained by means of backpropagation. The activation function that has been used is the tanh function, chosen on the basis of its simple derivative. The tanh function is given in Equation 1 and its derivative in Equation 2:

\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \qquad (1)

\tanh'(a) = 1 - \tanh(a)^{2} \qquad (2)

For classification it is also necessary to specify an output function to normalise the outputs of the final layer into class probabilities. One function which is suitable for this is the softmax, given in Equation 3:

\mathbf{y} = \frac{e^{\mathbf{a}}}{\lVert e^{\mathbf{a}} \rVert} \qquad (3)

Once the activation function and the output function have been specified, the neural network may be constructed in a feedforward manner as follows, for a network with two hidden layers (this may, however, be extended to N hidden layers):

1. Each feature value, plus one bias feature value, from the input data x is multiplied by the weight matrix W^(1) of the first hidden layer, which results in the input vector to the first layer, a^(1).
2. The first layer input vector a^(1) is then passed through the activation function, resulting in the output vector of the first hidden layer, z^(1).
3. The output vector z^(1) of the first hidden layer is then multiplied by the weight matrix of the second hidden layer, W^(2), resulting in the input vector to the next layer, a^(2).
4. The second layer input vector a^(2) is then passed through the activation function, resulting in the output vector of the second hidden layer, z^(2).
5. For classification, the output vector of the second hidden layer is multiplied by the output layer weight matrix W^(3), and the resulting vector a^(3) is passed through the softmax function (Equation 3), which yields the output class probabilities y. The output class is then selected as the class with the highest probability in y.
6. For regression, the output vector of the second hidden layer is multiplied by the output layer weight matrix W^(3), and the resulting vector y is the output value.

6.2 Neural Network Intuition

An intuitive explanation of how artificial neural networks work is as follows. Consider a simple regression neural network with one hidden layer. The output of the network y may be seen as the solution to a set of linear equations with coefficients W^(2) and variables z^(1). This is expressed mathematically in Equation 4:

W^{(2)} z^{(1)} = \mathbf{y} \qquad (4)

This logic may then be propagated backwards through the neural network: considering the hidden node, the input to the hidden node a^(1) may be seen as the solution to the system of equations outlined in Equation 5:

W^{(1)} \mathbf{x} = \mathbf{a}^{(1)} \qquad (5)

At each node there is a non-linear mapping which links the output of the node to its input. This mapping is performed by the activation function outlined in Equation 1. Let h(\cdot) denote any arbitrary activation function. Then the mapping between the input and output of a hidden node is given by Equation 6:

h(\mathbf{a}^{(1)}) = \mathbf{z}^{(1)} \qquad (6)

Combining Equation 4, Equation 5 and Equation 6 yields the relationship given in Equation 7:

W^{(2)} h(W^{(1)} \mathbf{x}) = \mathbf{y} \qquad (7)

Equation 7 demonstrates that a neural network may be solved by solving a system of non-linear equations where the unknowns are not the solutions to the equations but the coefficients of this system of non-linear equations. From Equation 7 the process by which an artificial neural network performs classification and regression is clear. In the first layer the neural network maps the input data into a dimensional space consistent with the number of hidden nodes and constructs a linear regression problem whose solution is the input data. The output of this linear regression problem then undergoes a non-linear transformation, and this transformed data is mapped into another dimensional space consistent with the number of outputs, where another linear regression problem is constructed. Using the training algorithm, the weights (the coefficients of the two regression problems) are tuned to minimise a specified error function. Essentially the entire artificial neural network is a highly complex non-linear function of a specified form; the training algorithms tune the parameters of this non-linear function to a point where, given the input, it produces an output close to the desired output.


6.3 Neural Network Training

Solving the system of non-linear equations outlined in Equation 7 exactly would only result in overfitting; instead, it is possible to set the problem up as a regression problem. The regression problem is set up by specifying an error metric which relates the output of the neural network to the target values, and this error metric is then minimised. The error metrics used for classification and regression are given in Equation 8 and Equation 9 respectively:

E = -\mathbf{t}^{T} \ln(\mathbf{y}) \qquad (8)

where t is simply a binary encoding of the correct class, thus resulting in a scalar value for Equation 8.

E = \lVert \mathbf{y} - \mathbf{t} \rVert \qquad (9)

In order for a neural network to be trained it is necessary to minimise the error by tuning the weights. As this is a multidimensional optimisation, the availability of gradient information will significantly reduce the time required to solve the system, since calculating the gradients numerically would require significantly more function evaluations. The gradient may be found by differentiating the error metric with respect to the weights, using a method known as back-propagation [2].
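As a quick numerical illustration, the two error metrics can be computed as follows (a sketch with invented example values; the function names are not from the report):

```python
import numpy as np

def classification_error(y, t):
    # Cross-entropy of predicted class probabilities y against the
    # binary (one-hot) target encoding t, as in Equation 8
    return -t @ np.log(y)

def regression_error(y, t):
    # Euclidean norm of the residual, as in Equation 9
    return np.linalg.norm(y - t)

y = np.array([0.7, 0.2, 0.1])   # example class probabilities
t = np.array([1.0, 0.0, 0.0])   # binary encoding of the correct class
print(classification_error(y, t))   # -ln(0.7), roughly 0.357

print(regression_error(np.array([0.5, 1.5]), np.array([0.0, 1.0])))
```

Note that only the probability assigned to the correct class contributes to Equation 8, because t zeroes out every other term.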

6.4 Back Propagation

Back-propagation is simply a method for systematically evaluating the derivatives of the error function of an artificial neural network with respect to the weights. The method may be outlined as follows, for a network with N weight layers (N - 1 hidden layers).

By definition, applying the chain rule:

1. \frac{\partial E}{\partial W^{(N)}_{ij}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial a^{(N)}_i}\,\frac{\partial a^{(N)}_i}{\partial W^{(N)}_{ij}}

2. \frac{\partial E}{\partial W^{(N-1)}_{jk}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial a^{(N)}_i}\,\frac{\partial a^{(N)}_i}{\partial z^{(N-1)}_j}\,\frac{\partial z^{(N-1)}_j}{\partial a^{(N-1)}_j}\,\frac{\partial a^{(N-1)}_j}{\partial W^{(N-1)}_{jk}}

It follows that (with \odot denoting the element-wise product):

1. \frac{\partial E}{\partial W^{(N)}} = (\mathbf{y} - \mathbf{t})\, \mathbf{z}^{(N-1)T}

2. \frac{\partial E}{\partial W^{(N-1)}} = \left(W^{(N)T}(\mathbf{y} - \mathbf{t})\right) \odot h'(\mathbf{a}^{(N-1)})\; \mathbf{z}^{(N-2)T}

3. \frac{\partial E}{\partial W^{(N-2)}} = \left(W^{(N-1)T}\left[\left(W^{(N)T}(\mathbf{y} - \mathbf{t})\right) \odot h'(\mathbf{a}^{(N-1)})\right] \odot h'(\mathbf{a}^{(N-2)})\right) \mathbf{z}^{(N-3)T}

It is evident that a pattern is emerging: whenever the next layer from the back is considered, the result of the previous derivative is used to calculate the following derivative. Using this relationship it is possible to cascade the derivatives in the following manner:

\boldsymbol{\delta}^{(N)} = \mathbf{y} - \mathbf{t}

\boldsymbol{\delta}^{(N-1)} = \left(W^{(N)T}\boldsymbol{\delta}^{(N)}\right) \odot h'(\mathbf{a}^{(N-1)})

\boldsymbol{\delta}^{(N-i)} = \left(W^{(N+1-i)T}\boldsymbol{\delta}^{(N+1-i)}\right) \odot h'(\mathbf{a}^{(N-i)}), \qquad 1 \le i \le N-1,\; i \in \mathbb{Z}

Using these identities it is possible to define a general expression for the backpropagation derivatives (with \mathbf{z}^{(0)} \equiv \mathbf{x}):

\frac{\partial E}{\partial W^{(N-i)}} = \boldsymbol{\delta}^{(N-i)}\, \mathbf{z}^{(N-i-1)T}, \qquad 0 \le i \le N-1

so that, in particular, \frac{\partial E}{\partial W^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^{T}.

The backpropagation derivatives which have been presented here are the derivatives for one datapoint; consequently the derivatives should be calculated for each datapoint and summed in order to give the derivative over the entire dataset.
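The cascading delta recursion can be sketched as follows. This is a hedged, single-datapoint illustration for a regression-style squared error (where the output delta is y - t, consistent with the derivation above); the layer sizes are invented and bias terms are omitted:

```python
import numpy as np

def forward(x, weights):
    # Forward pass storing each layer's activation z for backpropagation
    zs = [x]
    for W in weights[:-1]:
        zs.append(np.tanh(W @ zs[-1]))
    y = weights[-1] @ zs[-1]       # linear output layer (regression case)
    return y, zs

def backprop(x, t, weights):
    # Gradients of E = 0.5 * ||y - t||^2 via the cascading delta recursion
    y, zs = forward(x, weights)
    delta = y - t                          # delta at the output layer
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = np.outer(delta, zs[i])  # dE/dW = delta z^T (outer product)
        if i > 0:
            # delta^(l-1) = (W^(l)T delta^(l)) ⊙ h'(a^(l-1));
            # for tanh, h'(a) = 1 - tanh(a)^2 = 1 - z^2
            delta = (weights[i].T @ delta) * (1 - zs[i] ** 2)
    return grads

rng = np.random.default_rng(1)
shapes = [(5, 3), (4, 5), (2, 4)]          # invented layer sizes
Ws = [rng.uniform(0, 0.1, s) for s in shapes]
x, t = rng.normal(size=3), rng.normal(size=2)
grads = backprop(x, t, Ws)
print([g.shape for g in grads])            # one gradient per weight matrix
```

Summing the returned gradients over all datapoints gives the dataset gradient, exactly as described above.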


The algorithm used to construct the neural network is as follows:

Algorithm 1: Basic Feedforward Neural Network

  Result: Predictions, Error, Error Gradient
  Initialise: weights randomly between 0 and 0.1; N = number of hidden layers + 1
  for each datapoint do
      for i = 1 to N do
          if i < N then
              calculate the values of a, W, z
          else
              calculate the value of y
              calculate the error
          end
      end
      calculate the gradient via backpropagation
  end
  sum gradients
  sum errors
  output predictions

This function is then called repeatedly by the training algorithm, with new weights replacing the initialised weights. Once the error is satisfactory the process ends, and the final weights are used to calculate the test set predictions.

6.5 Transformation from Backpropagation to Gradient

It should be noted that for the rest of the report the derivative \frac{\partial E}{\partial W} is rearranged into a single column vector. There are significant linear algebra advantages to this rearrangement. The validity of this transformation stems from Equation 7 and either of the error functions, which show that the error is simply a scalar non-linear multivariate function of all the weights. This rearranged derivative is denoted \nabla E. Without this transformation all gradient based training algorithms would become significantly more expensive and unnecessarily complex to implement.

7 Paradigm of Gradient based Algorithms

It is possible to visualise the current values of the weight parameters as a location in a higher dimensional space, where each location has an associated error function value. The goal of any training algorithm is to move to a point in this space where the specified error function is a minimum without overfitting; this point may be known as the utopia point.

In order for a gradient based training algorithm to move towards the utopia point, it must be given a direction to move in. This direction is obtained from the gradient produced by backpropagation, which is a vector: its direction is the direction of maximum increase in the error function, and its magnitude is an indication of the rate of change of the error function in that direction. As the goal of neural network training is to minimise the error function, it is logical that the step taken should be in the direction opposite to the gradient.

Once the direction of the step is known, it is necessary to determine the size of the step. As the gradient produced by backpropagation is simply a vector of values and not a function, it is not possible to calculate this step size analytically. Each gradient based algorithm has its own means of determining the step size, and this decision is a core component of any gradient based algorithm.

In summary, any gradient based algorithm may be broken down into two components:

1. The direction of the step
2. The size of the step

These two components will be the primary focus in the explanations of the gradient based algorithms; an explanation of how they are determined is a comprehensive explanation of an algorithm's methodology.
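The two-component decomposition above can be expressed as one generic update loop into which each algorithm plugs its own step-size rule. The sketch below uses an invented toy error function, not the network error; real training would supply the backpropagation gradient instead:

```python
import numpy as np

def train(w, grad_fn, step_size_fn, iterations=100):
    # Generic gradient-based paradigm: the direction comes from the
    # gradient, the step size is delegated to the specific algorithm
    for k in range(iterations):
        g = grad_fn(w)
        direction = -g                     # opposite to the gradient
        w = w + step_size_fn(k, g) * direction
    return w

# Toy error E(w) = ||w||^2, whose gradient is 2w and whose minimum is at 0
grad_fn = lambda w: 2 * w
step_size_fn = lambda k, g: 0.1            # fixed step (gradient descent style)
w_final = train(np.array([3.0, -2.0]), grad_fn, step_size_fn)
print(w_final)                             # converges towards [0, 0]
```

Swapping step_size_fn is all that distinguishes, for example, a fixed-step gradient descent from a line-search-based steepest descent in this framing.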

8

Resilient Propagation (RPROP) Algorithm

The first algorithm to be discussed is the Resilient Propagation algorithm, which is used as the benchmark algorithm for the Proben1 dataset. Resilient Propagation is an exceptionally popular algorithm for the training of networks, as it is easy to implement, uses gradient information and is computationally inexpensive [9].

8.1

RPROP Step Direction

RPROP is unlike any of the other gradient based algorithms which will be discussed, as it only uses the sign of the elements in the gradient vector and not their magnitude. The reasoning behind this is that RPROP updates the weights with a magnitude that is independent of the gradient; the gradient is simply used to indicate the sign of the update, and hence a direction.


8.2

RPROP Step Size

The step size used in the RPROP algorithm is initialised to a value of 0.1; however, the value of this parameter does not affect the outcome of the training. Once a step has been taken, the algorithm evaluates whether the previous step was too big or too small using Equation 10, which is based on the principle that when a minimum is passed the sign of the gradient changes. Thus, if Equation 10 returns a positive value, the previous step was too small and the next step should be larger. If Equation 10 returns a negative value, a minimum was bypassed: the previous step was too large and the algorithm must backtrack and reduce the size of the step. Equation 10 returns a value of zero when the algorithm has backtracked and the reduced step should now be taken. The parameters used to increase and decrease the step size were selected as η+ = 1.2 and η− = 0.5 respectively. Once again, this choice of parameters does not have an impact on RPROP's ability to find the solution [9].

ξ = ∇E_{k−1}^T ∇E_k

(10)


8.3

RPROP Algorithm

Result: Returns: Weights
Initialise: initialise the first update step as 0.1; define a maximum step; define a minimum step
while Stopping = False do
    if ξ > 0 (step too small) then
        step_{k+1} = min(step_k × η+, maxstep)
        direction_{k+1} = −sign(∇E_k)
        W_{k+1} = W_k + direction_{k+1} × step_{k+1}
    else if ξ < 0 (step too big) then
        step_{k+1} = max(step_k × η−, minstep)
        W_{k+1} = W_k − direction_k × step_k (backtrack)
    else if ξ = 0 (post backtracking) then
        direction_{k+1} = −sign(∇E_k)
        W_{k+1} = W_k + direction_{k+1} × step_{k+1}
    end
    pass the updated weights to the neural network
end


Algorithm 2: RPROP Algorithm
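Algorithm 2 can be sketched in a few lines of Python. This is an illustrative implementation only: the function names and the `grad_fn` interface are assumptions, and it follows the common variant that zeroes the gradient after a sign change instead of explicitly restoring the previous weights, which has the same backtracking effect.

```python
import numpy as np

def rprop(grad_fn, w, n_iters=100, step_init=0.1,
          eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """Illustrative RPROP sketch. grad_fn(w) returns the error gradient,
    e.g. from backpropagation. Names and defaults are assumptions."""
    step = np.full_like(w, step_init)    # one adaptive step size per weight
    g_prev = np.zeros_like(w)
    for _ in range(n_iters):
        g = grad_fn(w)
        xi = g_prev * g                  # sign comparison, Equation 10
        grow, shrink = xi > 0, xi < 0
        step[grow] = np.minimum(step[grow] * eta_plus, step_max)    # too small: grow
        step[shrink] = np.maximum(step[shrink] * eta_minus, step_min)  # too big: shrink
        g = np.where(shrink, 0.0, g)     # zeroed gradient skips the update (backtracking effect)
        w = w - np.sign(g) * step        # step with equal per-weight magnitude
        g_prev = g
    return w
```

Because the step sizes adapt multiplicatively, the initial value of 0.1 has little influence on the final result, in line with the observation above.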

9

Line Search

For some gradient based algorithms to operate it is necessary that a line search is performed in order to find the optimal step size. As the line search is an iterative process, it is highly likely that most of the function evaluations of a gradient based algorithm are due to this line search. In general there are two types of line searches, exact and inexact [1]. Exact line searches return the minimum of the function along the search direction with high accuracy; however, they often require significantly more function evaluations than inexact line searches for a minimal increase in accuracy. As a consequence, the line search selected for use in the training of the neural network is an inexact line search based on Armijo's rule, which calculates the step size as follows:

Proof. Let:

f(α) = f(W_k + α d_k)

Using a Taylor expansion it is possible to define a linear step function:

q(α) = f(0) + α[ρ f′(0)]

where ρ is a fixed number between 0 and 1 (0 < ρ < 1). From this it is possible to produce the following inequalities.

A step is not too large if: f(α) ≤ q(α)
A step is not too small if: f(ηα) > q(ηα)

where η is a fixed value, η > 1.

From these conditions it is possible to construct an inexact line search algorithm based on Armijo's rule as follows:

Result: Returns: step size
Initialise: select values for ρ and η; select a maximum number of iterations k_max; select an initial step α (the choice of the initial step does not matter)
while [f(α) > q(α) or f(ηα) < q(ηα)] and k < k_max do
    if f(α) > q(α) (step too big) then
        α = α/η
    else if f(ηα) < q(ηα) (step too small) then
        α = ηα
    end
    pass α to the optimisation algorithm
end

Algorithm 3: Armijo Line Search Algorithm

For the purposes of this report the parameters selected were η = 2 and ρ = 0.9. These parameter values are recommended in [1].
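The Armijo test of Algorithm 3 can be sketched as follows. This is a minimal illustration, assuming a weight vector `w`, a descent direction `d` and the gradient at `w`; the directional derivative f′(0) is computed as the dot product of the gradient with d, and all names are hypothetical.

```python
import numpy as np

def armijo_step(f, grad, w, d, alpha=1.0, rho=0.9, eta=2.0, max_iter=30):
    """Inexact line search based on Armijo's rule (a sketch of Algorithm 3)."""
    f0 = f(w)
    slope = rho * float(np.dot(grad, d))    # rho-damped directional derivative f'(0)
    q = lambda a: f0 + a * slope            # the linear comparison function q(alpha)
    for _ in range(max_iter):
        if f(w + alpha * d) > q(alpha):                  # step too big
            alpha /= eta
        elif f(w + eta * alpha * d) <= q(eta * alpha):   # step too small
            alpha *= eta
        else:
            break
    return alpha
```

With ρ = 0.9 the comparison line q(α) hugs the function closely, so the accepted steps are conservative; smaller values of ρ would accept larger steps.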

10

Steepest Descent and Gradient Descent

Steepest Descent and Gradient Descent are both gradient based algorithms with only one significant difference: Gradient Descent assumes a step size (learning rate), whereas Steepest Descent calculates the step size explicitly. Hence the step for both algorithms is given below in Equation 11.

W_{k+1} = W_k + α d_k

(11)

where:
d_k = step direction
α = step size (learning rate)

10.1

Steepest Descent and Gradient Descent Step Direction

Steepest Descent and Gradient Descent both use the same step direction. This step direction is based purely on the gradient information supplied by backpropagation. The step direction, denoted d_k, is calculated using Equation 12.

d_k = −∇E_k

(12)

10.2

Gradient Descent Step Size

Gradient Descent is by far the most rudimentary gradient based algorithm. It assumes a fixed step size, also known as the learning rate, and then performs this step in the direction of maximum descent. The learning rate is specified by the designer and, although there are methods to alter this rate as the algorithm progresses (adaptive learning rates, simulated annealing and momentum), Gradient Descent still possesses exceptionally poor convergence, which may result in significantly longer training times than necessary. Since the neural network should be generic enough to solve all of the problems considered, it was necessary to specify a learning rate of 0.02, as this value always obtained convergence.

10.3

Steepest Descent Step Size

Steepest Descent is a gradient based algorithm which is significantly more mathematically sound than Gradient Descent; however, it is also prone to slow convergence. The step size used by Steepest Descent is calculated explicitly for each step by means of either an exact or an inexact line search. The line search used by Steepest Descent is a design choice and is dependent on the problem; this was discussed in detail in Section 9 regarding the choice of line search.


10.4

Problems Regarding Steepest Descent and Gradient Descent

Both Steepest Descent and Gradient Descent are first order optimisation algorithms, which means that neither attempts to approximate second order derivatives. This results in a significant reduction in convergence speed in regions where the gradient does not change significantly in relation to the step size. Both Steepest Descent and Gradient Descent may 'zig-zag' around the solution, or even bypass it completely. This behaviour is best illustrated graphically. Consider the polynomial given by Equation 13, also known as Rosenbrock's function.

f(x, y) = 100(y − x²)² + (1 − x)²

(13)

Equation 13 has a minimum at (x, y) = (1, 1). Solving this problem using Gradient Descent yields Figure 1, and using Steepest Descent yields Figure 2.

[Figure: contour plot of Rosenbrock's function showing the utopia point and the Gradient Descent training path]

Figure 1: Gradient Descent Convergence


[Figure: contour plot of Rosenbrock's function showing the utopia point and the Steepest Descent training path]

Figure 2: Steepest Descent Convergence

From Figure 1 and Figure 2 it is evident how both Steepest Descent and Gradient Descent may experience slow convergence at locations near the utopia point. Steepest Descent was stopped after 1000 function evaluations; however, it does eventually converge. Due to the fact that both solvers blindly step in the direction of maximum descent, it is clear that this sort of 'zig-zag' behaviour may occur. Furthermore, due to the fixed step size used by Gradient Descent, it is entirely possible that it may bypass the utopia point and cover a very large search space before returning. It is also possible that Gradient Descent bypasses a minimum and never returns, as illustrated in Figure 1 where it terminated at a saddle point. It should be noted that the function considered here is not well conditioned for gradient based algorithms; however, it is highly likely that a neural network error surface is similarly ill conditioned. This 'zig-zag' behaviour can be very detrimental to the training times of neural networks using both these algorithms, as the training error does decrease with each step, but at an exceptionally suboptimal rate. This poor convergence may be mitigated by using solvers which consider second order information, as the second derivative of a function indicates its curvature and is thus an indication of whether the current weight parameters are indeed at a local minimum.
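The behaviour discussed above is easy to reproduce. The sketch below runs fixed-step Gradient Descent on Equation 13; the learning rate of 0.001 and the iteration count are illustrative choices for this two-variable problem (the network training in this report used 0.02), and all names are hypothetical.

```python
import numpy as np

def rosenbrock(p):
    """Equation 13: f(x, y) = 100(y - x^2)^2 + (1 - x)^2."""
    x, y = p
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def rosenbrock_grad(p):
    """Analytical gradient of Equation 13."""
    x, y = p
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])

def gradient_descent(p, lr=1e-3, n_iters=20000):
    # fixed-step Gradient Descent: always step against the gradient
    for _ in range(n_iters):
        p = p - lr * rosenbrock_grad(p)
    return p
```

Progress along the curved valley of Equation 13 is extremely slow because the fixed step must be small enough to remain stable on the steep valley walls.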


10.5

Gradient Descent Heuristics

Gradient Descent heuristics such as momentum weighting and inertia were not considered here, as their implementation would not contribute substantially to this report compared to the time it would take to evaluate their impact.

11

Conjugate Gradient

The first algorithm to be considered that uses second order information is the Conjugate Gradient algorithm. The Conjugate Gradient algorithm is a second order method which minimises the function in a direction that is also a minimum along all previously computed search directions, unlike Steepest Descent, which only minimises the function along the current direction. Mathematically this may be expressed as:

∇E_k^T d_i = 0

(14)

where i = 0, 1, ..., k − 1. It is evident that, given a highly nonlinear, non-analytical problem, it is exceptionally difficult to develop a closed form equation which satisfies Equation 14 for all values of i. Hence the approach which will be followed is the one outlined by Fletcher and Reeves in [6]. This approach approximates the function being considered as a quadratic function and thus benefits from the mathematical conveniences associated with quadratic functions. From the quadratic assumption the update direction may be calculated as:

d_k = −∇E_k + β_k d_{k−1}

(15)

which only requires the computation of β_k, which may be calculated using the Fletcher-Reeves update given in Equation 16 [6]:

β_k = (∇E_k^T ∇E_k) / (∇E_{k−1}^T ∇E_{k−1})

(16)

From Equation 15, and using a step size calculated from the line search, the Conjugate Gradient weight update is given below in Equation 17.

W_{k+1} = W_k + α(−∇E_k + β_k d_{k−1})

(17)

which is of the same form as the updates for Steepest Descent and Gradient Descent.
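Equations 15 to 17 can be combined into a short sketch. This illustration substitutes a crude backtracking loop for the Armijo line search of Section 9, and restarts from steepest descent whenever the current direction yields no decrease; all names are assumptions.

```python
import numpy as np

def fletcher_reeves(f, grad_fn, w, n_iters=100):
    """Fletcher-Reeves Conjugate Gradient sketch (Equations 15-17)."""
    g = grad_fn(w)
    d = -g                                  # first direction: steepest descent
    for _ in range(n_iters):
        f0, alpha = f(w), 1.0
        # crude backtracking line search standing in for the Armijo search
        while f(w + alpha * d) >= f0 and alpha > 1e-12:
            alpha *= 0.5
        if f(w + alpha * d) >= f0:          # no decrease found: restart from steepest descent
            d = -g
            continue
        w = w + alpha * d                   # Equation 17
        g_new = grad_fn(w)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves update, Equation 16
        d = -g_new + beta * d               # Equation 15
        g = g_new
    return w
```

The restart guard reflects the fragility noted in Section 11.1: when the line search is too inexact, the conjugacy assumption breaks down and the method must fall back on the plain gradient.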


11.1

Problems Regarding Conjugate Gradient

The Conjugate Gradient algorithm is based on two key assumptions which may fail when it is applied to an artificial neural network. These assumptions are:

1. The error function may be approximated as a multidimensional quadratic equation.
2. The error function has been minimised in all previous directions.

The first assumption should be valid for regions near the minimum; however, far away from the minimum it is highly unlikely that the network will retain quadratic characteristics. This is an important consideration, as it implies that Conjugate Gradient may converge to a local minimum or a saddle point instead of to the solution. The second assumption is more critical considering that an inexact line search is being used. As training a neural network is a computationally expensive process, the number of line search iterations was limited. This limit may result in step sizes that are not accurate enough to ensure that the function is indeed minimised in the current direction. If this is the case, it is possible that the Conjugate Gradient method will fail.

11.2

Second Order Properties of Conjugate Gradient

Classically the Conjugate Gradient method is seen as an extension of the Steepest Descent method and is not linked to Quasi-Newton methods. However, it is possible to link Conjugate Gradient methods to the Quasi-Newton methods which will be discussed in the next section. The reasoning behind explaining this link is to emphasise the power of Quasi-Newton methods. Due to the computationally expensive nature of neural network training, it is essential that all information carried in the output of the error function and its gradient is utilised as efficiently as possible.

Proof. Consider the Conjugate Gradient update:

W_{k+1} = W_k + α(−∇E_k + β_k d_{k−1})

Newton's method for finding the root of the gradient of a multidimensional function may be written as [4]:

x_{k+1} = x_k − [Hf(x_k)]^{−1} ∇f(x_k)

where H is the Hessian of the function. Newton's method may be modified to use a step smaller than unity as follows:

x_{k+1} = x_k − α[Hf(x_k)]^{−1} ∇f(x_k)

Consider the following transformation of the Conjugate Gradient weight update using the Fletcher-Reeves algorithm:

W_{k+1} = W_k + α(−∇E_k + β_k d_{k−1})

As β_k is a scalar:

= W_k + α(−∇E_k + d_{k−1} β_k)
= W_k + α(−∇E_k + d_{k−1} (∇E_k^T ∇E_k)/(∇E_{k−1}^T ∇E_{k−1}))
= W_k + α(−I + (d_{k−1} ∇E_k^T)/(∇E_{k−1}^T ∇E_{k−1})) ∇E_k

It is evident that the Conjugate Gradient method is attempting to approximate the inverse of the Hessian. However, by definition, for all twice continuously differentiable functions the Hessian and its inverse are symmetric, whereas the approximation term from Conjugate Gradient is clearly not symmetric, as the term d_{k−1}∇E_k^T results in a matrix that is not necessarily symmetric. Additionally, rewriting the Fletcher-Reeves update in this form shows that the current search direction is conjugate to the previous search directions, as it is simply an extension of Krylov subspaces: the term (−I + (d_{k−1}∇E_k^T)/(∇E_{k−1}^T ∇E_{k−1}))∇E_k may be seen as the creation of a new Krylov subspace, which by definition is conjugate to the previous search directions.

Hence it is logical to consider a method in which the Hessian update is consistent with the shape of the expected Hessian; one such method is the Quasi-Newton method.

12

Quasi-Newton Method

12.1

Introduction

The Quasi-Newton method is also a second order algorithm and operates under the same quadratic assumption as the Conjugate Gradient method; however, in Quasi-Newton methods the Hessian or its inverse is approximated directly using the secant method. The direction for any Quasi-Newton method is given below in Equation 18 [1].

d_k = −H_k^{−1} ∇E_k

(18)

As the update term uses the inverse of the Hessian, there are two methods which may be used to calculate the update step. These are:

1. Calculate the Hessian and solve a system of equations to obtain the inverse
2. Calculate the inverse of the Hessian directly

Calculating the Hessian is known as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approach [3]. Calculating the inverse of the Hessian directly is known as the Davidon-Fletcher-Powell (DFP) approach. Both approaches have their respective advantages and disadvantages; however, for this implementation the DFP approach was selected for one primary reason: as there are a large number of weights (variables) in this optimisation problem, calculating the inverse of the Hessian by solving a system of equations would significantly slow down the neural network training process. In addition, it is entirely possible that the Hessian is singular and thus not invertible. This may be bypassed by calculating the inverse of the Hessian directly using the DFP method.

Proof. Consider the secant equation rewritten for the inverse of the Hessian (B) from a second order Taylor series, where f(W) is the error function:

ΔW_k = B_k (∇f(W_k + ΔW_k) − ∇f(W_k)) = B_k Δ∇f_k

(19)

In addition, the change of B at each step may be written as:

ΔB_k = a b^T

As the inverse of the Hessian is symmetric, a and b should be the same. Hence the Hessian update at each step may take the following form:

B_{k+1} = B_k + ΔB_k

Rearranging Equation 19 yields an expression for a:

a = (1/(b^T Δ∇f_k)) ΔW_k − (1/(b^T Δ∇f_k)) B_k Δ∇f_k

(20)

This results in an update for the inverse Hessian of the following form:

B_{k+1} = B_k + ((ΔW_k − B_k Δ∇f_k) b^T) / (b^T Δ∇f_k)

(21)

In order to enforce the symmetry of the inverse Hessian it is necessary to choose b of the following form:

b = ΔW_k − B_k Δ∇f_k

Substituting the equation for b into Equation 21 results in the following final update equation for the inverse Hessian:

B_{k+1} = B_k + ((ΔW_k − B_k Δ∇f_k)(ΔW_k − B_k Δ∇f_k)^T) / ((ΔW_k − B_k Δ∇f_k)^T Δ∇f_k)

(22)

This is the DFP update with a slight modification where the symmetry is invoked.
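The update of Equation 22, together with the step of Equation 23 and the periodic identity reset discussed in Section 12.3, can be sketched as follows. A simple backtracking loop again stands in for the Armijo line search, and the safeguards (the denominator check and the reset on a non-descent direction) are illustrative additions, not part of the report.

```python
import numpy as np

def quasi_newton(f, grad_fn, w, n_iters=50, reset_every=5):
    """Quasi-Newton sketch using the inverse-Hessian update of Equation 22."""
    n = w.size
    B = np.eye(n)                        # approximate inverse Hessian
    g = grad_fn(w)
    for k in range(1, n_iters + 1):
        d = -B @ g                       # step direction, Equation 23
        f0, alpha = f(w), 1.0
        # backtracking line search standing in for the Armijo search
        while f(w + alpha * d) >= f0 and alpha > 1e-12:
            alpha *= 0.5
        if f(w + alpha * d) >= f0:       # d was not a descent direction: reset B
            B = np.eye(n)
            continue
        dw = alpha * d
        w = w + dw
        g_new = grad_fn(w)
        dg = g_new - g
        u = dw - B @ dg                  # Equation 22: symmetric rank-one correction
        denom = u @ dg
        if k % reset_every == 0 or abs(denom) < 1e-12:
            B = np.eye(n)                # periodic identity reset for robustness
        else:
            B = B + np.outer(u, u) / denom
        g = g_new
    return w
```

On a quadratic the approximation B converges to the true inverse Hessian in very few updates, which is where the superior convergence near the utopia point comes from.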

Using the DFP method for Quasi-Newton, the step direction at each update is given in Equation 23.

d_k = −B_k ∇E_k

(23)

Once again, similarly to the Steepest Descent method, the size of the step is calculated using the line search. This results in the following equation for the update of the weights at each step using the Quasi-Newton method.

W_{k+1} = W_k − α(B_k ∇E_k)

(24)

12.2

Advantages of Quasi-Newton

Unlike Conjugate Gradient, Quasi-Newton does not require very accurate step sizes in order to converge. In addition, Quasi-Newton has excellent convergence characteristics near the utopia point and does not 'zig-zag' around the solution like Steepest Descent or Gradient Descent. This is illustrated by Quasi-Newton's performance on Rosenbrock's function, pictured below in Figure 3.

[Figure: contour plot of Rosenbrock's function showing the utopia point and the Quasi-Newton training path]

Figure 3: Quasi-Newton Convergence


Clearly Quasi-Newton has convergence characteristics which are far superior to both Gradient Descent and Steepest Descent. Quasi-Newton is also the only gradient based algorithm considered here that has superlinear convergence.

12.3

Disadvantages and Modifications of Quasi-Newton

Quasi-Newton, being a gradient based solver, is susceptible to convergence to local minima or saddle points. In addition, the Hessian of a neural network will not be constant over the domain, as implied by the quadratic assumption in the derivation of Quasi-Newton. The effect of this may however be mitigated by resetting the Hessian after a few iterations, to ensure that the Hessian being considered is sensitive to the local gradients at the evaluation point. For the Quasi-Newton solver in this problem, the inverse of the Hessian was reset to the identity matrix every 5 iterations. This decision may in some cases slow the progression of the solver but significantly improves its robustness. Another popular modification of Quasi-Newton for neural networks is the Marquardt modification, also known as the Levenberg-Marquardt algorithm. In this case the inverse Hessian's contribution is dampened by adding a weighted identity matrix to the B_k term.

13

Genetic Algorithm Differential Evolution

13.1

Genetic Algorithm Overview

Genetic Algorithms are algorithms which mimic the natural process of evolution. They start with an initial population, which in this case is a population of weight parameter vectors. This population is then evaluated and each member is assigned a fitness parameter, based on how well the given member minimises the error of the neural network. Once each member of the population has been assigned a fitness, they begin to reproduce. During reproduction, fitter members of the population are more likely to reproduce and create offspring. The offspring then undergo mutations, receiving donor weight parameters from their parents. The mutated offspring have their fitness evaluated and, if they are fitter than their donor parent, they replace this parent. This process continues for a specified number of iterations; once completed, all members of the final population undergo a validation fitness evaluation, after which the member with the highest validation fitness is selected as the solution.


13.2

Discussion of Implemented Genetic Algorithm

Genetic Algorithms (GA) are traditionally written for binary populations, and consequently it becomes very difficult to convert these algorithms to a continuous domain without performance losses or having certain aspects become redundant. To this end, the algorithm implemented here is a hybrid between a traditional Genetic Algorithm and Differential Evolution (DE). Although the algorithm presented in this section is not a carbon copy of either a GA or a DE algorithm, it is based on both. It is primarily based on the DE presented in [1], but with a more GA-heavy implementation. From an optimisation point of view, however, this is a Differential Evolution algorithm.

13.3

Comparison Between Genetic Algorithm and Differential Evolution

Both DE and GA are based on the same paradigm. Both algorithms start with a large initial population whose members are all potential solutions to the backpropagation problem, and both use crossover to generate new population members. However, traditionally only the GA mutates the new population member; mutation is also included in this adaptation of the algorithm. Hence the algorithm considered here is a form of Genetic Algorithm.

13.4

Genetic Algorithm Implementation

The Genetic Algorithm presented in this report is essentially a five step process:

1. Create an initial population
2. Assign fitnesses to the population
3. Reproduce
4. Mutate
5. Natural selection

13.4.1

Initial Population

The initial population is created by randomising weights for a population size five times larger than the dimensionality of the problem. This large population is beneficial as it results in far larger coverage of the search space.


13.4.2

Assign Fitnesses

The fitness of each member of the population is calculated by running the neural network using its weights. The error function value for each weight vector is recorded and the fitness F of each population member i is assigned as follows:

F_i = (1 − ε)E_max − E_i

(25)

where ε is a small number (1 × 10⁻¹⁰) used to prevent a division by zero for the weakest population member.

Once the fitness of each member has been calculated, each member is given a reproductive parameter, which is the given member's probability of reproducing. This reproductive parameter is given as:

R_i = F_i / Σ_{j=1}^{N_pop} F_j

(26)

13.4.3

Reproduction

In order to perform reproduction, three parent designs (W_1, W_2, W_3) are chosen from the population based on the probabilities of the reproductive parameter. These three designs are then combined using Equation 27 to form a new offspring design W_V.

W_V = W_1 + κ(W_2 − W_3)

(27)

where κ is simply a chosen parameter.

13.4.4

Mutation

The offspring design W_V then undergoes a mutation process with a fourth donor design W_p, which is also randomly selected based on the reproductive probability. Each characteristic (individual weight) of the offspring design may be randomly mutated into the respective weight of the donor through the process of Equation 28, where W_U is the final mutated offspring design. Note that Equation 28 switches to indicial notation.

W_j^U = W_j^V if r_j < C_r, otherwise W_j^U = W_j^p;  j = 1 : n

(28)

where:
r_j is a random value between 0 and 1
C_r is the mutation rate


13.4.5

Natural Selection

At the selection step the fitness F_U of the new population member is evaluated and, if it is higher than the fitness F_p of the donor, the new member replaces the donor in the population.

13.5

Genetic Algorithm Summary

The Genetic Algorithm presented in this section is given below in algorithmic form.

Result: Returns: champion member
Initialise: select values for C_r and κ; select a maximum number of iterations k_max; generate an initial population; assign fitness to all members of the population; generate the reproductive parameter for all members
while k < k_max do
    select three parents and a donor based on the reproductive parameter
    create an offspring design using Equation 27
    mutate the offspring with the donor using Equation 28
    evaluate the fitness F_U of the mutated offspring
    if F_U > F_P then
        replace the donor member with the mutated offspring
    else
        reject the mutated offspring
    end
    k = k + 1
end
Evaluate the validation fitness of all members; the member with the highest validation fitness is the champion member

Algorithm 4: Genetic Algorithm
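Algorithm 4 can be sketched in Python as follows. This is an illustrative implementation: the values of κ, C_r and the iteration count are arbitrary example choices, the champion is selected by training error only (the validation fitness step is omitted), and the tiny negative fitness that Equation 25 assigns to the weakest member is clipped to zero before the probabilities of Equation 26 are formed.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_de(error_fn, n_weights, pop_size=None, n_iters=300, kappa=0.5, cr=0.9):
    """Sketch of the hybrid GA / Differential Evolution of Algorithm 4."""
    pop_size = pop_size or 5 * n_weights          # population five times the dimensionality
    pop = rng.uniform(-1.0, 1.0, (pop_size, n_weights))
    errors = np.array([error_fn(m) for m in pop])
    eps = 1e-10
    for _ in range(n_iters):
        # fitness (Equation 25) and reproductive probabilities (Equation 26)
        fitness = (1 - eps) * errors.max() - errors
        probs = np.clip(fitness, 0.0, None)       # clip the weakest member's tiny negative
        probs = probs / probs.sum()
        # choose three parents and a donor by reproductive probability
        p1, p2, p3, donor = rng.choice(pop_size, size=4, replace=False, p=probs)
        child = pop[p1] + kappa * (pop[p2] - pop[p3])     # reproduction, Equation 27
        mask = rng.random(n_weights) < cr                 # mutation, Equation 28
        trial = np.where(mask, child, pop[donor])
        e_trial = error_fn(trial)
        if e_trial < errors[donor]:                       # natural selection
            pop[donor], errors[donor] = trial, e_trial
    return pop[np.argmin(errors)]
```

Because a trial design only ever replaces a donor it beats, the best training error of the population is non-increasing over the iterations.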

13.6

Benefits of Genetic Algorithm

The Genetic Algorithm is a global search method, which implies that it is able not only to locate local minima but also to converge to global minima. This is exceptionally useful, as it may result in significantly better solutions than those of solvers which can only guarantee convergence to local minima.


13.7

Drawbacks of Genetic Algorithm

One of the most powerful aspects of a neural network is the gradient information which is available via backpropagation. As the Genetic Algorithm does not consider gradient information, it will take significantly more iterations than gradient based solvers, as the only information it is privy to is the function values. The non-utilisation of gradient information is exceptionally bad practice, as any minimisation is essentially setting the gradient to zero. With this considered, it may be an interesting investigation to construct a genetic algorithm which aims to minimise the gradient available via backpropagation and simply uses function values to identify whether the solution is indeed a minimum and not a maximum or saddle point; however, such an investigation is not within the scope of this report.

The Genetic Algorithm also depends on a severe amount of randomness, especially when considering which members of the population are selected as parents and donors. This randomness may result in certain members of the population never even being considered for reproduction or mutation; a consequence is that valuable computational time and memory is used to 'carry' these members of the population. A solution may be to alter the reproductive parameter in such a manner that older population members become more likely to mutate. This would result in a stronger overall population when the validation step is performed, which is desirable, as the member of the population with the highest training set fitness may not necessarily have the highest validation set fitness. It is much more useful to have many population members with high fitness than to have a few with exceptionally high fitnesses.

14

Particle Swarm Optimisation

Particle Swarm Optimisation (PSO) is a global search algorithm based on the motion of particles in a swarm, where the swarm itself may be seen as an organism. The goal of the swarm is to search effectively for resources, and the location of maximum resources is analogous to the location of the minimum validation error of the neural network.

14.1

Overview

The PSO algorithm attempts to mimic the behaviour of any swarm based animal, e.g. fish. In this swarm each individual has its own search agenda but also observes the behaviour of the other members of the swarm, adapting its search depending on their results. As soon as one member finds a point lower than all the other members have found, the other members will flock towards it.

The particle swarm algorithm has two design parameters: a social parameter and a cognitive parameter, which together dictate how each particle behaves. A larger cognitive parameter induces a bias towards the particle's local best, and a larger social parameter induces a bias towards the global best of the swarm. The selection of these parameters is based on the requirements of the algorithm: whether it is expected to converge to minima or whether it is used as a search tool.

14.2

Particle Swarm Implementation

The implementation of the PSO algorithm presented here is based on the algorithm presented in [1]; however, there are two changes: the random parameter used when calculating the velocities is a vector rather than a scalar, and a reset step has been added.

Step 1: Initialise the Swarm
Generate a random swarm between −0.1 and 0.1 with two times as many particles as the dimensionality of the weights; this was the largest size that would still converge within a reasonable amount of time. Evaluate the error of each particle and record the best solution as the global best W_G. Set the current weight vector of each particle as its temporary particle best W_P.

Step 2: Calculate Particle Velocities
For each particle i in the swarm, calculate its velocity as:

v_{i,k+1} = v_{i,k} + c_1 r_1 (W_P − W_i) + c_2 r_2 (W_G − W_i)

(29)

where:
r_1, r_2 are random vectors
c_1 is the cognitive parameter (selected between 1 and 4)
c_2 is the social parameter (selected between 1 and 4)

Step 3: Update Particle Positions
Update the position of each particle as:

W_{i,k+1} = W_{i,k} + v_{i,k+1}

(30)

Step 4: Calculate New Error Values
For each particle, calculate the error value of the new weights and perform the following checks:

If f(W_{i,k+1}) ≤ f(W_P), then W_P = W_{i,k+1}
If f(W_{i,k+1}) ≤ f(W_G), then W_G = W_{i,k+1}

(31)

Whenever a new global best is found, it should also be evaluated on the validation set.

Step 5: Reset Step
Every 5 iterations, for all particles set:

v_{i,k} = 0
W_{i,k} = random(−1 : 1)

(32)

Step 6: Stopping Criterion
Check whether the maximum number of iterations has been reached or whether a stopping criterion has been triggered. If not, set k = k + 1 and repeat from Step 2.
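Steps 1 to 6 can be sketched as follows. The swarm size, the cognitive and social parameters and the reset interval are illustrative example values within the ranges given above; the validation evaluation of each new global best is omitted for brevity, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def pso(error_fn, n_weights, n_particles=None, n_iters=50,
        c1=2.0, c2=2.0, reset_every=5):
    """Sketch of the PSO of Section 14 with vector randomness and a reset step."""
    n_particles = n_particles or 2 * n_weights
    W = rng.uniform(-0.1, 0.1, (n_particles, n_weights))       # Step 1
    V = np.zeros_like(W)
    pbest = W.copy()
    pbest_err = np.array([error_fn(x) for x in W])
    i_best = int(np.argmin(pbest_err))
    gbest, gbest_err = pbest[i_best].copy(), pbest_err[i_best]
    for k in range(1, n_iters + 1):
        r1, r2 = rng.random(W.shape), rng.random(W.shape)      # random vectors, not scalars
        V = V + c1 * r1 * (pbest - W) + c2 * r2 * (gbest - W)  # Equation 29
        W = W + V                                              # Equation 30
        for i in range(n_particles):                           # Step 4
            e = error_fn(W[i])
            if e <= pbest_err[i]:
                pbest[i], pbest_err[i] = W[i].copy(), e
                if e <= gbest_err:
                    gbest, gbest_err = W[i].copy(), e
        if k % reset_every == 0:                               # Step 5: reset
            V = np.zeros_like(W)
            W = rng.uniform(-1.0, 1.0, W.shape)
    return gbest
```

Note that the global best survives each reset, so the reset only re-seeds the search without discarding the best solution found so far.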

14.3

Swarm Parameter Discussion

The particle swarm algorithm implemented in this report is swarm size dominant, as opposed to iteration number dominant. The reason for this is that the implementation assumes that the objective function is heavily multimodal. When small swarm sizes are used, the particle swarm algorithm is much more likely to converge quickly to a local minimum than to search a larger space and converge to a global minimum. With this in mind, the swarm size was selected to be twice as large as the number of weights, which was the largest reasonable swarm size for which computations would complete in a reasonable time. In contrast to the large swarm size, the maximum number of iterations was limited to 50. In order to further enhance the searching characteristics of the particle swarm, the velocities and positions were reset every 5 iterations. Although these parameters may seem counter-intuitive, the reasoning behind them is simple: particle swarm was selected in this report as a global optimisation algorithm. If the parameters were set to obtain local minima convergence, then the application of particle swarm would simply be a waste of time, as there are significantly more powerful local convergence algorithms.

14.4

Benefits of Particle Swarm

The primary benefit of PSO is its ability to converge to global minima. Unlike the gradient based algorithms which have been considered PSO spans a very large area of the search space and thus is able to locate the global minimum solution and converge to it.

14.5

Drawbacks of Particle Swarm

PSO algorithms are notoriously slow, as they require many more function evaluations than most other optimisation algorithms. This is due to the fact that at each iteration the objective function must be evaluated for each particle, which results in a significant amount of computational expense. Like the GA, the PSO also does not consider any gradient information.

15

Hybrid Methods

15.1

Reasoning

In the previous sections two types of algorithms were considered, namely global search algorithms (GA, PSO) and local search algorithms (gradient based algorithms). Both of these classes of algorithms have their own respective advantages and disadvantages when considered in the scope of neural network training, which are summarised below in Table 1. For the local strategy, all statements refer to Quasi-Newton, which is expected to be the best performing local algorithm.

Table 1: Comparison between local and global strategies

                 Global Strategies                             Local Strategies
Advantages       Finds global minima                           Good methods converge quickly
                 Good explorative properties                   Very inexpensive to train
                 Performs well with noisy problems             Utilises gradient information
                 Invariant to the continuity of the function   Deterministic
Disadvantages    Slow convergence to exact minima              May converge to maxima
                 Very computationally expensive                Poor explorative properties
                 Disregards gradient information               Sensitive to continuity
                 Highly stochastic

From Table 1 it is clear that the strengths of each strategy largely cover the weaknesses of the other. Hence it is a logical, if somewhat unconventional, step to couple the local and global strategies into a hybrid algorithm which exploits the explorative properties of the global algorithms, while utilising the fast convergence of the local algorithms once the region of the global minimum has been found.

15.2 Implementation

A simple approach to coupling global and local algorithms is to run the global algorithm for a reduced number of iterations and use its output as the starting point of the local algorithm. This coupling was performed and the results thereof are presented in the experimental section of the report.
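A minimal sketch of this two-stage coupling, assuming SciPy is available. Random sampling stands in here for the truncated global run (any global optimiser returning a candidate point would do), and SciPy's BFGS serves as the quasi-Newton local stage; the function name and all parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_optimise(objective, dim, n_global=200, bounds=(-5.0, 5.0), seed=0):
    """Two-stage hybrid: a cheap global stage locates a promising basin,
    then a quasi-Newton (BFGS) stage converges quickly to its minimum."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    # Global stage: sample broadly and keep the best candidate.
    candidates = rng.uniform(lo, hi, (n_global, dim))
    scores = np.array([objective(c) for c in candidates])
    x0 = candidates[np.argmin(scores)]
    # Local stage: refine the global candidate to an exact minimum.
    result = minimize(objective, x0, method="BFGS")
    return result.x, result.fun
```

The division of labour mirrors Table 1: the global stage supplies exploration and robustness to multimodality, while the local stage supplies the fast, inexpensive convergence the global stage lacks.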

16 Stopping Criteria

16.1 Introduction

When a neural network is trained, a decrease in the training set error does not necessarily result in a decrease in the test set or validation set error; this occurs when the network begins to overfit the training set. To prevent this it is necessary to construct a stopping criterion which considers both the training set and the validation set errors. One such stopping criterion is known as PQ2 [8].

16.2 PQ Intuition

The PQ2 stopping criterion essentially quantifies the improvement of the error on the training set relative to the change of the error on the validation set. The rule specified by PQ2 may be vocalised as: “When the validation error is increasing by twice as much as the training error is decreasing, stop.”
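The vocalised rule can be sketched as a simple check on the two error histories. The window length, the exact quantities compared and the function name below are illustrative assumptions, not the formal PQ2 definition from [8]; only the factor-of-two threshold comes from the vocalised rule.

```python
def pq_stop(train_errors, val_errors, window=5, threshold=2.0):
    """Sketch of the vocalised PQ rule: stop when the relative rise in
    validation error exceeds `threshold` times the relative fall still
    being achieved on the training error over the last `window` epochs."""
    if len(val_errors) < window or len(train_errors) < window:
        return False
    # Relative rise of the validation error over its best value so far.
    val_rise = val_errors[-1] / min(val_errors) - 1.0
    # Relative progress still being made on the training error.
    recent = train_errors[-window:]
    train_progress = sum(recent) / (window * min(recent)) - 1.0
    if train_progress <= 0.0:
        # No training progress at all: stop as soon as validation worsens.
        return val_rise > 0.0
    return val_rise / train_progress > threshold
```

Taking the ratio of the two quantities, rather than the validation rise alone, avoids stopping during noisy phases where the validation error jitters upward while the training error is still falling rapidly.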

16.3 PQ Criteria

The PQ2 stopping criterion is expressed formally in Equation 33, in terms of the generalisation loss GL