
© 2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

How to Parallelize Cellular Neural Networks on Cluster Architectures

Thomas Weishäupl and Erich Schikuta
Department of Computer Science and Business Informatics, University of Vienna
Rathausstraße 19/9, A-1010 Vienna, Austria
{thomas.weishaeupl, erich.schikuta}@univie.ac.at

Abstract

In this paper we present "Rules of Thumb" for the efficient and straightforward parallelization of Cellular Neural Networks (CNNs) processing image data on cluster architectures. The rules result from the application and optimization of the simple but effective structural data parallel approach, which is based on the SPMD model. Digital gray-scale images were used to evaluate the optimized parallel cellular neural network program. The parallelization process employs HPF to generate an MPI-based program.

1 Introduction

A CNN represents a simple mathematical model of a massively parallel computing architecture defined in discrete N-dimensional spaces. Basically, a CNN can be seen as an array of identical dynamical systems, the cells, which are locally connected. CNNs have already been applied to a wide variety of important tasks in signal processing, image processing (filtering, edge detection, character and object recognition, etc.), analysis of 3-D complex surfaces (minima/maxima detection, etc.), and the solution of partial differential equations (see [3] for a survey). Due to the parallelism of the architecture, CNNs can be applied to problems where traditional methods cannot deliver the required throughput (such as video signal processing). As a consequence, CNNs have received strong acceptance from the research community in recent years. The CNN architecture combines the idea of massively parallel analog processing, typical for fully connected neural networks, with the local interaction between cells that characterizes cellular automata. Each cell is connected only to its neighboring cells, and only these interact directly with each other. Cells outside the immediate neighborhood influence each other indirectly, through the propagation effects of the network dynamics.

The underlying network paradigm of the presented parallel CNN simulation approach is a discrete-time version of Chua and Yang's cellular neural network model [2]. This is a first-order system with linear instantaneous connections. This model proved well suited to parallelization on classical supercomputers (see [4]). In [6] we presented parallelization results of CNNs on cluster architectures. Based on the experience gained in that work and on further evaluation runs of CNN algorithms, we developed "Rules of Thumb" for the efficient and straightforward parallelization of cellular neural networks (CNNs) running on cluster architectures, by evaluating an HPF-generated, MPI-based CNN program on two cluster architectures. These rules are explained and justified by a detailed performance analysis in the following sections. The rules are equally applicable to the parallelization of other neural network paradigms that are based on a regular problem scheme.
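For reference, a commonly used discrete-time form of the Chua-Yang model is sketched below. The exact cell equation we use is Formula 1 of [6]; the notation here is the standard textbook formulation and may differ in details:

    x_{ij}(n) = \sum_{(k,l) \in N_r(i,j)} A_{kl}\, y_{i+k,\,j+l}(n-1)
              + \sum_{(k,l) \in N_r(i,j)} B_{kl}\, u_{i+k,\,j+l} + I,
    \qquad y_{ij}(n) = f(x_{ij}(n)), \qquad f(x) = \tfrac{1}{2}(|x+1| - |x-1|),

where A and B are the feedback and control templates, u is the (constant) input, I is a bias term, and N_r(i,j) is the neighborhood of radius r around cell (i,j). The arrays u, x and y in this formulation are exactly the data structures that drive the parallelization described below.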

2 Parallelization of neural networks

Generally, a number of problem decomposition techniques can be used to construct a distributed cluster application: pipeline or dataflow decomposition, functional decomposition, and data-parallel decomposition. For the underlying work we used the Structural Data Parallel (SDP) approach [5], which is based on data-parallel decomposition, for the parallel simulation of the CNN. This approach was developed to make parallel neural network simulation as simple as sequential simulation. It allows users who are inexperienced in high-performance computing to increase the performance of neural network simulation through parallelization. The approach is based on the Single-Program-Multiple-Data (SPMD) programming model and utilizes the highly specialized programming capabilities of high-performance languages for the parallelization process. The structural data parallel approach consists of four steps, which lead programmers from a sequential description to a parallel and efficient implementation of the neural network:

• Sequential coding of the neural network
• Data structure identification
• Distribution schema selection
• Automated parallel code generation and analysis

From our point of view a CNN is an N-dimensional regular array of elements. We identify three correlated arrays: the input array (u in Formula 1 of [6]) and two output arrays, one for the cell states at time n (result x) and one for the cell states at time n-1 (array y). Every array element represents one node of the cellular neural network. Following the SPMD approach, these three arrays are the structures in focus for the parallelization; a minimal sketch of the corresponding data structures and of one sequential update step is given after the following list. Applying the SPMD approach, we identified the following problem areas:

• workload distribution
• latency of communication (handshake occurrence)
• local cache misses
• communication throughput (message size, buffering, and communication cache size)
• locality of data (no computation flow interruptions by communication)
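The following is a minimal sequential Fortran sketch of the three correlated arrays and of one discrete-time update step (SDP steps one and two). Array names, the 3x3 neighborhood, the template values, and the clamping output function are illustrative assumptions only; the exact cell equation is Formula 1 of [6].

    ! Sequential sketch (illustrative): u = input image, x = cell states at
    ! time n, y = cell outputs at time n-1. These are the three correlated arrays.
    program cnn_sequential_sketch
      implicit none
      integer, parameter :: rows = 1200, cols = 1600     ! "small" image size
      real, dimension(rows, cols) :: u, x, y
      real, dimension(-1:1, -1:1) :: a_tmpl, b_tmpl      ! feedback and control templates
      real :: bias
      integer :: i, j, k, l, n

      u = 0.0; x = 0.0; y = 0.0                          ! placeholder data
      a_tmpl = 0.0; b_tmpl = 0.0; bias = 0.0

      do n = 1, 10                        ! a few discrete time steps
        do j = 2, cols - 1                ! outer loop over columns,
          do i = 2, rows - 1              ! inner loop down each column (column-major)
            x(i, j) = bias
            do l = -1, 1
              do k = -1, 1
                x(i, j) = x(i, j) + a_tmpl(k, l) * y(i + k, j + l) &
                                  + b_tmpl(k, l) * u(i + k, j + l)
              end do
            end do
          end do
        end do
        y = max(-1.0, min(1.0, x))        ! piecewise-linear output function
      end do
    end program cnn_sequential_sketch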

Figure 1. Viable distributions for image data over e.g. 8 processors: tile (BLOCK, BLOCK) (A), row-wise (BLOCK, *) (B), and column-wise (*, BLOCK) (C). The keyword BLOCK denotes the HPF syntax for specifying the data distribution layout.

In the next section we describe the development of a configuration of influencing factors that handles these problem areas optimally. We show the influence of the different factors on the execution times by comparing the optimal configuration with variations of the program that handle the problems only suboptimally.

3 Five "Rules of Thumb" for CNN parallelization

To deal with the problems mentioned above we have to adapt our CNN program in the following areas:

• data distribution balance, by deploying equally sized data partitions on the processors,
• alignment of the correlated elements of the different arrays,
• array element computation according to the alignment in memory (column-wise or row-wise), depending on the programming environment used,
• schema of data distribution (see Figure 1) along the computational work-flow,
• overlap along the array partition border elements, and
• adjustment of the rectangular array (image rotation).

By defining influencing factors we obtain a clear approach to finding and justifying our "Rules of Thumb". Based on these factors we performed the analysis in three phases. In the first phase we sketch a "quasi-optimal" algorithm by intuitive analytical considerations. In the second phase we evaluate different CNN program configurations by focusing on single factors and analyzing the absolute deviations of the execution time. In the third phase we compare the speedup differences considering the distribution factor only.

We define the following factors, which influence the absolute execution time of the CNN program on the cluster architecture:

Deployment factor, considering the load balancing between the processors and the alignment of the elements of the different arrays.

Loop factor, regarding the array computation along the internal memory alignment.

Distribution factor, considering the distribution scheme of the array data onto the compute nodes with respect to the computational work-flow.

Overlap factor, regarding the influence of the overlap along the array partition (block) borders.

Adjustment factor, concerning the dependencies resulting from the adjustment of the rectangular array (image rotation).

A research cluster and a production cluster were the basis for our work; both are of Beowulf-type cluster architecture. The research cluster is the gescher system (http://gescher.vcpc.univie.ac.at), consisting of sixteen compute nodes, each with one Pentium 4 3.06 GHz processor and 2 GB of dual-channel DDR RAM (FSB 266 MHz). The compute nodes are connected by a Gigabit Ethernet interconnect. As production cluster the schroedinger2 system (http://www.univie.ac.at/nic) was used. It is the cluster for numerically intensive applications of the University of Vienna and is ranked in the TOP500. This cluster consists of 192 nodes, each with one Pentium 4 2.53 GHz processor and 1 GB DDR RAM, connected by Gigabit Ethernet. The CNN algorithm was coded using the PGI pghpf compiler. Our CNN algorithm uses image data of size 1600x1200 pixels and 3200x2400 pixels as input. For the small image we used a standard digital camera image with a resolution of 1600x1200 pixels (see [6]).

3.1 Phase 1: Quasi-Optimal CNN Algorithm

Based on the identified factors we were able to devise a quasi-optimal algorithm for the parallel execution of CNNs, which forms the basis for the deviation tests in the second phase of the analysis. Regarding the deployment factor, the optimal algorithm divides every array into equally sized partitions (blocks). Every processor gets one of these partitions, and the corresponding partitions of the other arrays are aligned accordingly, so that all elements of the different arrays with the same index reside on the same processor. This deployment yields a balanced processing workload and minimized communication costs (less total latency). Because we use HPF directives, our distribution schemes are uniform; thus we reach the quasi-optimal configuration for the deployment factor automatically.

Rule 1: Distribute data uniformly. Every partition (block) of the data arrays should have the same size on every processor. This results in a well-balanced workload.
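As an illustration of Rule 1 (a sketch of how such a directive could look, not the exact code used for the measurements), a BLOCK distribution in HPF splits an array into nearly equally sized contiguous partitions, one per processor:

    real, dimension(1200, 1600) :: x
    !HPF$ PROCESSORS procs(8)                ! e.g. eight single-processor nodes
    !HPF$ DISTRIBUTE x(*, BLOCK) ONTO procs  ! equally sized column blocks of x, one per processor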

Rule 2: Align related data. It is important for CNNs that all elements of the matrices x, y and u with the same index are mapped onto the same processor (i.e., are aligned with each other). This minimizes the communication costs, because the template matrices of a CNN imply communication with the direct neighborhood cells only.

Considering the loop factor, we face two possible ways of array computation: column-wise and row-wise. In our case we expect the best results with column-wise computation; the reason is fewer cache misses due to the array alignment in memory. In the second phase we measure the performance losses incurred by row-wise array computation.

For the distribution factor we evaluate the three distribution schemes (see Figure 1). We expect the best results from the column-wise distribution, because of the computational work-flow and the optimized message-buffer usage; thus the communication throughput is optimized.

For the overlap factor we measure the differences between three cases: overlap-size zero, overlap-size one, and overlap-size two. Along the array block borders, the compiler localizes a portion of a computation prior to the computation that would otherwise require communication during the computation. We expect the best result with overlap-size one, because the communication of the CNN algorithm exceeds the array borders by exactly one element; the size of the template matrices A and B determines the overlap-size.

The adjustment factor influences the length of the communication border region of the array partitions. Since we use a column-wise distribution schema, horizontal adjustment ensures a minimal communication border region.

Based on the above intuitive analytical considerations we assume that the optimal CNN program shows the following characteristics:

• equally sized array partitions and adequate alignment of the arrays by HPF
• column-wise loop execution
• column-wise (*, BLOCK) distribution schema
• overlap-size one
• horizontal adjustment

In the next section we give a justification for these findings.
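Putting these characteristics together, a hedged HPF fragment of the quasi-optimal configuration could look as follows. Array names, sizes and the processor count are illustrative; SHADOW is the HPF 2.0 approved extension for specifying overlap areas and may be expressed differently depending on the compiler.

    real, dimension(1200, 1600) :: u, x, y     ! landscape storage: the short first dimension
                                               ! keeps the (*, BLOCK) border regions small
    integer :: i, j
    !HPF$ PROCESSORS procs(8)
    !HPF$ DISTRIBUTE x(*, BLOCK) ONTO procs    ! column-wise distribution, equal block sizes
    !HPF$ ALIGN u(i, j) WITH x(i, j)           ! co-locate corresponding elements of u, x, y
    !HPF$ ALIGN y(i, j) WITH x(i, j)
    !HPF$ SHADOW y(0, 1)                       ! overlap of size one along the column-block borders

    do j = 2, 1599            ! column-wise loop execution:
      do i = 2, 1199          ! the inner loop follows Fortran's column-major layout
        ! ... cell update of x(i, j) from y and u ...
      end do
    end do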

3.2 Phase 2: Absolute Execution Time Comparison

The quasi-optimal CNN program with the above stated characteristics forms the basis for the following heuristic analysis.

Table 1. Factors for small images (1600x1200 pixels) with different numbers of processors (execution time as a multiple of the quasi-optimal configuration)

schroedinger2
n. proc.   loop row-wise   dist. row-wise   dist. tile
1          5.21            3.22             2.65
2          4.64            1.96             2.71
4          1.78            2.02             2.80
8          1.33            2.11             2.48
16         1.20            1.80             1.70

gescher
n. proc.   loop row-wise   dist. row-wise   dist. tile
1          6.55            2.51             2.86
2          5.08            2.95             2.76
4          2.12            2.74             2.83
8          1.29            2.32             2.70
16         1.15            1.96             2.53

Table 2. Factors for large images (3200x2400 pixels) with different numbers of processors (execution time as a multiple of the quasi-optimal configuration)

schroedinger2
n. proc.   loop row-wise   dist. row-wise   dist. tile
1          7.10            4.20             2.64
2          5.98            5.39             2.63
4          4.93            5.45             2.71
8          1.88            5.22             3.00
16         1.41            9.54             5.17

gescher
n. proc.   loop row-wise   dist. row-wise   dist. tile
1          7.69            5.61             2.76
2          6.53            6.26             2.87
4          5.11            6.10             2.87
8          2.31            5.67             3.05
16         1.38            10.53            6.95

We see four sensitive factors for the evaluation of the program. We therefore changed each of the four factors separately and measured the influence on the execution times. The sensitivity factors are (the quasi-optimal setting is listed first in each case):

• loop execution mode (column-wise, row-wise)
• distribution scheme (column-wise, row-wise, tile)
• overlap size (one, zero, two)
• array adjustment (horizontal, vertical)

On the production and the research cluster we used up to 16 single-processor nodes. All presented data are average values of 25 runs. We give the deviations of the single experiments from the quasi-optimal case as multiples of the absolute execution time of the quasi-optimal program. Table 1 shows the results for the changed loop execution and the changed distribution schema for "small" images on both clusters, and Table 2 for "large" images.

Using row-wise instead of column-wise array computation in the loops results in execution times that are worse by a factor of 1.2 to 2.1 for small images and 1.4 to 4.9 for large images (using more than three processors). The row-wise distribution schema produces 1.8 to 2.7 times longer execution times than the column-wise distribution for small images and 5 to 9.5 times longer for large images (using more than three processors). The tile distribution schema exceeds the execution times of the column-wise distribution by a factor of 1.7 to 2.8 for small images and 2.7 to 5.1 for large images (using more than three processors).

Based on these results it can be stated that the loop factor and the distribution factor are the defining factors for the absolute execution times.

Rule 3: Compute arrays along the memory alignment. The computational work-flow for computing two-dimensional arrays is determined by the programming language used to implement the CNN algorithm. By adapting the array computation accordingly, cache misses are avoided. The memory alignment of multidimensional arrays is defined by the programming language specification: Fortran uses column-major order; C, C++ and Java use row-major order.

Rule 4: Choose a distribution schema supporting the computational workflow. With the programming language architecture in mind (see Rule 3), it is also important to distribute the data along the computational workflow. In our case, column-wise distribution yields better communication throughput.
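To illustrate Rule 3 with a generic Fortran loop nest (not the actual CNN kernel): in the column-wise variant the inner loop runs over the first index, so consecutive iterations touch adjacent memory locations; in the row-wise variant every iteration strides over a whole column, which causes the cache misses reflected in the loop factor of Tables 1 and 2.

    integer, parameter :: rows = 1200, cols = 1600
    real :: x(rows, cols)
    integer :: i, j
    x = 0.0

    ! Column-wise computation (inner loop over the first index;
    ! matches Fortran's column-major storage, memory is traversed contiguously):
    do j = 1, cols
      do i = 1, rows
        x(i, j) = x(i, j) + 1.0
      end do
    end do

    ! Row-wise computation (inner loop over the second index;
    ! each access strides over a whole column, causing cache misses):
    do i = 1, rows
      do j = 1, cols
        x(i, j) = x(i, j) + 1.0
      end do
    end do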

Table 3 shows the results for varying overlap-size and rotational adjustment of "small" images on both clusters; Table 4 presents similar results for "large" images. Using overlap-size zero instead of overlap-size one gives a factor of 0.9 to 1.1 of the execution time for small images and about 1.0 for large images (using more than three processors), i.e. no apparent influence; with overlap-size two we obtained similar results. Changing the adjustment of the array from horizontal to vertical had no effect either: the absolute execution times are 0.9 to 1.0 times those of the optimal program for small images and about 1.0 for large images. The presented data show that the overlap factor and the adjustment factor have only minor influence on the absolute execution time of the CNN program.

Table 3. Factors for small images (1600x1200 pixels) with different numbers of processors (execution time as a multiple of the quasi-optimal configuration)

schroedinger2
n. proc.   overlap 0   overlap 2   adjust. vertical
1          1.00        1.00        0.99
2          1.00        1.00        1.01
4          0.99        1.02        1.00
8          0.94        0.95        0.96
16         0.89        0.99        0.98

gescher
n. proc.   overlap 0   overlap 2   adjust. vertical
1          1.01        1.00        1.00
2          0.97        0.99        0.99
4          0.98        0.97        0.91
8          0.88        0.88        0.88
16         1.11        1.01        0.87

Rule 5: Forget overlapping and adjustment (almost). Overlapping [1] and the adjustment of the arrays have only minor influence on the execution time of the CNN algorithm. Depending on the size of the array, the effect can be positive or negative. The overlap-size is affected by the locality of the data elements: along the array block borders, the compiler localizes a portion of a computation prior to the computation that would otherwise require communication during the computation. Because of the uniform data structure, computation and communication proceed approximately synchronously. The adjustment (landscape or portrait) of an array changes the number of elements in the communication border region. For small arrays, more border region elements produce a better throughput by reducing the influence of latency on the communication time. Larger arrays are processed faster by minimizing the communication border region, because of the smaller communication data size.

Table 4. Factors for large images (3200x2400 pixels) with different numbers of processors (execution time as a multiple of the quasi-optimal configuration)

schroedinger2
n. proc.   overlap 0   overlap 2   adjust. vertical
1          0.99        1.00        0.99
2          0.99        0.99        1.02
4          0.98        0.98        0.99
8          1.01        0.98        1.01
16         1.00        1.04        1.06

gescher
n. proc.   overlap 0   overlap 2   adjust. vertical
1          1.00        1.00        1.00
2          1.00        1.00        1.00
4          1.00        0.99        0.98
8          1.01        1.01        1.04
16         1.02        1.05        1.12

3.3 Phase 3: Speedup Analysis for the Distribution Factor

Because of the dependency between the distribution factor and the communication time in cluster environments (in contrast to the loop factor), we made a speedup analysis for the different distribution schemes (see Figure 1). In the course of the analysis we obtained the speedup by comparing the measured execution times of the parallelized program running on a number of nodes with the execution time of the serial program on a single node. Thus we also obtained the absolute overhead of the parallelization.
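In symbols, with T_1 the execution time of the serial program and T_p the execution time of the parallel program on p nodes, the reported speedup is

    S(p) = \frac{T_1}{T_p}.

A natural reading of the overhead percentages quoted below (the formula is not spelled out explicitly and this is only one possible interpretation) is the lost parallel efficiency, 1 - S(p)/p, i.e. the share of the aggregate compute time spent on communication and parallelization infrastructure rather than on useful work.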

Figure 2. Speedup on schroedinger2 with different distribution schemes and image sizes (speedup versus number of processors, 1 to 16, for the column-wise, row-wise and tile distributions of the 1600x1200 and 3200x2400 images)

Figure 2 shows the speedup for the different data distribution schemes and the two image sizes on the production cluster. The column-wise distribution clearly produces the best speedup; the reason is the superior usage of the message buffer when the data are distributed along the computational work-flow of the program. The overhead, i.e. the additional workload of the communication and parallelization infrastructure of the parallel program compared to the sequential one, lies between 70 and 92 percent. Figure 2 also shows that the overhead difference between the column-wise distribution and the other two distributions is over 20 percent. We further learned that, due to the high overhead rate (about 70%), the speedup depends heavily on the underlying communication hardware and software.

The execution of the parallel program on gescher confirmed our assumptions. Due to its less mature hardware and software architecture we obtained lower speedup values (see Figure 3).

Figure 3. Speedup on gescher with different distribution schemes and image sizes (speedup versus number of processors, 1 to 16, for the column-wise, row-wise and tile distributions of the 1600x1200 and 3200x2400 images)

A figurative justification of our statement is given by Figure 4, which shows the execution times for parallel CNN processing of a 1600x1200 image for two programs: the quasi-optimal case adhering to the stated "Rules of Thumb" and the worst case ignoring all of them.

Figure 4. Justification for the five "Rules of Thumb"


4 Conclusion

In this paper we presented five "Rules of Thumb" for the straightforward parallelization of cellular neural network simulations. Summing up, the five rules are:

• Distribute data uniformly,
• Align related data,
• Compute arrays along the memory alignment,
• Choose a distribution schema supporting the computational workflow, and
• Forget overlapping and adjustment (almost).

5 Acknowledgements

The work described in this paper was partly supported by the Special Research Program SFB F011 AURORA of the Austrian Science Fund.

References

[1] T. S. Abdelrahman and G. Liu. Overlap of computation and communications on shared-memory networks-of-workstations. Journal of Parallel and Distributed Computing Practices, 2(2):143–153, June 1999.
[2] L. O. Chua and L. Yang. Cellular neural networks: Theory. IEEE Transactions on Circuits and Systems, 35(10):1257–1272, October 1988.
[3] T. Roska and J. Vandewalle. Cellular Neural Networks. John Wiley and Sons, 1993.
[4] E. Schikuta. Data parallel software simulation of cellular neural networks. In CNNA'96 - 1996 Fourth IEEE International Workshop on Cellular Neural Networks and their Applications, pages 267–271, 1996.
[5] E. Schikuta, T. Fuerle, and H. Wanek. Structural data parallel simulation of neural networks. System Research and Info. Systems, 9:149–172, 2000.
[6] T. Weishaeupl and E. Schikuta. Parallelization of cellular neural networks for image processing on cluster architectures. In International Conference on Parallel Processing Workshops, pages 191–196. IEEE, October 2003.
