A Pipeline Technique for Dynamic Data Transfer on a Multiprocessor Grid


International Journal of Parallel Programming, Vol. 32, No. 5, October 2004 (© 2004)

Stavros Souravlas1 and Manos Roumeliotis2

Received January 31, 2003; revised June 29, 2003; accepted November 24, 2003

This paper describes a pipeline technique which is used to redistribute data on a multiprocessor grid during runtime. The main purposes of the algorithm are to minimize the data transfer time, prevent congestion on the ports of the receiving processors, and minimize the number of idle processors. One of the key ideas of this algorithm is the creation of processor classes, first introduced by Desprez et al. [IEEE Transactions on Parallel and Distributed Systems 9(2):192 (1998)]. Based on the idea of classes, we create the pipeline tasks used to organize the redistribution of data. Our experimental results show that this pipeline technique can significantly reduce the amount of time required to complete a dynamic data transfer task.

KEY WORDS: Block-cyclic redistribution; processor classes; pipeline tasks; High Performance Fortran.

1. INTRODUCTION

Many complicated parallel computing applications are composed of several stages. As the program proceeds from one stage to another, it may require a different distribution of data between several processor sets.(1) Examples of such applications are the alternating direction implicit method(2) and the multidimensional Fast Fourier Transform.(3–5) Data redistribution is therefore required each time the distribution of data to a processor is improper for a certain execution phase.(6) Since the redistribution must be completed during runtime, an efficient algorithm must be adopted.

1 University of Macedonia, Applied Informatics Department, 156 Egnatia St., 54006 Thessaloniki, Greece. E-mail: {sourstav, manos}@uom.gr
2 To whom correspondence should be addressed.


The interest in runtime redistribution was initially motivated by High Performance Fortran (HPF), a parallel programming language.(7,8) Programs developed in HPF use the ALIGN, DISTRIBUTE and REDISTRIBUTE directives to specify data redistributions.(9) Array redistribution in HPF has three types: Block, Cyclic, and Block-Cyclic.(10–12) Thakur et al.(12) have developed algorithms for these redistribution types that can be implemented in the library of an HPF compiler. The block-cyclic array redistribution is the most regular among the three types of redistribution.(13) Block-cyclic array redistribution with block size x is referred to as cyclic(x). Each block contains x successive elements. In this paper, we consider the problem of redistributing data from cyclic(r) on a P-processor grid to cyclic(s) on a Q-processor grid. Our algorithm aims at reducing the data transfer time by pipelining carefully created redistribution tasks.

The rest of the paper is organized as follows: In Section 2, a brief description of related work appears. In Section 3, we present the preliminaries for block-cyclic redistribution used in this paper. The last part of this section is dedicated to the formulation of communication classes, which will be used later on. In Section 4, we present the basic ideas behind our algorithm. In Section 5, we compare our strategy with the bipartite graphs based scheme proposed by Desprez et al.(1) and the generalized circulant matrix formalism based scheme.(3,14) Section 6 includes our experimental results.

2. RELATED WORK

The problem of runtime redistribution between several processors is very important, affecting the efficiency of parallel algorithms. For this reason, a lot of effort has been focused on it. The techniques for implementing the redistribution can be divided into two categories. In the early literature, the attention was mainly focused on the problem of building the messages to be exchanged between processor sets, with little or no attention paid to the problem of reducing communication overhead.(15,16) In the later literature, attention has also been focused on the problem of communication scheduling, that is, organizing the messages in such a way that the number of communication steps is reduced, to minimize start-up costs.(1,3,4,17)

Kalns et al.(15) provided a processor mapping technique for dynamic data redistribution. Their method minimized the amount of data to be moved among processor memories, by mapping logical processor IDs to redistributed data elements, without violating the redistribution patterns.


Thakur et al.(16) provided algorithms for runtime redistribution. Their work is divided into two cases: the general case of cyclic(x) to cyclic(y) redistribution, where there is no relation between x and y, and a special case where x is a multiple of y or y is a multiple of x. For the special case, they developed the KY-TO-Y algorithm. Each processor p calculates the destination processor pd of its first element as pd = mod(kp, P). Then, the first y elements are sent to pd, the next y to pd + 1 and so on, until the end of the first block. Then, the other blocks are moved in the same pattern. For the general case, they implemented the greatest common divisor (GCD) and the least common multiple (LCM) algorithms. The main idea was to redistribute from cyclic(x) to cyclic(m), where m was the LCM or GCD of x and y, using the KY-TO-Y algorithm which solved the special case.

The problem of message scheduling was first considered by Walker and Otto,(17) who worked on the problem of data redistribution from cyclic(x) to cyclic(Kx) on a grid of P processors. They provided synchronized and unsynchronized schemes that were free of conflicts. In the synchronized scheme, however, performance was reduced by the fact that some processors had to wait for others before receiving data, while the main problem of the unsynchronized algorithm was the necessity for buffering space equal to the data redistributed. Furthermore, the number of steps required for the implementation of those schemes was not minimal. Lim et al.(14) provided conflict-free direct, indirect and hybrid algorithms for moving from cyclic(r) to cyclic(Kr) on a P-processor grid. The direct algorithms sent the message directly from the source to the destination processor. In the indirect algorithm, messages having the same destination were gathered in an intermediate (relay) processor that started an exclusive session with the destination processor. The hybrid algorithm was a combination of the direct and indirect approach, where the first d steps were performed indirectly and the remaining directly. Park et al.(3) also provided algorithms for minimizing the communication steps and eliminating node contention in a special redistribution case, moving from cyclic(x) on P processors to cyclic(Kx) on Q processors. Their methods applied to non-all-to-all and all-to-all communication patterns.

The problem of message scheduling was successfully dealt with in a very interesting paper by Desprez et al.(1) Their effort was focused on solving the general redistribution problem, moving from cyclic(r) on a P-processor grid to cyclic(s) on a Q-processor grid. The main idea behind their algorithm was to create homogeneous communication patterns which they called classes. Section 3.3 briefly presents the way classes were created. Processor pairs in a certain class exchanged messages of the same

Table I. Transfer Costs for a Random Redistribution Case

        Q0    Q1    Q2    Q3
P0      9₀    4₁    7₂    4₁
P1      4₁    7₂    9₀    2₃
P2      9₀    2₃    4₁    7₂
P3      2₃    9₀    2₃    7₂

size. Having created the classes, they scheduled communication by mixing elements that belonged to different classes in such a manner that the elements of the most costly classes are distributed in the fewest possible steps. This strategy was called stepwise because it minimized the number of communication steps. They also introduced the greedy strategy, which utilized intermediate processors from source to destination to reduce the total communication cost; therefore, it required more steps than the stepwise strategy.

The following example illustrates how the created classes were used to organize communication between processors. Suppose that for a random redistribution problem, four classes were created: class0–class3. Table I presents the cost of communication between processor pairs (P, Q). For instance, the cost of sending a message from P0 to Q1 is 4 time units. The indices next to each number show the class number to which the corresponding processor pair belongs. The stepwise strategy would organize communication in such a way that the maximum number of "expensive" processor pairs exchange messages in a given communication step. A communication step begins immediately after the previous step is completed, and includes at most Q communicating processor pairs. This assures that each processor receives one message at a time, thus avoiding congestion on the receivers' ports. The cost of each step equals the highest cost associated with the processor pairs that communicate at this step. Table II shows the communication steps that the stepwise strategy would create and the cost of each step. Each row of the table shows the indices of the destination processors to which sources P0–P3 send a message at each communication step (step 0–step 3). The total communication cost for this redistribution is the sum of the costs of the 4 steps: 9 + 9 + 4 + 4 = 26 time units. This example clearly illustrates that we cannot use communication classes alone to organize a free-of-conflicts communication scheme. For instance, if the processor pairs of class0 communicate in one step, congestion will occur at the port of receiver Q0, since two messages from P0 and P2 will arrive simultaneously. The greedy

Table II. Communication Steps of the Stepwise Strategy

          P0    P1    P2    P3    Cost of step
Step 0    0     2     3     1     9
Step 1    2     1     0     3     9
Step 2    1     3     2     0     4
Step 3    3     0     1     2     4

strategy aims at organizing communication in more steps, provided that the use of more steps reduces the communication cost. For example, the cost of communication from P1 to Q2 is 9 time units. However, if we use processor 3 as an intermediate processor between sender 1 and receiver 2, the cost will be 2 (communication cost from 1 to 3) + 2 (communication cost from 3 to 2) = 4 time units. The greedy strategy has two disadvantages: (1) it requires more communication steps, thus increasing the startup cost, and (2) "cheap" intermediate processors are not always available, thus it cannot be applied in all redistribution problems. For either strategy, the created communication steps were synchronized: each step begins immediately after completion of the previous step. An extension of this work(18) offered a solution for the case where the communication steps overlap. The proposed solution was obtained by splitting the communication grid into subtables and performing communication scheduling for processors that belonged to the same diagonal of subtables. Since there is no intersection between senders or receivers of diagonal subtables, the communication steps were contention-free, even if they were not synchronized.

In this paper, we try to make a more efficient use of Desprez et al.'s idea of classes and organize redistribution tasks in such a manner that the total communication cost is reduced by minimizing the time the processors remain idle, while each processor receives only one message at a time. The key idea of our scheme is to pipeline redistribution tasks that are composed of carefully selected members of the classes.

3. PRELIMINARIES

3.1. Definitions

Definition 1. The data redistributed in a block-cyclic fashion are represented by an array of size M which is called the data array. A data array element is an element of the redistributed data and it will be indexed with i. Indexing begins from zero, thus i ∈ [0 … M − 1].


Definition 2. A processor grid can be represented by a two-dimensional table. This table is called the communication grid. The communication grid is indexed by the set of processor pairs

    {(p, q) ∈ [0 … P − 1] × [0 … Q − 1]}                                   (1)

Obviously, p is the source processor index while q is the destination processor index.

Definition 3. Data that are distributed in a block-cyclic fashion are divided into data blocks. If each block contains r elements, then, provided that r divides M, the data array will be divided into Mb blocks where

    Mb = M / r                                                             (2)

If r does not divide M, then Mb = ⌊M/r⌋ + 1. We use variable l as a block index that relates data blocks to the processors of the communication grid in a cyclic manner. Therefore l belongs to the interval [0 … Mb/P) or, because of Eq. (2), [0 … M/(Pr)). Finally, variable x indexes the local position of an element inside a block. This means that 0 ≤ x < r.

Definition 4. The source distribution R(i, p, l, x) is the mapping of a data array element with index i to a processor index p, a block index l and a local position x inside the block, where:(1)

    l = ⌊⌊i/r⌋ / P⌋,    p = ⌊i/r⌋ mod P,    and    x = i mod r             (3)

From the relationships listed in Eq. (3), we will try to derive a formula for a data element i that is redistributed in a block-cyclic manner. Since l = ⌊⌊i/r⌋ / P⌋ and p = ⌊i/r⌋ mod P, it follows that

    ⌊i/r⌋ = lP + p                                                         (4)

We also know that x = i mod r. Thus, from Eq. (4) we derive

    i = (lP + p)r + x                                                      (5)
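As a quick sanity check of Eqs. (3)–(5), the following sketch maps a global element index i to its (p, l, x) triple and then reconstructs i; the helper name and the concrete values of P, r and M are illustrative only, not taken from the paper.

```python
def source_distribution(i, P, r):
    """Map element index i to (processor p, block index l, local position x), Eq. (3)."""
    block = i // r          # global block number of element i
    p = block % P           # blocks are dealt to the P processors cyclically
    l = block // P          # block index local to processor p
    x = i % r               # position of i inside its block
    return p, l, x

# Eq. (5): i = (lP + p) r + x recovers the global index for every element.
P, r, M = 6, 4, 120
for i in range(M):
    p, l, x = source_distribution(i, P, r)
    assert i == (l * P + p) * r + x
```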


3.2. Communication Cost and Patterns

Consider an element that is distributed cyclic(s) on Q processors. The number of blocks created is Mb = M/s, where s is the block size. Variable m relates data blocks to the processors and its bounds are found in the interval [0 … Mb/Q) or, because Mb = M/s, [0 … M/(Qs)). The target distribution R(j, q, m, y) is defined in the same way as the source distribution. Parameters (j, q, m, y) have the same meaning as (i, p, l, x) of the source distribution. Similarly to variable i, we can derive an equation for the distribution of element j:

    j = (mQ + q)s + y                                                      (6)

Suppose that we try to redistribute a data array from cyclic(r) on P processors to cyclic(s) on Q processors. In this case, changes will occur for all elements as far as their processor, block and local position indices are concerned. These changes are described by R(i, p, l, x) = R(j, q, m, y), or

    (lP + p)r + x = (mQ + q)s + y                                          (7)

This linear Diophantine equation is subject to the following restrictions:

    0 ≤ p < P,        0 ≤ q < Q                                            (8)
    0 ≤ l < M/(Pr),   0 ≤ m < M/(Qs)                                       (9)
    0 ≤ x < r,        0 ≤ y < s                                            (10)

However, the following proposition offers tighter bounds for l and m.

Proposition 5. Block-cyclic redistribution is a periodical procedure and it repeats in the same pattern for every L = LCM(Pr, Qs) data elements, where L is the least common multiple (LCM) of Pr and Qs.(19,20)

Therefore, if L < M, we can limit our interest to the first L elements of the data array. The remaining elements will follow the same redistribution pattern.

Definition 6. The cost of transferring a message from a sending processor p to a receiving processor q is called the communication cost. The communication cost is measured in time units, where a time unit contains


h clock cycles and h is the number of cycles required by each pipeline segment.

To find the communication cost for a processor pair (p, q), we compute the number of quadruples (l, m, y, x) that satisfy the redistribution Eq. (7), given the number of sending (P) and receiving (Q) processors and the vector sizes of the source (r) and the target (s) distributions.(1) The redistribution equation can be solved using Euclid's theorem for solving linear Diophantine equations. It must be mentioned that, if for a given communicating pair (p, q) there is no quadruple (l, m, y, x) that satisfies Eq. (7), then there is no communication between p and q. This means that for a given multiprocessor grid, there may be all-to-all or non-all-to-all communication between processor pairs. The criterion for the communication pattern in a given network of processors is given below.(1,14) The communication pattern in a redistribution from cyclic(r) to cyclic(s) on a given number of processors is:

• All-to-all communication if the greatest common divisor of Pr and Qs (it will be referred to as gcd for the rest of the paper), name it g, is such that g ≤ r + s − 1;
• Non-all-to-all communication if g is such that g ≥ r + s.
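A minimal check of this criterion (the helper name is ours, not the paper's) is sketched below; it reproduces the classification of the two redistribution cases used later in Section 5.2.

```python
from math import gcd

def communication_pattern(P, Q, r, s):
    """Classify the redistribution pattern from g = gcd(Pr, Qs)."""
    g = gcd(P * r, Q * s)
    return "all-to-all" if g <= r + s - 1 else "non-all-to-all"

print(communication_pattern(16, 16, 7, 11))   # g = 16 <= 7 + 11 - 1 -> all-to-all
print(communication_pattern(15, 15, 3, 5))    # g = 15 >= 3 + 5      -> non-all-to-all
```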

The variables which are used in this paper are summarized in Table III.

Table III. Definitions of Variables in this Paper

Variable    Definition
M           Size of the data array
i           Index of an element distributed cyclic(r) on P processors (source distribution)
j           Index of an element redistributed cyclic(s) on Q processors (target distribution)
p           Source processor index
q           Destination processor index
r           Block size of the source distribution
s           Block size of the target distribution
l           Block index in the source distribution
m           Block index in the target distribution
x           Local position of an element inside a block in the source distribution
y           Local position of an element inside a block in the target distribution
g           The greatest common divisor of Pr and Qs, or gcd(Pr, Qs)
L           The least common multiple of Pr and Qs, or LCM(Pr, Qs)


3.3. Communication Classes

In this section, we briefly present how Desprez et al.(1) created communication patterns that are homogeneous in terms of cost, which they called classes. Consider the following function:

    f(p, q) = (pr − qs) mod g                                              (11)

where (p, q) ∈ [0 … P − 1] × [0 … Q − 1] and g = gcd(Pr, Qs). A pair of processors (p, q) belongs to a communication class k if

    f(p, q) = k,   or   (pr − qs) mod g = k                                (12)

As Eq. (12) indicates, all pairs of processors that communicate belong to a class k = (pr − qs) mod g. To calculate the communication cost for the processor pairs of a given class, we need to rewrite Eq. (7) as

    mQs − lPr = pr − qs + (x − y)                                          (13)

We know that g = gcd(Pr, Qs), making mQs − lPr a multiple of g. This means that there is an integer λ such that mQs − lPr = λg. If we also set z = x − y, then Eq. (13) is rewritten as

    λg − z = pr − qs                                                       (14)

If we take both parts of Eq. (14) modulo g, we get (λg − z) mod g = (pr − qs) mod g ⇒ (λg mod g) − (z mod g) = (pr − qs) mod g ⇒ (0 − z) mod g = (pr − qs) mod g ⇒ (−z) mod g = (pr − qs) mod g. Since z = x − y, it is obvious that −z = y − x. Therefore we derive the equation

    (y − x) mod g = (pr − qs) mod g,   or   (y − x) mod g = k              (15)


Equation (15) states that we can calculate the communication cost for a given processor pair (p, q) by finding all the pairs (x, y) for which (y − x) mod g = k, instead of solving Eq. (7). The main characteristics of classes are summarized as follows:

• All processor pairs that belong to a certain class communicate with the same cost (recall that k is the same for all processor pairs of a given class).
• The maximum number of classes that exist in a redistribution problem is g.
• There may be two or more classes that have the same communication cost. This means that the use of classes does not guarantee a free-of-conflicts communication scheme. Instead, processor pairs that belong to different classes must be mixed to produce an optimal communication schedule.
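The equivalence claimed by Eq. (15) is easy to verify numerically. The sketch below counts, for one pair (p, q), the quadruples of Eq. (7) restricted to one period L (Proposition 5) and compares the result with the (x, y) count of Eq. (15); the function names and the sample parameters are ours.

```python
from math import gcd, lcm   # math.lcm requires Python 3.9+

def cost_by_eq7(p, q, P, Q, r, s):
    """Count quadruples (l, m, x, y) satisfying Eq. (7) within one period L."""
    L = lcm(P * r, Q * s)
    count = 0
    for l in range(L // (P * r)):
        for m in range(L // (Q * s)):
            for x in range(r):
                for y in range(s):
                    if (l * P + p) * r + x == (m * Q + q) * s + y:
                        count += 1
    return count

def cost_by_eq15(p, q, P, Q, r, s):
    """Count pairs (x, y) with (y - x) mod g == (pr - qs) mod g, Eq. (15)."""
    g = gcd(P * r, Q * s)
    k = (p * r - q * s) % g
    return sum((y - x) % g == k for x in range(r) for y in range(s))

# Both counts agree for every processor pair, e.g. for P = Q = 6, r = 4, s = 5.
P, Q, r, s = 6, 6, 4, 5
assert all(cost_by_eq7(p, q, P, Q, r, s) == cost_by_eq15(p, q, P, Q, r, s)
           for p in range(P) for q in range(Q))
```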

4. BASIC IDEAS

The basic idea behind our algorithm is to decompose the redistribution problem into a set of pipeline operations. Each pipeline includes a specified number of tasks which are responsible for the communication between carefully selected processor pairs. The main properties of the pipeline operations and their tasks are:

(1) Each pipeline task handles the transmission of data between processor pairs that have the same communication cost;
(2) A pipeline operation cannot include more than one task that handles message transmissions of a certain cost;
(3) The time required for the execution of a task equals the communication cost of the processor pairs it includes;
(4) The time required for the execution of a pipeline operation equals the execution time of its longest task; and
(5) All tasks are scheduled in such a way that receiving processors get one message at a time, thus congestion is avoided.

Data redistribution is composed of three stages: (a) the creation of the pipeline tasks stage, (b) the message preparation stage and (c) the sending stage. These three stages, as well as the necessary theorems and propositions, are described in the following paragraphs.

4.1. Stage 1: Creating the Pipeline Tasks

As mentioned above, the pipeline tasks must be designed in such a way that receiving processors get one message at a time. To satisfy this requirement, we schedule the tasks in such a way that each task includes a

Table IV. CPT for P = Q = 6, r = 4, s = 5

Class    Processor pairs                                  Communication cost
b(0)     (0,0), (3,0), (1,2), (4,2), (2,4), (5,4)         4
b(1)     (0,1), (3,1), (1,3), (4,3), (2,5), (5,5)         4
b(2)     (2,0), (5,0), (0,2), (3,2), (1,4), (4,4)         3
b(3)     (2,1), (5,1), (0,3), (3,3), (1,5), (4,5)         3
b(4)     (1,0), (4,0), (2,2), (5,2), (0,4), (3,4)         3
b(5)     (1,1), (4,1), (2,3), (5,3), (0,5), (3,5)         3

specified number of message transmissions where the destination processors differ but the cost is the same. Therefore, our first job is to sort out all processor pairs of the communication grid with respect to their communication cost. We will use the idea of classes introduced by Desprez et al.:(1) a given processor pair belongs to a specific class b(k) if k = (pr − qs) mod g. A brief example will show how classes are created. Consider a redistribution case where P = Q = 6, r = 4 and s = 5. In this case, g = 6. We define as the Class Processor Table, or CPT, the table that shows the class to which each processor pair belongs. The CPT for this redistribution example is shown in Table IV. When (p, q) = (4, 3) then pr − qs = 16 − 15 = 1. Thus, (pr − qs) mod g = 1 mod 6 = 1. This means that the processor pair (4, 3) belongs to class k = 1. In the same way, all processor pairs are grouped in classes. Having defined the processor pairs that belong to each class, our aim is to decide about the following:

• The number of pipeline operations and the number of tasks an operation must include.
• An upper bound for the number of processor pairs that we select from each class to be included in a pipeline task.
• The number of classes out of which we can select processor pairs to complete a minimum of Q transmissions (one message for each destination processor) for a pipeline operation.

As stated at the beginning of this section, our algorithm tries to minimize the time that processors remain idle. This can be achieved if each pipeline operation contains the maximum number of tasks, that is, if a maximum number of message exchanges is performed by each single pipeline. If the number of all different communication costs that exist among the b(k) processor classes is d, then a pipeline operation is composed of a maximum of d tasks and can satisfy up to dQ message transmissions, without congestion on the destination ports.


To define an upper bound for the number of processor pairs in a class b(k) that will participate in a pipeline task, we will use the following theorem (for proof see Desprez et al.(1)). Initially, we set s′ = gcd(s, P) and r′ = gcd(r, Q). Since s′ divides P and r′ divides Q, there must be integers P′ and Q′ such that P = P′s′ and Q = Q′r′. Also, we set g0 = gcd(P′, Q′).

Theorem 7. Each class b(k) contains exactly PQ/g = P′Q′/g0 processor pairs.

Theorem 7 leads to the following corollaries: (1) the number of sending requests to a specific destination inside a class is equal to P′/g0, and (2) there are exactly Q′ different destinations inside each class, thus a pipeline task can satisfy no more than Q′ message exchanges between processor pairs from a given class, because more would cause congestion in the destination ports. In the example of Table IV we have P′ = 6 and Q′ = 3, therefore g0 = 3. Each of the classes b(0)–b(5) contains P′/g0 = 2 sending requests to each of the Q′ = 3 different destinations. For instance, in class b(0) there are two sending requests to destination processor 0, from senders 0 and 3. Also, there are three different destination processors inside this class: processors 0, 2, and 4. This means that a pipeline task can satisfy no more than Q′ = 3 message exchanges between processor pairs that belong to class b(0). To define the number of classes out of which we can select processor pairs to complete a minimum of Q transmissions for a pipeline operation, we will use Proposition 8.

Proposition 8. To complete a pipeline operation with the minimum number of transmissions, we need to select processor pairs from r′ different classes.

Proof. The minimum number of message exchanges for a pipeline operation corresponds to "one message for each destination processor", that is, Q messages in total. To show that a pipeline with the Q transmissions can be completed using r′ processor classes, we must consider that a pipeline task can satisfy at most Q′ transmissions from one class, otherwise congestion will occur. From the relationship Q = Q′r′, we can easily conclude that we must select processor pairs from r′ classes to complete Q transmissions.

In the following, the creation of the pipeline operations and their tasks is described step by step. These steps will then be illustrated by an example.


Step 1: Solve Eq. (12) for all processor pairs (p, q) to define the processor classes and create the class processor table (CPT).
Step 2: Find the total communication cost Cb(k) for each class b(k) from the number of combinations of x and y for which Eq. (15) has a solution for a given pair of processors (p, q). Also, find d, the number of different costs that exist in the scheme.
Step 3: Start from the class b(k) for which the communication cost Cb(k) is minimum and get Q′ processor pairs. If the pairs selected from b(k) can form a task of Q transmissions, that is, if Q′ = Q, move to Step 4. Otherwise, check if there is a class of the same cost as b(k) from which to add up to Q − Q′ pairs. In either case, the processor pairs that a task Ti includes must be such that all destination processor indices differ: Ti = {(pλ0, qµ0), (pλ1, qµ1), …}, where µ0, µ1, … are pairwise different.
Step 4: Find the class b(k) with the next communication cost and repeat Step 3. Tasks with the same communication cost are not allowed in the same pipeline operation. Once a pipeline includes dQ message exchanges, the pipeline operation is completed. Go to Step 5.
Step 5: Check the value of d to find the number of different costs for the rest of the processor pairs and use Steps 3 and 4 to create the next pipeline operation.
Step 6: When all processor pairs are included in a pipeline operation, terminate; otherwise return to Step 1.

Consider the redistribution for P = Q = 9, r = 4 and s = 5. In this case, g = 9. The implementation of Steps 1 and 2 leads to the CPT shown in Table V. The value of d is 4, since there are four different communication costs in the scheme, varying from 1 to 4 time units (see the transfer costs for each class in Table V). According to Step 3, we get Q′ = 9 processor pairs from r′ = 1 class to create a task of Q = 9 transmissions. We can select pairs from either of the two classes b(4), b(6), since they have the same communication cost of one time unit. Suppose that we select from class b(4). These processor pairs will form the first task T0 of the first pipeline operation. According to Step 4, the processor pairs of class b(6) cannot be used in any of the tasks for this pipeline, because the same communication cost of one unit would appear twice for all destinations. For the same reason the classes b(2) and b(8), b(0) and b(1), b(3) and b(7) are mutually exclusive.
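The steps above can be condensed into a short scheduling routine. The sketch below is a simplified rendering, under the assumption that every task draws its pairs from a single class (which suffices whenever Q′ = Q, as in the example that follows); Step 3's merging of equal-cost classes is omitted, and all names are illustrative rather than taken from the paper.

```python
from math import gcd

def schedule_pipelines(P, Q, r, s):
    """Step 1: group pairs into classes; Step 2: compute class costs;
    Steps 3-5: build pipeline operations whose tasks have pairwise different
    costs and, within each task, pairwise different destinations."""
    g = gcd(P * r, Q * s)
    classes = {}
    for p in range(P):
        for q in range(Q):
            classes.setdefault((p * r - q * s) % g, []).append((p, q))
    cost = {k: sum((y - x) % g == k for x in range(r) for y in range(s))
            for k in classes}
    pending = {k: pairs[:] for k, pairs in classes.items() if cost[k] > 0}

    pipelines = []
    while any(pending.values()):
        operation, used_costs = [], set()
        for k in sorted(pending, key=lambda c: cost[c]):      # cheapest class first
            if not pending[k] or cost[k] in used_costs:       # one task per cost
                continue
            task, dests = [], set()
            for pair in pending[k][:]:
                if pair[1] not in dests:                      # distinct destinations
                    task.append(pair)
                    dests.add(pair[1])
                    pending[k].remove(pair)
            operation.append((cost[k], task))
            used_costs.add(cost[k])
        pipelines.append(operation)
    return pipelines

# Reproduces the structure of Tables VI and VII: two operations of four tasks
# (costs 1-4, nine pairs each); ties between equal-cost classes may be broken
# differently than in the paper.
for i, op in enumerate(schedule_pipelines(9, 9, 4, 5), 1):
    print("pipeline", i, [(c, len(t)) for c, t in op])
```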


Table V. CPT for P = Q = 9, r = 4, s = 5

Class    Processor pairs                                                          Communication cost
b(0)     (0,0), (8,1), (7,2), (6,3), (5,4), (4,5), (3,6), (2,7), (1,8)            4
b(1)     (7,0), (6,1), (5,2), (4,3), (3,4), (2,5), (1,6), (0,7), (8,8)            4
b(2)     (5,0), (4,1), (3,2), (2,3), (1,4), (0,5), (8,6), (7,7), (6,8)            3
b(3)     (3,0), (2,1), (1,2), (0,3), (8,4), (7,5), (6,6), (5,7), (4,8)            2
b(4)     (1,0), (0,1), (8,2), (7,3), (6,4), (5,5), (4,6), (3,7), (2,8)            1
b(6)     (6,0), (5,1), (4,2), (3,3), (2,4), (1,5), (0,6), (8,7), (7,8)            1
b(7)     (4,0), (3,1), (2,2), (1,3), (0,4), (8,5), (7,6), (6,7), (5,8)            2
b(8)     (2,0), (1,1), (0,2), (8,3), (7,4), (6,5), (5,6), (4,7), (3,8)            3

Table VI shows the first pipeline and its tasks T0–T3. These tasks include processor pairs from b(0), b(2), b(3), and b(4). In Step 5, we check the value of d to find the number of different costs for the remaining processor pairs of b(1), b(6), b(7), and b(8). We have d = 4. Therefore, we follow Steps 3 and 4 to create the second pipeline (see Table VII).

To compute an upper bound for the time the pipeline strategy requires to complete a given redistribution following the steps presented

Table VI. First Pipeline Operation and its Tasks for P = Q = 9, r = 4, s = 5

Pipeline task    Communicating processor pairs (p, q)                              Communication cost
T0               (1,0), (0,1), (8,2), (7,3), (6,4), (5,5), (4,6), (3,7), (2,8)     1
T1               (3,0), (2,1), (1,2), (0,3), (8,4), (7,5), (6,6), (5,7), (4,8)     2
T2               (5,0), (4,1), (3,2), (2,3), (1,4), (0,5), (8,6), (7,7), (6,8)     3
T3               (7,0), (6,1), (5,2), (4,3), (3,4), (2,5), (1,6), (0,7), (8,8)     4

Table VII. Second Pipeline Operation and its Tasks for P = Q = 9, r = 4, s = 5

Pipeline task    Communicating processor pairs (p, q)                              Communication cost
T0               (6,0), (5,1), (4,2), (3,3), (2,4), (1,5), (0,6), (8,7), (7,8)     1
T1               (4,0), (3,1), (2,2), (1,3), (0,4), (8,5), (7,6), (6,7), (5,8)     2
T2               (2,0), (1,1), (0,2), (8,3), (7,4), (6,5), (5,6), (4,7), (3,8)     3
T3               (0,0), (8,1), (7,2), (6,3), (5,4), (4,5), (3,6), (2,7), (1,8)     4


above, we assume that the communication grid is a network of N processors, where N = max(P, Q). This means that the longest distance among all processor pairs in the network is O(log N). Thus, the cost of a pipeline task is at most O(log N). Each pipeline operation is composed of at most d pipeline tasks executed simultaneously, therefore an upper bound for the time required to complete a redistribution via pipelining is O(d log N).

Once the pipeline operations and their tasks are scheduled, the messages must be prepared for transmission. This phase is described in the next paragraph.

4.2. Stage 2: Message Preparation

The message preparation phase includes the time in which processors perform internal memory read operations to gather the proper data and form the messages to be sent. Consider the previous redistribution example, where P = Q = 9, r = 4 and s = 5. The reading operations can be described using Fig. 1. The horizontal axis displays the time in clock cycles, while the vertical axis gives the segment number. At clock cycle 1, task T0 of the first pipeline operation is handled by the first pipeline segment; all sending processors included in task T0 (see Table VI) read a message from their memory. After one clock cycle, the second pipeline segment is responsible for the reading operations of T0, while segment 1 is busy with T1. After the fourth cycle, all sending processors that belong to T0 complete their reading operation. All the reading operations for the first pipeline are completed after the seventh cycle. It is important to note that at clock cycle 5, segment 1 is free, therefore it can handle the reading operations of task T0 of the second pipeline. This is the main difference of this strategy compared to other strategies in the literature. Each sending processor does not have to wait for the completion of message reading of all other processors, thus its idle

Fig. 1. Message preparation stage for P = Q = 9, r = 4, s = 5 (pipeline segments 1–4 on the vertical axis, clock cycles 1–7 on the horizontal axis).


time is reduced. We define the message preparation cost CPREP as the time required for the completion of the reading operations that occur during the message preparation stage.

4.3. Stage 3: Sending the Messages

When all messages are read, the pipeline operations execute and generate a number of communications between several processors. Figure 2 shows the execution of two pipeline operations for the redistribution example given. Each pipeline operation is composed of four tasks (T0–T3). The time required for the execution of a task equals the communication cost of the processor pairs it includes. This time is measured in time units. Suppose that communication starts at time 1. All tasks of the pipeline operation execute simultaneously, transferring messages from sending to receiving processors. At time 2, messages included in T0 arrive at their destination, while tasks T1, T2, and T3 are still executing (recall that the communication costs for T0, T1, T2, and T3 are 1, 2, 3, and 4, respectively). At time 4, the only task which is not completed is T3. Obviously, the pipeline tasks are completed at different times. Since each task cannot contain more than one message to a given destination, congestion at the receiving processors' ports is avoided. The execution cost CSEND of a pipeline operation equals the maximum communication cost (maxcost) that exists among the pipeline tasks. This is described as

    CSEND = maxcost[TN]                                                    (16)

where TN is a pipeline task, N ∈ [1 … d]. Since each pipeline task generates a number of communications, it is associated with a startup cost, CSTARTUP. The startup cost includes procedure call

Fig. 2. Execution of the first and second pipelines for P = Q = 9, r = 4, s = 5 (pipeline segments on the vertical axis, time in time units 1–8 on the horizontal axis).


overheads, indexing calculations and error controls. If the startup cost of a task is a, then the startup cost of an N-task pipeline operation is aN.

4.4. The Total Redistribution Cost

The total redistribution cost CREDISTR is the sum of the message preparation cost CPREP, the execution cost CSEND and the startup cost CSTARTUP over all the pipelines created:

    CREDISTR = Σ_{j=1}^{n} (CPREP + CSEND + CSTARTUP)                      (17)

where n is the number of pipeline operations created.
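The cost model of Eqs. (16) and (17) translates directly into a few lines of code. The sketch below assumes a pipeline operation is represented as a list of (cost, task) entries, and it takes the per-task startup cost a and the preparation cost as given inputs; all names are ours, not the paper's.

```python
def redistribution_cost(pipelines, prep_cost, a):
    """Eq. (17): for every pipeline operation add its preparation cost, its
    execution cost (Eq. (16): the most expensive task) and N*a startup cost."""
    total = 0
    for operation in pipelines:                      # operation = [(cost, task), ...]
        c_send = max(cost for cost, _ in operation)  # Eq. (16)
        c_startup = a * len(operation)               # N tasks -> N startups
        total += prep_cost + c_send + c_startup
    return total

# Ignoring preparation and startup (as Section 5 does), the two operations of
# Tables VI-VII cost 4 + 4 = 8 time units of pure transfer.
example = [[(1, None), (2, None), (3, None), (4, None)]] * 2
print(redistribution_cost(example, prep_cost=0, a=0))   # -> 8
```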

5. COMPARISON OF THE PIPELINE SCHEME WITH KNOWN STRATEGIES

In this section, we separately compare our scheme with two of the algorithms found in the literature: the bipartite graphs based scheme(1) and the generalized circulant matrix formalism based scheme.(13,14) The generalized circulant matrix formalism based scheme performs better(3) than the bipartite graphs based scheme in the special redistribution case where s is an arbitrary multiple of r. The bipartite graphs based scheme solves the general redistribution problem where s and r are relatively prime; this situation cannot be dealt with by the matrix formalism schedule. In the following, the message preparation cost and the startup cost are not taken into account. Comparisons are only made on communication costs.

5.1. Pipeline versus Matrix Formalism

The generalized circulant matrix formalism is used to solve an important instance of the data redistribution problem, that is, moving from cyclic(r) on a P-processor grid to cyclic(s = Kr) on a Q-processor grid. The integer K for this special redistribution case is called the expansion factor. To compare the two schemes, we will use an example of a redistribution from P = 4 to Q = 6 with vector sizes r = 2, s = 6. In this case, K = 3. Initially, we applied the matrix transformations indicated by the authors.(3,14) Table VIII shows the communication steps created by the application of these transformations. Then, we applied the steps of our strategy, found in Section 4.1. Tables IX and X show the first and the second pipeline operation, respectively. Each operation consists of two pipeline tasks, T0 and T1.


Table VIII. Steps Created by Matrix Transformations for P = 4, Q = 6, s = 6, r = 2

Step      Communicating processor pairs (p, q)       Communication cost
Step 0    (0,0), (1,3), (2,2), (3,1)                 6
Step 1    (0,4), (1,1), (2,0), (3,5)                 6
Step 2    (0,2), (1,5), (2,4), (3,3)                 6
Step 3    (0,1), (1,0), (2,3), (3,2)                 3
Step 4    (0,5), (1,4), (2,1), (3,0)                 3
Step 5    (0,3), (1,2), (2,5), (3,4)                 3

The difference in terms of transfer time between the two strategies can be easily explained. A pipeline operation can complete communication in a series of pipeline tasks. Each task generates six communications, as shown in Tables IX and X. The execution cost of a pipeline operation equals the cost of the longest task, call it CL. During this CL time, communications of lower costs are also completed. For this example CL = 6, which is the cost of the longest task T1. In a period of 6 time units, communications between processors of T0 are also completed. This is not the case with the matrix formalism. Each step generates only four communications. This number is significantly smaller than the 12 communications generated by our scheme. Theoretically, our gain is (12 − 4)/12 ≈ 66%. To compute the total communication cost of the circulant matrix formalism strategy, we simply add the communication costs of all steps, that is, 6 + 6 + 6 + 3 + 3 + 3 = 27 time units. The time required by our scheme is the execution time of two pipeline operations, that is, 2 × 6 = 12 time units.

Table IX. First Pipeline Operation and its Tasks for P = 4, Q = 6, s = 6, r = 2

Pipeline task    Communicating processor pairs (p, q)           Communication cost
T0               (1,0), (0,1), (1,2), (0,3), (1,4), (0,5)       3
T1               (0,0), (1,1), (0,2), (1,3), (0,4), (1,5)       6

Table X. Second Pipeline Operation and its Tasks for P = 4, Q = 6, s = 6, r = 2

Pipeline task    Communicating processor pairs (p, q)           Communication cost
T0               (3,0), (2,1), (3,2), (2,3), (3,4), (2,5)       3
T1               (2,0), (3,1), (2,2), (3,3), (2,4), (3,5)       6


5.2. Pipeline versus Bipartite Graphs Based Scheme

The bipartite graphs based scheme introduced by Desprez et al. is used to solve the general redistribution problem. We will compare the two schemes using three different redistribution cases:

• P > Q, non-all-to-all communication grid;
• P = Q, non-all-to-all communication grid;
• P = Q, all-to-all communication grid.

For the first case, we choose P = 15, Q = 6, r = 2 and s = 3. The implementation of Desprez et al.'s idea organizes the processor pairs in four classes: b(0), b(1), b(2), and b(5). These classes are shown in Table XI. The bipartite graphs based scheme offers two strategies for the redistribution problem: the stepwise strategy, which minimizes startup costs, and the greedy strategy, which reduces communication costs. The total cost for our example is 20 time units for the first strategy and 18 time units for the second. The number of communication steps required is 10 and 12, respectively.(1) Our strategy requires five pipeline operations, each including two tasks (see Table XII), to complete communication. The execution time for each pipeline operation equals the execution time of T1, that is, 2 time units. Therefore, all processors communicate in 2 × 5 = 10 time units. This offers an improvement of (20 − 10)/20 = 50% compared to the stepwise strategy and (18 − 10)/18 ≈ 44% compared to the greedy strategy.

In the second example, we have P = Q = 15, r = 3, s = 5. The communication pattern is non-all-to-all, and the number of senders equals the number of receivers. Seven classes of processor pairs were generated. These classes are given in Table XIII.

Table XI. Classes Created for P = 15, Q = 6, r = 2 and s = 3

Class    Communicating processor pairs (p, q)                                                                           Cost
b(0)     (0,0), (3,2), (6,4), (3,0), (6,2), (9,4), (6,0), (9,2), (12,4), (9,0), (12,2), (0,4), (12,0), (0,2), (3,4)     2
b(1)     (2,1), (5,3), (8,5), (5,1), (8,3), (11,5), (8,1), (11,3), (14,5), (11,1), (14,3), (2,5), (14,1), (2,3), (5,5)  2
b(2)     (1,0), (4,2), (7,4), (4,0), (7,2), (10,4), (7,0), (10,2), (13,4), (10,0), (13,2), (1,4), (13,0), (1,2), (4,4)  1
b(5)     (1,1), (4,3), (7,5), (4,1), (7,3), (10,5), (7,1), (10,3), (13,5), (10,1), (13,3), (1,5), (13,1), (1,3), (4,5)  1


Table XII. Pipeline Operations and their Tasks for P = 15, Q = 6, r = 2 and s = 3

Operation    Task    Communicating processor pairs (p, q)               Communication cost
1            T1      (0,0), (3,2), (6,4), (2,1), (5,3), (8,5)           2
             T0      (1,0), (4,2), (7,4), (1,1), (4,3), (7,5)           1
2            T1      (3,0), (6,2), (9,4), (5,1), (8,3), (11,5)          2
             T0      (4,0), (7,2), (10,4), (4,1), (7,3), (10,5)         1
3            T1      (6,0), (9,2), (12,4), (8,1), (11,3), (14,5)        2
             T0      (7,0), (10,2), (13,4), (7,1), (10,3), (13,5)       1
4            T1      (9,0), (12,2), (0,4), (11,1), (14,3), (2,5)        2
             T0      (10,0), (13,2), (1,4), (10,1), (13,3), (1,5)       1
5            T1      (12,0), (0,2), (3,4), (14,1), (2,3), (5,5)         2
             T0      (13,0), (1,2), (4,4), (13,1), (1,3), (4,5)         1

Table XIII. Classes Created for P = 15, Q = 15, r = 3 and s = 5

Class    Communicating processor pairs (p, q)                                                                                    Cost
b(0)     (0,0), (5,3), (10,6), (5,0), (10,3), (0,6), (10,0), (0,3), (5,6), (0,9), (5,12), (5,9), (10,12), (10,9), (0,12)         3
b(1)     (2,1), (7,4), (12,7), (7,1), (12,4), (2,7), (12,1), (2,4), (7,7), (2,10), (7,13), (7,10), (12,13), (12,10), (2,13)      3
b(2)     (4,2), (9,5), (14,8), (9,2), (14,5), (4,8), (14,2), (4,5), (9,8), (4,11), (9,14), (9,11), (14,14), (14,11), (4,14)      3
b(3)     (1,0), (6,3), (11,6), (6,0), (11,3), (1,6), (11,0), (1,3), (6,6), (1,9), (6,12), (6,9), (11,12), (11,9), (1,12)         2
b(4)     (3,1), (8,4), (13,7), (8,1), (13,4), (3,7), (13,1), (3,4), (8,7), (3,10), (8,13), (8,10), (13,13), (13,10), (3,13)      1
b(13)    (1,1), (6,4), (11,7), (6,1), (11,4), (1,7), (11,1), (1,4), (6,7), (1,10), (6,13), (6,10), (11,13), (11,10), (1,13)      1
b(14)    (3,2), (8,5), (13,8), (8,2), (13,5), (3,8), (13,2), (3,5), (8,8), (3,11), (8,14), (8,11), (13,14), (13,11), (3,14)      2

The bipartite graphs scheme offers a solution with 10 communication steps and a total cost of 26 time units.(1) The implementation of the pipeline based scheme organizes the communication in six pipeline operations. Table XIV shows operations 1–3 and their tasks. Initially, task T0 is assigned five processor pairs from class b(4), which has the lowest cost (1 time unit).

Table XIV. Pipeline Operations (1–3) and their Tasks for P = 15, Q = 15, r = 3 and s = 5

Operation    Task    Processor pairs (p, q)                                                                                             Communication cost
1            T0      (3,1), (8,4), (13,7), (3,10), (8,13)                                                                               1
             T1      (1,0), (6,3), (11,6), (1,9), (6,12), (3,2), (8,5), (13,8), (3,11), (8,14)                                          2
             T2      (0,0), (5,3), (10,6), (0,9), (5,12), (2,1), (7,4), (12,7), (2,10), (7,13), (4,2), (9,5), (14,8), (4,11), (9,14)    3
2            T0      (8,1), (13,4), (3,7), (8,10), (13,13)                                                                              1
             T1      (6,0), (11,3), (1,6), (6,9), (11,12), (8,2), (13,5), (3,8), (8,11), (13,14)                                        2
             T2      (5,0), (10,3), (0,6), (5,9), (10,12), (7,1), (12,4), (2,7), (7,10), (12,13), (9,2), (14,5), (4,8), (9,11), (14,14) 3
3            T0      (13,1), (3,4), (8,7), (13,10), (3,13)                                                                              1
             T1      (11,0), (1,3), (6,6), (11,9), (1,12), (13,2), (3,5), (8,8), (13,11), (3,14)                                        2
             T2      (10,0), (0,3), (5,6), (10,9), (0,12), (12,1), (2,4), (7,7), (12,10), (2,13), (14,2), (4,5), (9,8), (14,11), (4,14) 3

Processor pairs that belong to classes b(13) and b(4) are mutually exclusive, that is, they cannot be included in the same pipeline operation, because their communication cost and destination indices are the same. Then, T1 is assigned pairs from classes b(3) and b(14), which cost 2 time units. Finally, T2 is assigned communication pairs of cost 3 from classes b(0), b(1), and b(2). Note that classes b(0)–b(2) are not mutually exclusive, although they have the same communication cost, because their destination processor indices differ. The same holds for classes b(3) and b(14). Similarly, we create three pipeline operations for the communication between processor pairs that belong to classes b(0)–b(4) and b(14). The total communication cost of these operations is 9 time units. Finally, we have to schedule communication for the processor pairs of class b(13) that were excluded from the previous pipeline operations. Three pipeline operations that cost 1 time unit each are required (see Table XV). The total communication cost of the pipeline scheme is 3 × 3 + 3 × 1 = 12 time units. Our theoretical gain is (26 − 12)/26 ≈ 54%.

In the final example, we have P = Q = 16, r = 7 and s = 11. The communication pattern is all-to-all. The implementation of Desprez et al.'s idea would generate 16 classes. Table XVI shows the transfer costs for each class.


Table XV. Pipeline Operations (4–6) and their Tasks for P = 15, Q = 15, r = 3 and s = 5

Operation    Task    Processor pairs (p, q)                            Communication cost
4            T0      (1,1), (6,4), (11,7), (1,10), (6,13)              1
5            T0      (6,1), (11,4), (1,7), (6,10), (11,13)             1
6            T0      (11,1), (1,4), (6,7), (11,10), (1,13)             1

Table XVI. Communication Costs of Classes for P = Q = 16, r = 7, s = 11

Class    Communication cost        Class     Communication cost
b(0)     7                         b(8)      3
b(1)     7                         b(9)      2
b(2)     7                         b(10)     2
b(3)     7                         b(11)     2
b(4)     7                         b(12)     3
b(5)     6                         b(13)     4
b(6)     5                         b(14)     5
b(7)     4                         b(15)     6

Actually, the bipartite graphs based solution can perform this redistribution by sending all the messages of one class at each communication step. The total cost of communication is 77 time units.(1) The pipeline scheme can offer a solution of only 35 time units, simply by pipelining each "expensive" class of 7 time units with two other classes of lower cost. This would lead to five pipeline operations of 7 time units each. However, this solution may be unrealistic for some network topologies, and an "intermediate" solution may need to be offered. This is a big advantage of the pipeline strategy: it can offer a redistribution solution that respects the interconnection network. In this example, we pipeline each of the "expensive" classes (of cost 7) with a "cheap" class (of cost 6, 5, or 4) and send the remaining classes separately (actually without pipelining). This offers an intermediate solution of 65 time units in total, which means an improvement of ≈16% compared to the 77-time-unit solution offered by the bipartite graphs based scheme (see Table XVII). Of course, many other solutions can be offered, depending on the interconnection network.

Table XVII. Pipeline Operations and Costs for P = Q = 16, r = 7, s = 11

Pipeline operation    Classes participating    Cost
1                     b(0), b(8)               7
2                     b(1), b(9)               7
3                     b(2), b(10)              7
4                     b(3), b(11)              7
5                     b(4), b(12)              7
6                     b(5)                     6
7                     b(6)                     5
8                     b(7)                     4
9                     b(13)                    4
10                    b(14)                    5
11                    b(15)                    6

6. EXPERIMENTAL RESULTS

6.1. Introduction

The experiments presented in this section were conducted on a simulated IBM SP2 parallel architecture. The interconnection is performed via bidirectional links. Each link has a 40 MB/sec bandwidth. Considering that each node sends a number of messages at each communication step, the bandwidth available to a node would be:(21)

    bandwidth_p = 40 × (ℓ / h)                                             (18)

where 40 (MB/sec) is the link bandwidth, ℓ is the number of links to a node p, and h is the average number of intermediate hops required for the communication between two processors. If we assume that there is only one link for each node and since h depends on the logarithmic value of P, Eq. (18) becomes:

    bandwidth_p = 40 × 1/log2(P)                                           (19)

To run the simulations, we need to convert the theoretical time units, which we used to evaluate the performance of our algorithm in the previous sections, into real time. If we multiply the number of solutions that satisfy Eq. (7) by the assumed vector size, we get the message size that is transferred between two processors at a step. For example, if Eq. (7) has five solutions for a given pair of processors (p, q) and we assume a vector size of 1 MB, then processor p will send a message of 5 MB to processor q.
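Under these assumptions (one link per node, h growing like log2 P), the per-node bandwidth and the resulting transfer-time estimate can be written as below; the 40 MB/sec figure comes from the text, while the function names are ours.

```python
from math import log2

LINK_BANDWIDTH = 40.0                      # MB/sec per bidirectional link

def node_bandwidth(P, links=1):
    """Eq. (19): effective per-node bandwidth when h ~ log2(P)."""
    return LINK_BANDWIDTH * links / log2(P)

def transfer_time(num_solutions, vector_size_mb, P):
    """Message size = (#solutions of Eq. (7)) x vector size; time = size / bandwidth."""
    return num_solutions * vector_size_mb / node_bandwidth(P)

print(node_bandwidth(16))                  # 10 MB/sec, the value used in Section 6.1
print(transfer_time(5, 1.0, 16))           # a 5 MB message then needs 0.5 sec
```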


Since we know that the bandwidth is 40 MB/sec, we can estimate the time required for the transmission to be accomplished.

From Eq. (19) it is obvious that as the number of nodes increases, the available bandwidth is reduced. In the previous section, we mentioned that the pipeline scheme can offer solutions which may be unrealistic for certain networks. For example, in a network of P = Q = 16 processors the bandwidth available to a node is 40 × 1/log2(P) = 10 MB/sec. If the assumed vector size is 1 MB and the redistribution equation has five solutions for the pairs (p, q0), (p, q1), and (p, q2), then only two of the three processor pairs can be included in a pipeline operation, resulting in a non-optimal solution. Such situations appear quite often, therefore we cannot obtain the theoretical improvement of the pipeline scheme. This is demonstrated in the results section.

6.2. Steps for Running the Simulations in Message Passing Interface (MPI)

To show that the application of the pipeline based scheme reduces the communication cost, we run simulations on three redistribution cases. The execution of the simulations requires the following steps (sketched in code after the case list below):

1. Decide about the processor pairs that form each pipeline task, following the steps illustrated in Section 4.1.
2. Relate each communicating processor with an ID number, id.
3. Create each group of communicating processor pairs (pipeline tasks) by explicitly identifying the processor pair ids that will be included in the task.
4. Create a group of pipeline tasks (pipeline operation).
5. Synchronize the execution of all tasks of a pipeline operation.
6. Pack data into buffers for sending.
7. Execute the pipelines.
8. Unpack data from a buffer on the receiving side.

The simulator developed contains message passing routines similar to those found in IBM SP2 MPL (Message Passing Library),(21) which provides all the necessary programming interface for creating tasks and passing messages between the system processors. We considered three redistribution cases:

Case 1: Block-cyclic redistribution from P = 40 to Q = 60 with expansion factor K = 30.
Case 2: Block-cyclic(3) to Block-cyclic(5), P = Q = 15.
Case 3: Block-cyclic(7) to Block-cyclic(11), P = Q = 16.(1)
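Steps 4–8 above could be expressed with a standard MPI binding. The sketch below is a hypothetical mpi4py rendering, not taken from the paper's SP2 MPL-based simulator; it assumes a pipeline operation is a list of (cost, task) entries produced as in Section 4.1 and uses dummy numeric payloads.

```python
from mpi4py import MPI
import numpy as np

def run_pipeline_operation(comm, operation, vector_size):
    """Post every send/receive of one pipeline operation, wait for all of them,
    then synchronize so that the next operation starts afterwards (step 5)."""
    rank = comm.Get_rank()
    requests, buffers = [], []
    for cost, task in operation:                  # one task per distinct cost
        n = cost * vector_size                    # message length ~ #solutions of Eq. (7)
        for p, q in task:
            if rank == p:                         # pack and send (steps 6-7)
                buf = np.ones(n)
                buffers.append(buf)
                requests.append(comm.Isend(buf, dest=q, tag=cost))
            if rank == q:                         # receive into a buffer (step 8)
                buf = np.empty(n)
                buffers.append(buf)               # keep buffers alive until Waitall
                requests.append(comm.Irecv(buf, source=p, tag=cost))
    MPI.Request.Waitall(requests)                 # tasks finish at different times
    comm.Barrier()

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_size() >= 2:
        # A one-task dummy operation: rank 0 sends a single message to rank 1.
        run_pipeline_operation(comm, [(1, [(0, 1)])], vector_size=1024)
```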


The simulation results given in this section compare message transfer times only: they do not take into account startup and message preparation costs.

6.3. Case 1

The first redistribution case compares the pipeline strategy with the circulant matrix formalism scheme. Figure 3 shows the experimental results. The expansion factor was set to 30. The vector size varied from 32 KB to 5 MB. The circulant matrix formalism algorithm generates 40 communications per step, while each pipeline operation can generate 120. This offers a theoretical improvement of (120 − 40)/120 ≈ 66%. However, the bandwidth available to each node is 40/log2(60) ≈ 7 MB/sec. Thus, the redistribution can be scheduled with the maximum number of simultaneous communications only for small vector sizes. The simulation results show that the average increase of the communication speed is about 18%, because the high number of communications causes network bottlenecks, especially for big vector sizes.

6.4. Case 2

The second redistribution case is the second example of Section 5, that is, redistributing from cyclic(3) to cyclic(5) on a grid of 15 processors. The vector size varied from 32 KB to 1.5 MB, while the available bandwidth for each node is 40/log2(15) ≈ 10.5 MB/sec. The theoretical improvement of the pipeline strategy is 54%, as mentioned in Section 5.

Fig. 3. Simulation results for R(P, K, Q), P = 40, Q = 60, K = 30 (redistribution time in msec versus vector size in MB, for the matrix transformations and pipeline schemes).


Fig. 4. Simulation results for Block-cyclic(3) to Block-cyclic(5), P = Q = 15 (redistribution time in msec versus vector size in MB, for the bipartite graphs based and pipeline schemes).

The pipeline operations generate a maximum of 30 simultaneous communications, therefore network bottlenecks occur. Our experimental results show that we only obtain an average improvement of 14% using the pipeline strategy (see Fig. 4).

6.5. Case 3

The last experiment corresponds to the last example of Section 5, that is, redistributing from cyclic(7) to cyclic(11) on a grid of 16 processors. The pipeline strategy could theoretically complete this redistribution in five pipeline operations. Each pipeline would generate 48 communications and the total cost would be 35 time units. However, this solution can lead to network congestion, because the bandwidth available is 40/log2(16) = 10 MB/sec. Therefore, we decided to run our simulations using a scheme of 11 pipeline operations, each creating at most 32 communications. Again, the simulation results show that we did not obtain the expected improvement of 16% (see Section 5). Our gain is ≈12%, because the number of communications created is high enough to cause bottlenecks (see Fig. 5). The experimental results show clearly that, due to network bottlenecks, we do not obtain the expected theoretical improvement as far as the communication time is concerned. However, our gain in all cases is more than 10%, which is an important improvement.


Fig. 5. Simulation results for Block-cyclic(7) to Block-cyclic(11), P = Q = 16 (redistribution time in msec versus vector size in MB, for the bipartite graphs based and pipeline schemes).

7. CONCLUSIONS

In this paper, we used a pipeline based technique to solve a problem that is very important for the efficiency of parallel algorithms, that is,

redistributing data from P to Q processors during runtime. Our work focused on making a more efficient use of the idea of classes introduced by Desprez et al., by creating pipeline transmission tasks composed of elements of different classes. The tasks were organized in such a manner that transmissions from the sending processors pi to one specific destination q finished at different times; therefore, congestion was avoided. The number of classes created, the number of processor pairs of each class, and the classes taking part in each pipeline task were explicitly defined. Our experimental results show that the pipeline technique can significantly reduce the total communication cost compared to other known strategies such as the generalized circulant matrix formalism scheme and the bipartite graphs based algorithm.

REFERENCES

1. F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, and Y. Robert, Scheduling Block-Cyclic Array Redistribution, IEEE Transactions on Parallel and Distributed Systems, 9(2):192–205 (February 1998).
2. S. D. Kaushik, C. H. Huang, R. W. Johnson, and P. Sadayappan, An Approach to Communication-Efficient Data Redistribution, in Proceedings of the 8th ACM International Conference on Supercomputing, Manchester, England (July 1994).
3. Neungsoo Park, V. K. Prasanna, and Cauligi S. Raghavendra, Efficient Algorithms for Block-Cyclic Array Redistribution Between Processor Sets, IEEE Transactions on Parallel and Distributed Systems, 10(12):1217–1240 (December 1999).
4. L. Prylli and B. Tourancheau, Fast Runtime Block Cyclic Data Redistribution on Multiprocessors, Journal of Parallel and Distributed Computing, 45:63–72 (August 1997).
5. S. Ramaswamy and P. Banerjee, Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers, in Proceedings of the Fifth Symposium on Frontiers of Massively Parallel Computation, pp. 342–349 (February 1995).


6. L. Wang, J. M. Stichnoth, and S. Chatterjee, Runtime Performance of Parallel Array Assignment: An Empirical Study, in Proceedings of the 1996 ACM/IEEE Supercomputing Conference, http://www.supercomp.org/sc96/proceedings (1996).
7. B. Chapman, P. Mehrotra, H. Moritsch, and H. Zima, Dynamic Data Distribution in Vienna Fortran, in Proceedings of Supercomputing '93, pp. 284–293 (November 1993).
8. E. M. Miller, Beginner's Guide to HPF, Joint Institute for Computational Science, http://www-jics.cs.utk.edu/HPF/HPFguide.html (August 1998).
9. Y. W. Lim, N. Park, and V. K. Prasanna, Efficient Algorithms for Multi-dimensional Block-Cyclic Redistribution of Arrays, in Proceedings of the International Conference on Parallel Processing, pp. 234–241 (August 1997).
10. S. D. Kaushik, C. H. Huang, J. Ramanujam, and P. Sadayappan, Multi-Phase Redistribution: A Communication-Efficient Approach to Array Redistribution, Technical Report OSU-CISRC-9/94-52, Ohio State University (1994).
11. K. Kennedy, N. Nedeljkovic, and A. Sethi, Efficient Address Generation for Block-Cyclic Distributions, in Proceedings of the 1995 ACM/IEEE Supercomputing Conference, http://www.supercomp.org/sc95/proceedings (July 1995).
12. R. Thakur, A. Choudhary, and G. Fox, Runtime Array Redistribution in HPF Programs, SHPCC '94, Northeast Parallel Architectures Center (1994).
13. Y.-C. Chung, C.-H. Hsu, and S.-W. Bai, A Basic-Cycle Calculation Technique for Efficient Dynamic Data Redistribution, IEEE Transactions on Parallel and Distributed Systems, 9(4):359–377 (April 1998).
14. Y. W. Lim, P. B. Bhat, and V. K. Prasanna, Efficient Algorithms for Block-Cyclic Redistribution of Arrays, Algorithmica, 24:298–330 (1998).
15. E. T. Kalns and L. M. Ni, Processor Mapping Techniques Toward Efficient Data Redistribution, IEEE Transactions on Parallel and Distributed Systems, 6(12):1234–1247 (December 1995).
16. R. Thakur, A. Choudhary, and J. Ramanujam, Efficient Algorithms for Array Redistribution, IEEE Transactions on Parallel and Distributed Systems, 7(6):587–594 (June 1996).
17. D. W. Walker and S. W. Otto, Redistribution of Block-Cyclic Data Distributions Using MPI, Concurrency: Practice and Experience, 8(9):707–728 (1996).
18. F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, and Y. Robert, More on Scheduling Block-Cyclic Array Redistribution, in Proceedings of the 4th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, Lecture Notes in Computer Science, Vol. 1511, Pittsburgh, PA, Springer-Verlag, pp. 275–287 (1998).
19. A. Petitet, Algorithmic Redistribution Methods for Block Cyclic Decompositions, Ph.D. Thesis, University of Tennessee, Knoxville (1996).
20. A. Petitet and J. Dongarra, Algorithmic Redistribution Methods for Block Cyclic Decompositions, IEEE Transactions on Parallel and Distributed Systems, 10(12):1201–1216 (December 1999).
21. http://www.research.ibm.com/journal/sj/342/agerwala.html.
22. K. Hwang, Z. Xu, and M. Arakawa, Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing, IEEE Transactions on Parallel and Distributed Systems, 7(5):522–536 (May 1996).
