Exploiting redundancy to speed up parallel systems


Exploiting Redundancy to Speed Up Parallel Systems

I-Ling Yen, Michigan State University

Ernst L. Leiss and Farokh B. Bastani, University of Houston

Repetitive fault tolerance takes advantage of redundant processors to offer peak performance during normal execution, and graceful performance degradation when processors fail. As long as one processor is working, the computation can continue.

An earlier version of this article appeared as "A Repetitive Fault Tolerance Model for Parallel Programs," on pp. 447-455 of the Proceedings of the 26th Hawaii International Conference on System Sciences, published in 1993 by IEEE Computer Society Press, Los Alamitos, Calif.

August 1993

Many computing systems require both high performance and high reliability. These range from safety-critical systems, such as control systems for aircraft and nuclear power plants, to long-lived, unmaintainable systems, such as unmanned space vehicles. These systems generally include redundant processors to improve their ability to tolerate faults. When the number of redundant processors is small compared to the total number, then the number of faults the system can tolerate is also small. Increasing that ratio increases fault tolerance, but it also reduces the number of working processors, which severely degrades performance. In addition, fault-tolerance techniques that diminish system performance can actually reduce system reliability.1

Inherent fault tolerance takes advantage of redundant processors by using them for parallel computation during fault-free periods.2-4 When failures do occur, the system is allowed to degrade gracefully. This approach embeds fault tolerance in the algorithm, eliminating overhead from checkpointing, failure detection, and redundant computations. However, it is difficult to develop a general method to obtain inherently fault-tolerant programs. Also, inherent fault tolerance seems more suited to goal-oriented computations such as sorting and process-control applications than to dataflow computations such as matrix multiplication.

Another approach - algorithm-based fault tolerance - is suitable for dataflow computations, although it is sensitive to the application area.5 It has been used successfully to develop fault-tolerant algorithms for matrix multiplication and LU decomposition, fast Fourier transform,6 and sorting.



Formal notation for repetitive fault tolerance

A statement block S consists of N parallel threads, T0 through TN-1, each assigned to one processor; thread Tk consists of M statements sk,1 through sk,M (some of which may be null statements); likewise, we say that each statement block has M stages; for example, statements s0,2 and s1,2 are in stage 2. The threads, their statements, and the processors assigned to them can be laid out as follows:

Thread      Statements                               Processor
T0:         s0,1      s0,2      ...      s0,M        PE0
T1:         s1,1      s1,2      ...      s1,M        PE1
...         ...       ...       ...      ...         ...
TN-1:       sN-1,1    sN-1,2    ...      sN-1,M      PEN-1

Let's use a simple matrix summation to illustrate the notation. Given two N × M matrices A and B, we can divide the program into N threads, each of which computes the sum of a column of elements of the two matrices:

Thread Tk:
    For j := 1 To M Do
        Ck,j := Ak,j + Bk,j;
    EndFor;
End Thread;

Thread Tk consists of statements sk,1 to sk,M, where sk,j is Ck,j := Ak,j + Bk,j. Also, stage j in this statement block includes the statements Ck,j := Ak,j + Bk,j for all k (0 ≤ k < N). If no processors fail, then each processor PEk sets bit Vk after executing its thread, and the block completes in a single iteration.
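To make the notation concrete, here is a small Python rendering of the matrix-summation block (our own illustrative sketch, not code from the article; matrix_sum_threads and the driver are hypothetical names): each of the N thread functions contains the M stages of one thread.

    def matrix_sum_threads(A, B, C, N, M):
        # Thread T_k consists of statements s_k,1 .. s_k,M, where s_k,j is
        # C[k][j] := A[k][j] + B[k][j]; stage j collects the j-th statement
        # of every thread.
        def thread(k):
            for j in range(M):          # the M stages of thread T_k, in order
                C[k][j] = A[k][j] + B[k][j]
        return [lambda k=k: thread(k) for k in range(N)]

    if __name__ == "__main__":
        N, M = 3, 4
        A = [[1] * M for _ in range(N)]
        B = [[2] * M for _ in range(N)]
        C = [[0] * M for _ in range(N)]
        for t in matrix_sum_threads(A, B, C, N, M):
            t()                         # with no failures, PE_k simply runs T_k
        print(C)                        # [[3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]]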

The algorithm-based approach uses a small number of redundant processors to perform redundant computations (incurring some overhead), and it allows forward recovery from faults. This approach also incurs minor overhead (in time) by checking the correctness of the result. However, algorithms designed in this model tolerate only a few faults. Extra redundancy is required to increase the number of faults tolerated. Also, as with inherent fault tolerance, there is no general method for designing algorithm-based fault-tolerant programs.

We have used the underlying principle of inherent fault tolerance - turning redundancy into computation power - to design a model of repetitive fault tolerance that is suitable for dataflow computations.7 When no processors fail, they all work in parallel to achieve performance almost equal to that of the parallel program without fault tolerance. If processors do fail, the program can still derive the correct result as long as at least one processor is working; failures only slow the computation speed. Repetitive fault tolerance also provides a systematic way to derive fault-tolerant programs.

Repetitive fault tolerance

Our repetitive fault-tolerance method uses the single-program, multiple-data model, and we can apply it to both SPMD and SIMD architectures8,9 in which a global synchronization line performs barrier synchronization. Our model is designed to tolerate processor failures in which the processor stops its computation when it fails.


It also contains a fault-tolerant shared memory that can be based on either physically shared or distributed memory systems. (We do not discuss the fault-tolerance mechanism for memory systems in this article. Mechanisms that handle memory failures, such as checksum schemes,10 can be used independently to tolerate memory failures.)

Our model views a program as a sequence of statement blocks, each of which can be a procedure or simply a block of statements. Starting with a statement block S, the goal is to construct a fault-tolerant block B whose correctness and termination are guaranteed in spite of processor failures. We can then make a program fault tolerant by constructing a fault-tolerant block for each statement block.

Figure 1 shows the flow diagram of fault-tolerant block B. When the execution flow enters B, the statement block S executes once. The statement block is a single-entry, single-exit block. It consists of parallel threads, each assigned to a processor in the system. When a processor successfully executes its thread, it raises a bit in the termination control vector V, which has a bit for each processor and is connected to a global bus. When all the bits are raised - when S has completely executed - the global bus is raised, execution terminates, execution control leaves B, and the raised bits are cleared. The performance penalty is only two time units: one to raise the bit and one to clear the vector. When the number of statements in each thread is large, this performance penalty is insignificant.

However, if a processor fails, the thread assigned to that processor remains unexecuted, and the corresponding bit in V is not raised. The global bus is not raised, indicating that more iterations are required.


The termination control vector passes control to a permutation function P, which permutes the processors; eventually, a working processor will represent the failed processor and execute its thread (see the sidebar for the formal notation). Actually, we could reduce hardware costs by using the synchronization line in SIMD and SPMD systems for both synchronization and termination control. Also, the termination control vector is itself vulnerable to failures (as is the synchronization line). Thus, we should replicate this component to make it fault tolerant; doing so will not be as costly as replicating processors. The Raise and Clear operations can be implicit: The underlying system can raise bits and clear the vector once a statement block is marked as fault tolerant.
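To make the control flow concrete, here is a small Python sketch of one fault-tolerant block (our own illustration; fault_tolerant_block and the fail-stop flags are hypothetical names, the cyclic shift is just one valid non-overlapping permutation family, and a real system would use the hardware termination control vector and global bus described above).

    def fault_tolerant_block(threads, alive):
        """Repeat statement block S until every thread has run on a working
        processor.  threads[i] is thread T_i (assumed idempotent); alive[i]
        says whether processor PE_i is still working (fail-stop model)."""
        N = len(threads)
        V = [False] * N                  # termination control vector
        k = 0                            # which permutation is currently applied
        while not all(V):                # the "global bus" rises only when all bits are set
            for i in range(N):
                rep = (i + k) % N        # iteration k: PE_rep represents PE_i
                if alive[rep]:
                    threads[i]()         # re-execution is harmless for an idempotent block
                    V[i] = True          # raise the bit for thread T_i
            k += 1
        V = [False] * N                  # clear the vector on leaving block B
        return k                         # iterations used (1 when nothing fails)

    if __name__ == "__main__":
        N = 8
        result = [0] * N
        threads = [(lambda i=i: result.__setitem__(i, i * i)) for i in range(N)]
        alive = [True] * N
        alive[2] = alive[7] = False      # PE2 and PE7 have failed
        print(fault_tolerant_block(threads, alive), result)   # 2 iterations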

Figure 1. Flow diagram for fault-tolerant block B.

Table 1. Permutations generated by the standard permutation function.

PERMUTATION    DISPLACEMENT
P1             001
P2             010
P3             100
P4             011
P5             101
P6             110
P7             111

PROCESSOR PERMUTATION
Processor PEi is initially assigned to execute thread Ti, so the original permutation P0 is the sequence (0, 1, ..., N-1). The permutation function F generates the permutations (P1, P2, ..., PN-1) based on P0. These direct the processor permutations. If permutation Pk - the kth sequence generated by F - is used to guide the processor permutation for the repeated execution of S, we say that Pk has been applied to S. Suppose Pk is (ak,0, ak,1, ..., ak,N-1) and j = ak,i; then processor PEj will represent processor PEi to execute its assigned thread Ti. However, no two sequences generated by F should have the same value at the same position. In other words, the permutations must not overlap: for all i, j (0 ≤ i, j < N, and i ≠ j) such that Pi = (ai,0, ai,1, ..., ai,N-1) and Pj = (aj,0, aj,1, ..., aj,N-1), we have ai,k ≠ aj,k for every k. Many combinations of permutations satisfy this requirement. One of these - the standard permutation function7 - generates N-1 permutations based on P0 = (0, 1, ..., N-1): ai,j = a0,j ⊕ Hi for all i, j (1 ≤ i < N and 0 ≤ j < N), where Hi is the ith displacement (Table 1 lists the displacements for N = 8) and ⊕ denotes bitwise exclusive-or.
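As a concrete illustration of the standard permutation function (again our own sketch, with hypothetical names; N must be a power of two so that the XOR stays in range), the permutations and the non-overlap check can be written as:

    def standard_permutations(N):
        # P_i maps position j to a_0,j XOR H_i, with P0 = (0, 1, ..., N-1).
        # Here we simply take H_i = i; the article orders the displacements
        # differently (see Table 1), which does not affect the property below.
        P0 = list(range(N))
        return [[P0[j] ^ i for j in range(N)] for i in range(1, N)]

    def non_overlapping(perms, N):
        # No two permutations (P0 included) may agree at any position.
        all_perms = [list(range(N))] + perms
        return all(a[k] != b[k]
                   for x, a in enumerate(all_perms)
                   for b in all_perms[x + 1:]
                   for k in range(N))

    if __name__ == "__main__":
        perms = standard_permutations(8)
        print(perms[0])                    # P1 for N = 8: [1, 0, 3, 2, 5, 4, 7, 6]
        print(non_overlapping(perms, 8))   # True: j ^ i1 != j ^ i2 whenever i1 != i2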

As an example, consider the fast Fourier transform, which can be computed in O(log M) time using O(M) processors (M = 2^m for some positive integer m):

Parallel for all i (0 ≤ i < M) Do
    j := reverse of the binary representation of i;
    x[j] := X[i];
End Parallel;

For p := 1 To m Do
    Parallel for all i (0 ≤ i < M) Do
        If the pth least significant bit of i is 0 Then
            j := 2^(p-1) + i;
            temp := Wp,i * x[j];
            x[i] := x[i] + temp;
            x[j] := x[i] - temp;
        EndIf;
    End Parallel;
EndFor;

First, an input vector X is permuted as x[i] ← X[i'], where i' is the reverse binary representation of i. This is computed in parallel. The algorithm includes log M = m phases, and it performs the following computation at the pth phase (1 ≤ p ≤ m): x[i] ← x[i] + Wp,i × x[j] and x[j] ← x[i] - Wp,i × x[j], for all i where the pth bit of i is 0, j = i + 2^(p-1), and Wp,i is the appropriate power of the 2^p-th root of unity.
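To make the indexing concrete: with M = 8, the bit-reversal step places X[1] (binary 001) in x[4] (binary 100) and X[6] (110) in x[3] (011); phase p = 1 then pairs positions (0,1), (2,3), (4,5), (6,7), phase p = 2 pairs (0,2), (1,3), (4,6), (5,7), and phase p = 3 pairs (0,4), (1,5), (2,6), (3,7).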


Let's construct a fault-tolerant block for the transformation portion of this program. We choose the statements in the For loop as one statement block, which has a simple dependency. This block is not idempotent, since vector X is defined before entering it and is referenced in the block while being overwritten there. As in the matrix multiplication, we'll replace X with paired variables X and X':

For p := 1 To m/2 Do

    Begin FTB DataDependency = Simple
        Thread i, 0 ≤ i ≤ M-1:
            If the (2p-1)th least significant bit of i is 0 Then
                j := 2^(2p-2) + i;
                temp := W2p-1,i * X[j];
                X'[i] := X[i] + temp;
                X'[j] := X[i] - temp;
            EndIf;
        End Thread;
    End FTB;

    Begin FTB DataDependency = Simple
        Thread i, 0 ≤ i ≤ M-1:
            If the (2p)th least significant bit of i is 0 Then
                j := 2^(2p-1) + i;
                temp := W2p,i * X'[j];
                X[i] := X'[i] + temp;
                X[j] := X'[i] - temp;
            EndIf;
        End Thread;
    End FTB;

EndFor;
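The following Python sketch (ours, not the article's; the twiddle-factor convention is one standard choice, since the article's Wp,i was not fully recoverable) isolates the paired-variable idea: each phase reads only one copy and writes only the other, so re-executing a thread after a partial iteration cannot corrupt its inputs.

    import cmath

    def bit_reverse(i, m):
        # reverse the m-bit binary representation of i
        r = 0
        for _ in range(m):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    def butterfly_phase(src, dst, p, M):
        # One FFT phase written idempotently: read src only, write dst only.
        for i in range(M):                       # thread i of this statement block
            if (i >> (p - 1)) & 1 == 0:          # pth least significant bit of i is 0
                j = i + 2 ** (p - 1)
                w = cmath.exp(-2j * cmath.pi * (i % 2 ** (p - 1)) / 2 ** p)
                t = w * src[j]
                dst[i] = src[i] + t
                dst[j] = src[i] - t              # position j is filled by its partner's thread

    def fft_paired(X):
        # Alternate the two copies (X and X' in the text) phase by phase.
        M = len(X)
        m = M.bit_length() - 1                   # M = 2**m
        x = [X[bit_reverse(i, m)] for i in range(M)]   # bit-reversal permutation
        xp = [0j] * M
        for p in range(1, m + 1):
            if p % 2 == 1:
                butterfly_phase(x, xp, p, M)     # odd phase: X -> X'
            else:
                butterfly_phase(xp, x, p, M)     # even phase: X' -> X
        return xp if m % 2 == 1 else x

    if __name__ == "__main__":
        print(fft_paired([1, 2, 3, 4]))          # [(10+0j), (-2+2j), (-2+0j), (-2-2j)]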

Performance

If no processors fail, our model requires only two additional time units to completely execute statement block S: one to raise bits in the termination control vector, and one to clear the vector. When S has a large number of stages (statements in each thread), the performance penalty for normal execution is insignificant, regardless of whether the block has simple or crossover dependency.

But the usefulness of a fault-tolerant mechanism also depends on performance degradation due to failures. We measure this using DR(f), the performance degradation ratio for failure ratio f: the time required to execute a fault-tolerant program in an N-processor system with fN failed processors, divided by the time required to execute it when there are no failures.

Table 2. Performance degradation due to failures (crossover dependency).

                               PROCESSORS
f          64       128      256      512      1,024    2,048
1/2        21.12    28.28    34.94    39.32    46.14
1/4        9.22     10.04    11.84    13.44    16.42    20.62
1/8        4.68     5.86     6.70     8.18     10.06    10.52
1/16       3.18     3.76     4.26     5.56     6.50     7.36
1/32       3.00     3.18     3.30     3.60     4.16     4.76
1/64       3.00     3.00     3.06     3.06     3.24     3.66
1/128               3.00     3.00     3.06     3.06     3.30
1/256                        3.00     3.00     3.00     3.12
1/512                                 3.00     3.00     3.00
1/1,024                                        3.00     3.00
1/2,048                                                 3.00

For repetitive fault-tolerant programs, DR(f) is not affected by the specific code in the program, but it can be affected by the number of processors N, the permutation function, and the type of data dependency. In the worst case, DR(f) = fN+1 for a statement block with simple dependency, and (fN+1)(fN+2)/2 for a block with crossover dependency. However, the average DR(f) is a better indicator of the actual degradation. Figure 4 shows the average DR(f) for statement blocks with simple dependency. Execution time only doubles up to a failure ratio of about 1/32 for a large number of processors (up to 1,024), and 1/16 for 64 or 128 processors. This is encouraging, since a single failure requires twice the execution time when there are no redundant processors. Statement blocks with crossover dependency have a worse performance degradation curve (see Figure 5 and Table 2).
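A rough way to see where curves like Figure 4 come from is to simulate the iteration count directly. The sketch below (our own, with hypothetical names; it is not calibrated to the article's exact timing model, so it only tracks the trend) estimates the average number of iterations a simple-dependency block needs under random fail-stop failures, using XOR displacements; the worst case of fN+1 iterations corresponds to the bound quoted above.

    import random

    def iterations_needed(N, failed):
        # Permutations P0, P1, ... are applied until every thread has been
        # executed by a working processor (simple dependency).  N must be a
        # power of two and at least one processor must survive.
        done = [False] * N
        k = 0
        while not all(done):
            for i in range(N):
                if (i ^ k) not in failed:   # PE_(i XOR k) represents PE_i in iteration k
                    done[i] = True
            k += 1
        return k

    def average_dr(N, f, trials=500):
        # Monte Carlo stand-in for the average DR(f): mean iteration count
        # over random failure patterns of size f*N.
        total = 0
        for _ in range(trials):
            failed = set(random.sample(range(N), int(f * N)))
            total += iterations_needed(N, failed)
        return total / trials

    if __name__ == "__main__":
        for f in (1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32):
            print(f, round(average_dr(64, f), 2))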

IMPROVING PERFORMANCE
Although repetitive fault tolerance by itself gives a satisfactory performance degradation ratio, we can further improve performance for systems with physically shared memory and for statement blocks with crossover dependency.

Self-reconfiguration

With physically shared memory, the cost (in time) for any processor to access any thread is the same; it costs as much for processor PEi to execute a thread Tk as for processor PEj. If there are W working processors, the most efficient method is to allow them to execute the first W threads, then the next W threads, and so on, until all threads are executed. However, since the W working processors are distributed among N total processors, we must reindex the working processors consecutively. Reindexing requires O(log N) time when N processors are available, but it is much more complex if there are arbitrary failures, making this approach infeasible. In another method, all processors check a shared variable to determine which thread to execute, but this scheme is effective only when W is small.

[Figures 6 and 7 plot the degradation ratio against log(1/f), comparing the original model with self-reconfiguration.]

Figure 6. Performance improvement using self-reconfiguration (simple dependency).

Figure 7. Performance improvement using self-reconfiguration (crossover dependency).

Our self-reconfiguration method does not explicitly reassign processors after some processors fail. Instead, a processor PEj will execute all subsequent statement blocks in place of PEi if j > i and if PEj realizes, when it "represents" PEi, that PEi did not execute its assigned thread. (We can determine whether a processor executed its thread by examining the corresponding bit in the termination control vector.) The system eventually stabilizes to a state in which the first W processors are "virtually" working processors. Thus, system performance at failure time improves significantly.

Let's assume that processors PE2 and PE7 have failed while executing a statement block with eight threads. In the first iteration, threads T2 and T7 are not executed, so the second and seventh bits of the termination control vector are not raised. In the next iteration, the standard permutation function causes PE2 to represent PE7 and PE7 to represent PE2. This permutation does not help, so execution continues for a third iteration, where PE6 represents PE2 and PE3 represents PE7. Using self-reconfiguration, PE6 replaces PE2, but PE3 does not replace PE7 (since 3 < 7). Now processors PE0 through PE5 are "virtually" working, and only two iterations are required to complete the execution of all subsequent statement blocks.

Self-reconfiguration minimizes performance degradation after the reconfiguration has stabilized. With processor failures, the system's computation time increases compared to when all the processors are working. Given N execution threads and N/2 or fewer processor failures, the execution time doubles; with more than N/2 and up to (3/4)N processor failures, it increases fourfold; and so on. Self-reconfiguration improved our model's performance (see Figures 6 and 7). We measured the performance degradation of the first statement block, where no reconfiguration has been performed, and that of the second statement block, where processors are partially reconfigured (N = 1,024).
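A minimal sketch of the rule and of the stabilized cost (our own Python illustration, assuming the reading of the rule given above; takes_over and steady_state_iterations are hypothetical names):

    import math

    def takes_over(j, i, executed):
        # A working processor PE_j that represents PE_i during a retry adopts
        # thread T_i for all subsequent statement blocks only if j > i and
        # T_i's bit in the termination control vector is still down.
        return j > i and not executed[i]

    def steady_state_iterations(N, num_failed):
        # Once reconfiguration stabilizes, the W working processors virtually
        # occupy threads 0..W-1, so a block of N threads needs about
        # ceil(N / W) sweeps: at most 2 for up to N/2 failures, at most 4 for
        # up to (3/4)N failures, and so on.
        W = N - num_failed
        return math.ceil(N / W)

    if __name__ == "__main__":
        executed = [True] * 8
        executed[2] = executed[7] = False    # PE2 and PE7 failed; T2, T7 undone
        print(takes_over(6, 2, executed))    # True: PE6 adopts T2
        print(takes_over(3, 7, executed))    # False: 3 < 7, so PE3 does not adopt T7
        print(steady_state_iterations(8, 2)) # 2 iterations per subsequent block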

Consistent repetition and iteration number estimation

For statement blocks with simple dependency, computation will be correct as long as all threads execute at least once in any order. However, for statement blocks with crossover dependency, computations performed in the rth iteration are discarded during the (r+1)th iteration: Threads executed in the rth iteration might produce incorrect results due to data dependencies on other threads that did not execute because of processor failures. As mentioned earlier, crossover dependency requires applying permutations P0 through Pr to statement block S to compute the rth iteration.

One way to improve performance for statement blocks with crossover dependency is consistent repetition: We find out how many iterations, R, are required to completely execute the first statement block, and then apply this same number of iterations to subsequent blocks until the number of failed processors changes. When it does change, we check the next statement block that completely executes to obtain a new R to apply to subsequent blocks. This method limits increases in computation time to