Experimental Assessment of Parallel Systems


This paper was accepted for presentation at the FTCS-26 and will be Copyrighted by The Institute of Electrical and Electronics Engineers, Inc. It is being distributed for early peer dissemination purposes only.

João Gabriel Silva, João Carreira, Henrique Madeira, Diamantino Costa, and Francisco Moreira
Universidade de Coimbra - Dep. Eng. Informática
Polo II - Pinhal de Marrocos, P-3030 Coimbra - PORTUGAL
{jgabriel,jcar,henrique,dino,fmoreira}@dei.uc.pt

This work was supported in part by the Commission of the European Union through the ESPRIT project 6731 - FTMPS "Fault Tolerant Massively Parallel Systems".

Abstract. In the research reported in this paper, transient faults were injected in the nodes and in the communication subsystem (by using software fault injection) of a commercial parallel machine running several real applications. The results showed that a significant percentage of faults caused the system to produce wrong results while the application seemed to terminate normally, thus demonstrating that fault tolerance techniques are required in parallel systems, not only to assure that long-running applications can terminate but also (and more importantly) that the results produced are correct. Of the techniques tested to reduce the percentage of undetected wrong results, only ABFT proved to be effective. For other simple error detection methods to be effective, they have to be designed in, and not added as an afterthought. Faults injected in the communication subsystem proved the effectiveness of end-to-end CRCs on the data movements between processors.

1. Introduction.

Parallel systems composed of thousands of complex state-of-the-art processors can be found in many supercomputing centres around the globe. Unfortunately, as the number of nodes grows, the probability of having faults in the system also increases. If each processor-memory pair has an MTBF of more than 10 years, say 100,000 hours, with 10,000 processors the overall MTBF will be only around 10 hours. Furthermore, if we remember that this "standard" MTBF is calculated taking into account only permanent failures, and we correct it to consider also transient faults (from 4 to 10 times more frequent than permanent ones [10], [23], [15]), we go down to a very low 1 to 2 hours MTBF.
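The figures above follow from a back-of-the-envelope series-model calculation. As an assumption of this sketch (not stated in the paper), node failures are taken as independent and exponentially distributed, and k denotes the transient-to-permanent fault ratio quoted in the text:

```latex
% System MTBF for N independent nodes in series:
\mathrm{MTBF}_{\mathrm{sys}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}
                             = \frac{100\,000\ \mathrm{h}}{10\,000} = 10\ \mathrm{h}

% Counting transient faults as well (k = 4 to 10 times the permanent fault rate):
\mathrm{MTBF}_{\mathrm{sys}}^{\mathrm{eff}} \approx \frac{10\ \mathrm{h}}{1+k}
                             \approx 1\ \text{to}\ 2\ \mathrm{h}, \qquad k \in [4,10]
```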

Although no reliable calculations of the MTBF of big parallel systems are known to the authors, the order of magnitude of the numbers just presented should be quite close to reality. For instance, the USA's Oak Ridge National Laboratory (ORNL) publishes on the World Wide Web the MTBFs measured for their big Intel Paragon machines. For the XP/S 150, with 1024 Intel i860 processors, the average MTBF is around 20 hours. This MTBF is calculated as (quoted from the text on the web site): "the average time the machine runs between reboots. A reboot may be performed to recover from a crash (not always a hardware failure – the machine just "locks up"). Reboots may also be done for performance reasons; for example, when the machine is running very slowly. Scheduled interrupts (such as reboots for preventative maintenance) are not included in the statistics." This means that hardware and software faults are not separated, but also that faults that do not lead to crashes or a significant slowdown are not taken into account.

The Intel Paragon architecture has no significant fault-tolerance mechanisms, just like other vendors' massively parallel architectures, such as the Cray T3D, the IBM SP2, and the Parsytec PowerXplorer that was used in our experiments. Error Detection and Correction (EDC) in main memory is the norm, and the use of CRCs and even EDC in communications is normal. But, at the processor level, only memory protection and little else is available to detect transient faults. For permanent faults, all the machines rely on rather intense periodic testing. The processor appears clearly as the "Achilles heel" of these machines, and was the main focus of our research. A very relevant question is that of knowing what level of confidence can be placed on the results produced by such machines. The investigation reported in this paper tries to give some hints for an answer.

The fault injection results we present show clearly that fault tolerance is a must in parallel systems having a large number of processors, not only to allow proper completion of long-running applications but also to obtain confidence in the results. In fact, we show that a significant percentage of the transient faults that go undetected lead to wrong application results, unnoticed by the user. That is, the application terminates and everything seems to be normal (no errors have been reported), but the results delivered to the end user are wrong.

The methodology used in this research was that of injecting transient faults in a representative parallel system running realistic applications. To the best of our knowledge, this is the first time that such experiments are reported. The system used was a Parsytec PowerPC-based machine, in two configurations: a standard one and an "enhanced" one. The enhanced version includes several concurrent error detection methods that do not require changes in the hardware. An important assumption of this study is that traditional fault tolerance techniques based on replication and voting cannot be applied for cost reasons, so that faults can only be tolerated by using some sort of checkpointing. The needed error detection mechanisms must operate concurrently with the normal system operation in order to detect transient faults, but their cost in terms of memory and performance must be kept low. We consider in this paper several methods under these constraints.

The first part of the paper contains the description of the experimental setup used for injecting processor-level faults (section 2) and the results of the experiments made with the standard machine (section 3). Then the results obtained with the "enhanced" version are presented and discussed (section 4). Section 5 briefly describes experiments made with communication-level faults. Finally, in section 6 some conclusions are drawn and some lines of future research are suggested.

2. Experimental setup.

With an ever increasing transistor budget, the designers of modern processors have implemented on-chip caches to boost performance. But that made it virtually impossible to monitor what happens inside the processor just by looking at the package pins. Pin-level fault injection tools like MESSALINE [1] or our group's RIFLE [18] were thus rendered useless. To inject faults inside such chips, power supply disturbances [19] and electromagnetic interference [14] are possible techniques, but they are unable to control the extent and location of the injected fault. Another alternative is heavy-ion radiation [13], but it still has low controllability and the injection setup is complex to build and operate. Since there is no direct way of knowing when a fault was injected, the outputs of a second "gold" processor, unaffected by the faults, have to be compared pin by pin and cycle by cycle with the injected one. Any discrepancy between the two signals indicates an error, but again internal caches can lead to a very late detection. Besides, due to internal asynchronies of modern processors, such a setup is often impossible to build.

A much better alternative is Software Implemented Fault Injection (SWIFI), also known as fault emulation [22][5][9][25][11][12][6]. It has the advantage of low complexity, low development effort, and low cost (no specific hardware is needed). In addition, SWIFI tools are more portable and do not have the problems with physical and electrical interference that are common in physical fault injection tools. The main limitation of SWIFI tools has been the limited fault model that they offer, and their substantial interference with the injected programs. Fortunately, the same processor complexity that made physical fault injection useless or difficult to apply was instrumental in making SWIFI a powerful method, because manufacturers included in modern processors sophisticated debugging features that can be used to overcome the initial limitations of SWIFI.

The tool we used in our experiments, called Xception [3], follows that path. By directly programming the debugging hardware inside the target processor, Xception can inject faults with minimum interference with the target application. Unlike previous software fault injectors, with Xception the target application is not modified, no software traps are inserted, and it is not necessary to execute the target application in a special trace mode (the application is executed at full speed). The sophisticated exception triggers available in most modern processors allow the definition of many fault triggers (events that cause the injection of the fault), including fault triggers related to the manipulation of data. Another important aspect is the fact that, since Xception operates at the exception handler level (hence its name), and not through any service provided by the operating system, the injected faults can affect any process running on the target system, including the operating system itself. Furthermore, it is also possible to inject faults in applications for which the source code is not available. The current version of Xception is targeted at the IBM/Motorola PowerPC 601 processor, and runs under the Parix operating system from Parsytec.
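As a concrete (and purely illustrative) picture of what "injecting" such a fault means in software, the fragment below flips one bit of a saved register value from inside an exception handler, which is the basic operation a SWIFI tool performs once its trigger fires. The context structure and the handler hook are hypothetical assumptions of this sketch; Xception itself programs the PowerPC 601 debug hardware directly and requires no changes to the application.

```c
/* Illustrative sketch only: emulating a transient single-bit fault in a
 * processor register from an exception handler, in the style of SWIFI
 * tools such as Xception. The saved-context layout and the entry point
 * are hypothetical, not the actual Xception or Parix interfaces. */
#include <stdint.h>

typedef struct {
    uint32_t gpr[32];   /* saved general purpose registers (assumed layout) */
} saved_context_t;

/* Called once the fault trigger (e.g. "opcode fetched at address X") fires:
 * flip one chosen bit of one chosen register and resume execution, so the
 * application keeps running at full speed with the corrupted value. */
static void inject_transient_bit_flip(saved_context_t *ctx,
                                      unsigned reg, unsigned bit)
{
    ctx->gpr[reg] ^= (uint32_t)1u << bit;   /* single bit-flip, one cycle */
}
```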

The machine used in the experiments reported in this paper is a Parsytec PowerXplorer with 4 nodes, each one composed of a PowerPC 601 processor for number crunching and a Transputer T805 for the communications, with 16 Mbytes per node.

Several kinds of events can trigger the injection of a fault: a specific instruction being fetched, a certain address being used for a load or for a store, or a certain time having elapsed from the start of the program. The faults injected can be bit flips or stuck-at faults, lasting for at least one memory cycle, and can affect one or more bits at a time. Xception is able to emulate faults in the following functional units: Data Bus, Address Bus, Floating Point Unit (FPU), Integer Unit (IU), General Purpose Registers (GPR), Branch Processing Unit (BPU), Memory Management Unit (MMU), and Main Memory. For instance, to inject a transient fault in the Integer Unit, Xception corrupts the result of the first instruction that, after the trigger, uses the Integer Unit, such as an "add" instruction. For the other functional units similar strategies are followed.

The process of defining the faults to inject can be either manual or automatic. For the experiments presented in this paper the fault sets were always defined automatically. The fault parameters were generated in such a way that all the faults were injected at a random time during the application execution, each fault caused only one bit flip affecting a random bit associated with a randomly chosen processor functional unit, and each fault lasted for only one instruction cycle. Only one fault is injected in each run of the program; after the injection of the fault and the collection of results, the whole system is rebooted for a new injection, in order to guarantee a clean starting state.

The results gathered by Xception after each fault are of two kinds: the outcome of the application and the behaviour of the error detection methods (which of them have detected errors and with what latency). The outcome of the application is divided into four categories:
- system crash: the injector had to reboot the whole system after a certain time-out, because the system stopped responding.
- detected: at least one error detection method caught an error.
- correct result: nothing was detected, and the application finished normally with correct results. The fault was somehow absorbed by the natural redundancy of the program and the computer.
- undetected wrong results: nothing was detected and the program terminated normally, but the output of the program (all programs used have their output directed to a file) was incorrect. Note that the case where the output file does not exist or is empty is considered as "detected".

The correctness of the output of the program is determined by running the program always with the same inputs, and comparing the output of each run with that of a golden run made at the beginning of the experiments.

In our experiments we used three different programs: Ising, Montecarlo and Matmult.

Ising - This program, written in C, is a simulation model of physical systems, such as alloys, magnetic materials, polymers and fluids. This particular program simulates in two dimensions the spin changes of spin-glass particles at different temperatures. Spin-glass is a metal (Ag, Cu, Au, Mg, ...) doped with a transition element (like Cr, Mn, Fe, Re, ...). Ising is a typical geometric decomposition application, where the overall problem is solved by mapping subregions of the surface among the several processors.

Montecarlo - This program, written in FORTRAN, is an implementation of the MMC algorithm for electron beam dose calculations. It simulates the transport of an electron beam of any arbitrary energy and angular distribution, incident on the x-y plane, through a 3D voxel geometry, and scores the resulting dose distribution in a 3D rectilinear volume. The algorithm is quite different from domain decomposition and can be regarded as representative of a class of applications from the field of optics. It is a commercial program used for certain medical applications of electron beam radiation.

Matmult - This program just multiplies two big matrices. We have chosen it in order to make a simple assessment of the impact of using Algorithm Based Fault Tolerance (ABFT) on the overall behaviour of a parallel machine.

3. Processor level injection - standard machine.

The objective of the experiments reported in this section was twofold. First, we wanted to have an idea of how vulnerable parallel machines are to processor faults. We have seen before that present parallel machines have no specific mechanism to protect them against that kind of fault. The machine we used for our experiments is quite representative in that respect. It uses the PowerPC 601 processor at 80 MHz (12.5 ns period), whose complexity is similar to the Alpha used by Cray, the i860 used by Intel and the Power2 processor used by IBM. The operating system is also comparable in that it has standard memory protection, with a kernel running in supervisor mode and user processes in user mode. It deviates slightly from the usual Unix model only in the fact that in supervisor mode there is no memory protection. Second, we wanted to have a baseline against which to compare the results of the experiments described in the next section, where we use an "enhanced" version of the machine.

The global results are presented in Table 1. The number of faults injected in each program is not very high since, due to the complexity of the system and the fact that a reboot has to be made after each injected fault, the injection process is very slow. The experiments presented took many months to be validated, and close to two weeks for the final run. The many intermediate results obtained during the months taken by the validation of the injection tool and of the experimental setup itself have shown that the results presented here do not vary significantly when the number of faults injected is increased.

System outcomes            Ising                 Montecarlo            Matmult
                           No. Faults   %        No. Faults   %        No. Faults   %
correct results              1229     60.5%        1015     66.1%        1425     69.2%
undetected wrong result       296     14.6%         230     15.0%         201      9.8%
crash                          17      0.9%          17      1.1%          61      3.0%
detected                      489     24.0%         273     17.8%         371     18.0%
total injected faults        2031      100%        1535      100%        2058      100%

Table 1. Processor faults injected in a Parsytec PowerXplorer.

Many faults that are undetected just do not have any impact on the application results (outcome: correct results). They are, somehow, absorbed by the inherent redundancy of the machine and the program. For instance, consider a conditional jump that is meant to jump if a variable, whose current value is 4, is smaller than another variable, whose current value is 7. If a fault changes one bit and the 7 turns into a 15, since 4 is still smaller than 15, the jump will go in the correct direction and that error has no consequence. But there are many other faults that also go undetected, but this time with the worst type of outcome (undetected wrong results): the application just produces wrong results and there is nothing to warn the user of that fact. We think that our experiments clearly show that undetected wrong results can happen. What is difficult to establish is how often undetected wrong results really happen, since the failure rates of production machines are unknown when transient faults are taken into account. For the present parallel machines, we think that our experiments clearly establish the need to investigate how often undetected wrong results really happen.

Some more data on the experiments just discussed, namely on the last type of outcome (detected errors), is presented in Table 2. It is important to know that as soon as one error detection mechanism detects an error, the system is immediately stopped, so that no other detection method has a chance to detect anything. The results in Table 2 show that memory protection is largely the most effective error detection method. The illegal instruction detection is also rather effective, but the other mechanisms have almost no impact. For instance, OS level assertions are of little help, at least in the Parix operating system (version 1.3).

Detection method           Ising                 Montecarlo            Matmult
                           No. Faults  Coverage  No. Faults  Coverage  No. Faults  Coverage
memory protection             405      19.9%        208      13.6%        275      13.3%
illegal instruction            59       2.9%         57       3.7%         67       3.3%
O.S. assertions                12       0.6%          8       0.5%         27       1.3%
other exceptions               13       0.6%          0       0%            2       0.1%
Total detected errors         489      24.0%        273      17.8%        371      18.0%

Table 2. Breakdown of detected errors according to the detection method.

4. Processor level injection - enhanced machine.

4.1. Enhancements description.

Having seen that the number of undetected wrong results was significant, we decided to investigate how far we were able to go in enhancing a standard machine in terms of error detection. Several restrictions had to be taken into account: the hardware could not be changed, only small changes to the Operating System were possible, and the performance overhead for the applications should not exceed 10 to 15%. Those were the requirements for something sellable as a commercial product in the present market, as established in the FTMPS ESPRIT project. We developed two kinds of error detection methods: invisible and visible to the programmer. The invisible ones were: enhanced memory protection, several types of new exception handlers (for errors detected by the processor), assigned signature control-flow checking, and error capturing instructions.

Since memory protection and exception processing were already included in the standard machine, but in an incomplete way, we changed the Operating System to use the capabilities of the built-in memory management unit and processor exception handling as thoroughly as possible. We managed to do that for the exception handling, but were not able to fully use the possibilities offered by the processor, due to limitations imposed by the Operating System design.

Error capturing instructions (ECI) [19] are trap instructions inserted throughout the memory in such a way that they are not normally executed. If, because of an erroneous change of control flow, one of them happens to be executed, a trap handling routine is activated and the error is caught. Our implementation was rather limited, since it essentially inserted some ECIs after each unconditional jump. We did not dedicate many resources to this technique since other experiments [17] had shown that it is not a very powerful method. Still, since it was a software-only technique, we used it.

Assigned Signature Monitoring (ASM) is a software-based control flow checking technique. The program code is divided into several blocks, and each block is assigned a block ID (signature) that does not depend on the block instructions. That ID is stored in a register at the start of the block, and its value is verified at the exit of the block. A mismatch can only occur in case of an erroneous change of flow. The major drawback of this method lies in the performance overhead due to the execution of the extra instructions related to storing and testing the signature. The type of ASM that we used [7] has a nice feature: it enables us to choose the size of the blocks to use, and thus vary the overhead (and the coverage). For the experiments reported here we used blocks of 16 machine instructions on average, which leads to an overhead of around 3% in memory and 6% in performance.
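The following fragment sketches the ASM idea at the source level. It is only an illustration of the general technique: the implementation used in [7] works on machine-code basic blocks of about 16 instructions, and the macros and signature values below are hypothetical.

```c
/* Illustrative sketch of assigned signature monitoring (ASM).
 * Each block gets a compile-time ID; the ID is stored on entry and
 * checked on exit. A mismatch means control flow reached the exit of a
 * block it never properly entered. Helper names are hypothetical. */
#include <stdlib.h>

static volatile unsigned current_sig;          /* would be a reserved register */

static void asm_error(void) { abort(); }       /* stop the node: count as "detected" */

#define BLOCK_ENTER(id)  (current_sig = (id))
#define BLOCK_EXIT(id)   do { if (current_sig != (id)) asm_error(); } while (0)

int example(int x)
{
    BLOCK_ENTER(0x1001);        /* block 1: assigned signature 0x1001 */
    x = x * 2 + 1;
    BLOCK_EXIT(0x1001);

    BLOCK_ENTER(0x1002);        /* block 2: assigned signature 0x1002 */
    x = x - 3;
    BLOCK_EXIT(0x1002);
    return x;
}
```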

The methods visible to the programmer consisted of assertions and algorithm-based fault tolerance (ABFT). In the Matmult program the ABFT algorithm described in [16] was used. In the Ising test application there was no place for ABFT, since no arithmetic relationship could be established between input and output data: the future spin of each lattice site depends, besides the neighbour states, on a random value, so spin transitions may occur which do not depend on the initial state. Through the identification of critical data, i.e., data on which the main control-flow decisions depend, some "invariants" were identified. The assertions implemented in this way are described in Table 3. The coverage figures in Table 3 relate to the experiments reported in Table 6.

Assertion Id.  Description                                          Execution frequency         Performance overhead   Coverage (% of detected)
BOLTZMAN       Checks the integrity of Boltzman values (spin        Each simulation iteration.  0.1%                   0.52%
               change probabilities for a known temperature).
GRID DATA      Checks the integrity of spin values held in the      Each simulation iteration.  8.3%                   1.26%
               sub-grid; must be either 1 or -1.
RAND           Checks if the generated random numbers are in        Every spin calculation.     14.4%                  1.43%
               the proper interval.

Table 3. Ising assertions.

Only two assertions were included in the Montecarlo program (Table 4). The complexity of the FORTRAN source code, associated with the highly specialised nature of the algorithm (electron beam dose calculation), would make any further efforts a very time-consuming task. The performance degradation caused by these two assertions is just 0.7%. The main assertion (electron energy) is based on the behaviour of the electron energy during the simulation: since the primary electron loses energy until it gets absorbed, energy(step) > energy(step+1).

Assertion Id.      Description                                                    Execution frequency
ELECTRON ENERGY    Checks the decreasing behaviour of the electron energy         In each simulation step.
                   during the absorber phase of the simulation.
OUTPUT FILE SIZE   Verifies if the size of the output file produced matches       Just once.
                   the expected (correct) size.

Table 4. Montecarlo assertions.
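To make concrete what such application-level assertions look like, the fragment below sketches checks in the spirit of the GRID DATA and RAND assertions of Table 3 and the ELECTRON ENERGY assertion of Table 4. It is not the project's code; names, data layout and intervals are hypothetical.

```c
/* Illustrative executable assertions in the spirit of Tables 3 and 4
 * (not the original Ising/Montecarlo code; names and layout are assumed). */
#include <stdlib.h>

static void assertion_failed(const char *which)
{
    /* In the experiments a triggered assertion counts as "detected"
     * and the node is stopped to avoid error propagation. */
    (void)which;
    abort();
}

/* GRID DATA: every spin in the local sub-grid must be either +1 or -1. */
static void check_grid(const int *spin, int n)
{
    for (int i = 0; i < n; i++)
        if (spin[i] != 1 && spin[i] != -1)
            assertion_failed("GRID DATA");
}

/* RAND: generated random numbers must fall in the expected interval. */
static void check_rand(double r)
{
    if (r < 0.0 || r >= 1.0)
        assertion_failed("RAND");
}

/* ELECTRON ENERGY: the primary electron only loses energy, so
 * energy(step) > energy(step+1) must hold in every simulation step. */
static void check_energy(double prev, double curr)
{
    if (!(curr < prev))
        assertion_failed("ELECTRON ENERGY");
}
```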

Before concluding this section, it is interesting to see what the cost of all the visible and invisible error detection methods was (Table 5). As we can see, the performance overhead is lower than 10% for the Montecarlo and Matmult programs and considerably higher (46.2%) for Ising. This high overhead is mainly due to the application assertions, but we decided to maintain all the assertions in order to evaluate them all.

Program       Size (Kbytes)                      Time (sec)
              before    after    cost (%)        before    after    cost (%)
Ising           217      244      12.4%             13       19      46.2%
Montecarlo     1806     1820       0.8%             63       68       7.9%
Matmult         192      237      23.4%             11       12       9.1%

Table 5. Overhead of the whole set of enhancements.

4.2. Experiment results.

What we want to measure with these experiments is essentially the increase in error detection capability (and the consequent decrease in the production of undetected wrong results) that can be achieved with error detection techniques added to an already existing machine. We will present the results separately for each program, comparing the "before enhancements" with the "after enhancements" results. We start with Ising (see Table 6 and Table 7).

Ising outcomes              Before                 After
                            # Faults     %         # Faults     %
correct result                1229     60.5%         1053     60.3%
undetected wrong result        296     14.6%           81      4.7%
crash                           17      0.9%           12      0.7%
detected                       489     24.0%          599     34.3%
Injected faults               2031      100%         1745      100%

Table 6. Ising before and after the enhanced error detection.

Ising detection methods         Before                After
                                No.    Coverage       No.    Coverage    Avg. latency    Std. deviation
memory protection               405     19.9%         295     16.9%       14697 nsec      69434 nsec
signature monitoring             —        —           177     10.1%        2146 nsec        296 nsec
illegal instruction              59      2.9%          64      3.7%        1724 nsec       1459 nsec
program assertions               —        —            56      3.2%       15959 nsec      62148 nsec
O.S. assertions                  12      0.6%           5      0.3%           —               —
Error Capturing Instructions     —        —             2      0.1%        1360 nsec         39 nsec
other exceptions                 13      0.6%           0      0%             —               —
Total detected errors           489     24.0%         599     34.3%        9611 nsec      52828 nsec

Table 7. Breakdown of detected errors according to the detection method for Ising.

From Table 6 we can see there was some improvement, since there was a visible decrease in the percentage of undetected wrong results, but around 5% wrong results is still far too high. We can confidently say that the error detection methods used are insufficient. From Table 7 we can see which error detection methods were most effective. Signature monitoring, illegal instructions, and program assertions did show some effectiveness, while the other enhancements were rather ineffective. The average error detection latency is low. This is very important: with low latencies we can assume that the detected faults have been detected in time to avoid significant error propagation (by stopping the system), which means that these faults could be tolerated by using relatively simple recovery mechanisms. Detailed information on the latency of behaviour-based error detection mechanisms similar to some of those evaluated in Table 7 can be found in [17]. The latency observed for the program assertions is also low, which is not usual for this kind of error detection technique, as program assertions are basically checks performed on the produced data/results at the end of the program. The explanation lies in the fact that the application assertions in the Ising program are executed in each iteration and the iteration rate is fairly high.

We now look at the second program, Montecarlo (Table 8 and Table 9). This program has a quite different behaviour from Ising. The undetected wrong results remained at essentially the same level, and the crashes even had a slight increase. The additionally detected errors came essentially from the group that did not interfere with the program's outcome.

Montecarlo outcomes         Before                 After
                            # Faults     %         # Faults     %
correct result                1015     66.1%          669     58.9%
undetected wrong result        230     15.0%          163     14.4%
crash                           17      1.1%           35      3.1%
detected                       273     17.8%          268     23.6%
Injected faults               1535      100%         1135      100%

Table 8. Montecarlo before and after the enhanced error detection.

Montecarlo detection methods    Before                After
                                No.    Coverage       No.    Coverage    Avg. latency     Std. deviation
memory protection               208     13.6%         131     11.5%        9559 nsec        42365 nsec
signature monitoring             —        —            88      7.8%        2085 nsec          447 nsec
illegal instruction              57      3.7%          40      3.5%        1485 nsec          319 nsec
program assertions               —        —             3      0.3%     2331733 µsec      3297499 µsec
O.S. assertions                   8      0.5%           6      0.5%           —                 —
ECI (error capturing inst.)      —        —             0      0%             —                 —
other exceptions                  0      0%             0      0%             —                 —
Total detected errors           273     17.8%         268     23.6%    26704950 nsec    431332964 nsec

Table 9. Breakdown of detected errors according to the detection method for Montecarlo.

For this program, the enhancements were particularly ineffective. This can be explained (Table 9) by noticing that the only new error detection method that showed some effectiveness was the control-flow monitoring technique, and this technique detects almost no pure data errors [21][17] (by pure data errors we mean those errors that affect only operands or results of data calculations, as opposed to address calculations). Since we know that many undetected wrong results are pure data errors, signature monitoring did not help much. Assertions could help, as they are geared to data errors, but since the assertions we added to the Montecarlo code were so weak, few data errors were caught.

Finally, let us look at the third program: Matmult (matrix multiplication). This is a special program, since the enhanced version uses ABFT, and there is a dramatic difference between the two versions of the Matmult program due to ABFT (see Tables 10 and 11).

Matmult outcomes            Before                 After
                            # Faults     %         # Faults     %
correct result                1425     69.2%          977     62.1%
undetected wrong result        201      9.8%            0      0%
crash                           61      3.0%           15      1.0%
detected                       371     18.0%          581     36.9%
Injected faults               2058      100%         1573      100%

Table 10. Matmult before and after the enhanced error detection.

Matmult detection methods       Before                After
                                No.    Coverage       No.    Coverage    Avg. latency     Std. deviation
memory protection               275     13.3%         241     15.3%        5064 nsec        38430 nsec
signature monitoring             —        —            49      3.1%        2027 nsec          676 nsec
illegal instruction              67      3.3%          49      3.1%        1456 nsec          131 nsec
ABFT                             —        —           233     14.8%     4299434 µsec      4122251 µsec
O.S. assertions                  27      1.3%           8      0.5%           —                 —
Error Capturing Instruct.        —        —             1      0.1%        1284 nsec            0 nsec
other exceptions                  2      0.1%           0      0%             —                 —
Total detected errors           371     18.0%         581     36.9%     1752750 µsec      3371153 µsec

Table 11. Breakdown of detected errors according to the detection method for Matmult.

The number of undetected wrong results went down to zero. In fact, we know that it is not precisely zero: since the ABFT algorithm does not encompass the whole program execution (there is the data input phase in the beginning and the data output phase at the end), errors occurring outside the zone protected by ABFT can lead to undetected wrong results. We have observed that in other experiments [20], although it did not happen for any of the faults injected in this particular experiment. This result is very relevant and shows that the inclusion of ABFT is probably the only way of achieving very high error coverage in a standard machine at a relatively low cost (less than 10% of performance overhead in the present case). Unfortunately, ABFT can only be used for some algorithms.
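To give a sense of how ABFT catches data errors in a matrix multiplication, the sketch below shows the classical checksum scheme: the row sums of B act as a checksum vector, and each row sum of the computed C is verified against the value predicted from A. This is only an illustration in the spirit of [16], not necessarily the exact algorithm used in the experiments; the matrix size and tolerance are hypothetical.

```c
/* Illustrative checksum-based ABFT for C = A*B (sketch, not the project code).
 * N and TOL are hypothetical. A mismatch is treated as a detected error. */
#include <math.h>
#include <stdlib.h>

#define N   64
#define TOL 1e-6

static void abft_error(void) { abort(); }

void matmult_abft(const double A[N][N], const double B[N][N], double C[N][N])
{
    double brow[N];                 /* row sums of B, the checksum vector */

    for (int k = 0; k < N; k++) {
        brow[k] = 0.0;
        for (int j = 0; j < N; j++) brow[k] += B[k][j];
    }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];
        }

    /* Check: the sum of row i of C must equal sum_k A[i][k] * brow[k]. */
    for (int i = 0; i < N; i++) {
        double expected = 0.0, actual = 0.0;
        for (int k = 0; k < N; k++) expected += A[i][k] * brow[k];
        for (int j = 0; j < N; j++) actual   += C[i][j];
        if (fabs(expected - actual) > TOL * (1.0 + fabs(expected)))
            abft_error();
    }
}
```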

From the above results some conclusions can be drawn, within the limits of the system and techniques tested:
- Some improvements in the error detection capabilities of existing parallel machines can be introduced by using error detection techniques based on software. However, the need to avoid significant performance costs imposes strong limits on the coverage that can be achieved;
- The error detection coverage and the probability of corrupting the application results without error detection are very dependent on the application;
- Significant enhancements on standard machines in catching undetected wrong results were only attained by using ABFT;
- To obtain high coverage at low cost the error detection methods have to be designed in, and not added in the end as an afterthought.

The inclusion of simple error detection techniques in the nodes from the very beginning is really required to achieve high coverage at low cost. A clear example of this is derived signature monitoring, which can achieve very high coverage [17] but cannot be used in most of the existing processors because a small circuit (the signature monitor) has to be placed inside the processor chip. The fact is that many existing parallel machines are based on processors originally designed for low-cost commodity products, like personal computers and low-end workstations, in which cost is the prime design criterion. Thus, error detection and data integrity features are not readily achieved in these processors by posterior improvements. However, very recent processors such as THOR [24] and ERC32 [8], designed for space flight applications, do include a comprehensive set of built-in error detection mechanisms, including derived signature monitoring. Maybe this example can be generalised in the future to common commercial processors.

5. Communication level injection.

The communication subsystem is a potential source of faults in parallel computers. This is especially true for massively parallel computers where the nodes communicate by messages through very high speed links. The communication media are intrinsically prone to electrical interference. This interference can cause errors in transmitted data, propagate into processing nodes and finally cause computational errors or node failures. In fact, those errors can be as harmful as errors in the processor circuitry. Congestion can also lead to lost messages.

The use of end-to-end CRCs on the data movements between processors seems to be a logical way of handling faults in the communication subsystem. In fact, recent parallel machines such as the IBM SP2 and the Cray T3D use CRCs implemented at the hardware level of the communication subsystem, which allows the implementation of reliable communication primitives without reducing the communication speed. Other parallel systems, such as the ones using the HELIOS operating system, achieve reliable communication by using a complex (and slow) message protocol.

The Parsytec PowerPC machine running the Parix 1.3 operating system used in the present experiments has no error detection in the communication subsystem. Thus we decided to study the improvement brought by the use of an end-to-end CRC on all the information moved from the sender to the receiver process. In practice, in addition to the PARIX communication primitives, SendLink(..) and RecvLink(..), we have created two new primitives: SendLinkCRC(..) and RecvLinkCRC(..). These primitives have the same parameters as the original ones. SendLinkCRC(..) simply calculates a 16-bit CRC for the given message and sends it along with the CRC using the normal SendLink(..). On the other hand, RecvLinkCRC(..) receives the message using the original RecvLink(..), calculates its CRC, compares it with the received CRC, and terminates the application if they are not equal.
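A minimal sketch of what such wrappers can look like is given below. The SendLink/RecvLink prototypes and the CRC-16 polynomial are assumptions made only for illustration; the real Parix calls and the CRC computation used in the project may differ.

```c
/* Illustrative end-to-end CRC wrappers around message passing primitives.
 * The SendLink/RecvLink prototypes and the CRC-16 polynomial below are
 * assumptions for this sketch, not the actual Parix API or project code. */
#include <stdint.h>
#include <stdlib.h>

/* Assumed underlying primitives (hypothetical signatures). */
int SendLink(int link, const void *buf, unsigned len);
int RecvLink(int link, void *buf, unsigned len);

static uint16_t crc16(const uint8_t *p, unsigned len)
{
    uint16_t crc = 0xFFFF;                 /* CCITT-style CRC-16, illustrative */
    for (unsigned i = 0; i < len; i++) {
        crc ^= (uint16_t)p[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int SendLinkCRC(int link, const void *buf, unsigned len)
{
    uint16_t crc = crc16(buf, len);
    if (SendLink(link, buf, len) < 0) return -1;
    return SendLink(link, &crc, sizeof crc);   /* CRC travels with the message */
}

int RecvLinkCRC(int link, void *buf, unsigned len)
{
    uint16_t sent;
    if (RecvLink(link, buf, len) < 0) return -1;
    if (RecvLink(link, &sent, sizeof sent) < 0) return -1;
    if (crc16(buf, len) != sent)
        abort();      /* mismatch: stop the application (no retransmission) */
    return 0;
}
```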

No provision has been made for retransmission of messages in case of CRC error, as the aim of our work was only to assess the effectiveness of including CRCs in messages as an error detection method.

The performance overhead caused by these new communication primitives is quite dependent on the application. The size of the messages and the ratio of communication to computation are the most relevant factors in this respect. Considering several programs that have been compiled with these new communication primitives, including the three programs used in these experiments, the performance overhead observed ranged from 2.5% to 73.5%. This strong variation makes it clear that the CRCs should be supported by the lower levels of the communication subsystem.

It should be noted that communication faults affecting the payload have a quite predictable effect on the CRC calculation, so that the coverage of SendLinkCRC(..) and RecvLinkCRC(..) could easily be calculated by analytical means for this class of faults. However, the same is not true for faults affecting the headers, as they can induce rather diverse wrong behaviour in the system. Furthermore, the effect of communication faults on the parallel application is also very difficult to estimate. Thus we decided to use again a fault injection approach to evaluate the impact of faults in the communication subsystem [4]. The injection of faults at the communication subsystem was performed by using the CSFI (Communication Software Fault Injector) tool [2]. The injected faults consisted of burst errors from 1 to 4 bits, and affected randomly any byte transmitted in the affected link, which means that both the payload and the headers of the packets were corrupted.

Table 12 shows the results obtained with the original machine, i.e., without using the new communication primitives with CRCs. The results are very dependent on the target program (as usual). However, it is clear that communication faults are as dangerous as processor faults. From 4.9% to 13% of the faults caused the application to produce undetected wrong results. A significant percentage of the faults have been detected (from 27.2% to 39.8%). However, many of these detections have nothing to do with the communication subsystem. A clear example of this is when a fault injected in the message payload affects floating point data: in this case the error can be detected in the receiver node whenever the fault caused a violation of the standard floating point format. The system crashes are mainly caused by faults affecting the headers of messages.

System outcomes             Ising                 Montecarlo            Matmult
                            # Faults    %         # Faults    %         # Faults    %
correct results                817    51.3%           92    20.5%          547    42.5%
undetected wrong result        108     6.8%           22     4.9%          164    12.7%
crash                           34     2.1%          213    47.4%          126     9.8%
detected                       634    39.8%          122    27.2%          450    35.0%
total injected faults         1593     100%          449     100%         1287     100%

Table 12. Communication faults injected in a Parsytec PowerXplorer (system with no changes).

Faults injected in the same programs, recompiled with the new primitives SendLinkCRC(..) and RecvLinkCRC(..), have shown that the percentage of undetected wrong results was reduced to zero. This was expected, as the communication faults that cause undetected wrong results are the ones that affect the payload without disturbing the execution of the parallel application (no messages lost, no unaligned data, etc.); in those cases the check of the CRC can be performed by RecvLinkCRC(..) and the error is detected with a probability of 1 - 2^-16. It should be noted that the effectiveness of end-to-end CRCs for payload faults was expected. However, the fault injection results show that communication faults affecting the system in such a way that RecvLinkCRC(..) cannot be executed (this happens for many faults affecting the header) do not cause wrong application results; they typically cause system crashes. This means that the communication faults that were not detected by the end-to-end CRCs have been detected by other means (like a watchdog timer) for all the injected faults.

6. Conclusions.

This paper presents an experimental evaluation of the impact of faults in an off-the-shelf parallel system running actual parallel programs. Faults have been injected by software in both the processor nodes and the communication subsystem. The results showed quite clearly that a significant number of faults caused the programs to produce undetected wrong results. That is, the program terminates and everything seems to be normal (no errors are reported), but the results delivered to the user are wrong. Although these results cannot be directly generalised to other machines, they strongly support the position that all current massively parallel machines exhibit the same type of behaviour. This means that fault tolerance is important in massively parallel machines, not only to guarantee continuity of service and proper completion of long-running applications, but also for us to be able to depend on program results.

Several simple error detection methods were used to enhance the fault tolerance of the tested machine. They were chosen because they did not require changes to the hardware and did not slow down the system in a significant way. Fault injection results showed that the percentage of undetected wrong results can be reduced for some applications. However, only ABFT (and end-to-end CRCs for communication faults) showed the potential to eliminate undetected wrong results, and ABFT is limited to a small number of algorithms. For a more general approach than ABFT, our results show that, as far as the error detection methods used are representative, to obtain high coverage at the low cost required by massively parallel machines the error detection methods have to be designed in, and not added at the end as an afterthought.

Error detection is also the first step towards implementing the fault tolerance techniques required to scale up the number of nodes without having the system reliability ruined by the "statistical faults" resulting from the very high number of system components. Our final recommendation is that research is badly needed to determine how often undetected wrong results really happen in production systems.

References

[1] J. Arlat, M. Aguerra, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins, D. Powell, "Fault injection for dependability validation: a methodology and some applications", IEEE Trans. on Software Engineering, Vol. 16, No. 2, Feb. 1990, pp. 166-182.
[2] J. Carreira, H. Madeira, J. Gabriel Silva, "Assessing the Effects of Communication Faults on Parallel Applications", Proc. IPDS'95, International Computer and Dependability Symposium, Erlangen, Germany, April 1995, pp. 214-223.
[3] J. Carreira, H. Madeira, J. Gabriel Silva, "Xception: Software Fault Injection and Monitoring in Processor Functional Units", Pre-prints of the Fifth International Working Conference on Dependable Computing for Critical Applications, DCCA-5, Urbana-Champaign, Ill., USA, Sept. 27-29, 1995, pp. 135-149.
[4] J. Carreira, D. Costa, H. Madeira, J. Gabriel Silva, "Understanding Communication Faults in Parallel Computers", Proc. of the First Conference on Fault Tolerant Systems, IIT Madras, India, December 20-22, 1995.
[5] R. Chillarege and N. Bowen, "Understanding Large Systems Failures - A Fault Injection Experiment", Proc. 19th Int. Symp. on Fault-Tolerant Computing, Chicago, June 1989, pp. 356-363.
[6] K. Echtle, M. Leu, "The EFA Fault Injector for Fault-Tolerant Distributed System Testing", Workshop on Fault-Tolerant Parallel and Distributed Systems, 1992, pp. 28-35.
[7] Pedro Furtado, "Control Flow Checking Error Detection Methods in a Highly Integrated Microprocessor" (text in Portuguese), MSc Thesis in Computer Engineering, University of Coimbra, 1996.
[8] Jiri Gaisler, "Concurrent Error-Detection and Modular Fault-Tolerance in a 32-bit Processing Core for Embedded Space Flight Applications", Proc. of the 24th Fault-Tolerant Computing Symposium, FTCS-24, 1994, pp. 128-130.
[9] S. Han, H. Rosenberg, K. Shin, "DOCTOR: an Integrated Software Fault Injection Environment", Technical Report, University of Michigan, 1993.
[10] Ravishankar K. Iyer, "Experimental Evaluation", Special Issue of the 25th Int. Symp. on Fault-Tolerant Computing (FTCS-25), Pasadena, California, June 27-30, 1995, IEEE Computer Society Press, ISBN 0-8186-7146-7, pp. 115-132.
[11] G. Kanawati, N. Kanawati, and J. Abraham, "FERRARI: A Tool for the Validation of System Dependability Properties", FTCS-22, Digest of Papers, IEEE, 1992, pp. 336-344.
[12] Wei-lun Kao, R. K. Iyer, "DEFINE: A Distributed Fault Injection and Monitoring Environment", Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994.
[13] J. Karlsson, P. Liden, P. Dahlgren, R. Johansson, U. Gunneflo, "Using Heavy-ion Radiation to Validate Fault-Handling Mechanisms", IEEE Micro, Vol. 14, No. 1, 1994, pp. 8-32.
[14] J. Karlsson, P. Folkesson, J. Arlat, Y. Cruzet, G. Leber, and J. Reisinger, "Application of Three Physical Fault Injection Techniques to the Experimental Assessment of the Mars Architecture", Pre-prints of the Fifth International Working Conference on Dependable Computing for Critical Applications, DCCA-5, Urbana-Champaign, Ill., USA, Sept. 27-29, 1995, pp. 150-161.
[15] P. K. Lala, "Fault Tolerant and Fault Testable Hardware Design", Prentice Hall International, NY, 1985.
[16] F. T. Luk, "Algorithm-based fault tolerance for parallel matrix solvers", Proc. SPIE Real-Time Signal Processing VIII, Vol. 564, 1985, pp. 49-53.
[17] H. Madeira and J. G. Silva, "Experimental Evaluation of the Fail-Silent Behaviour in Computers Without Error Masking", Proc. 24th Fault-Tolerant Computing Symposium, FTCS-24, Austin, USA, June 1994, pp. 350-359.
[18] H. Madeira, M. Rela, F. Moreira, J. Gabriel Silva, "RIFLE: A General Purpose Pin-Level Fault Injector", Proc. First European Dependable Computing Conference, Berlin, Germany, October 1994, pp. 199-216.
[19] G. Miremadi, J. Karlsson, U. Gunneflo, and J. Torin, "Two Software Techniques for On-line Error Detection", Proc. 22nd Fault-Tolerant Computing Symposium, FTCS-22, 1992, pp. 328-335.
[20] M. Z. Rela, H. Madeira, J. Gabriel Silva, "Experimental Evaluation of the Fail-Silent Behavior in Programs with Consistency Checks", accepted for the 26th Fault-Tolerant Computing Symposium, FTCS-26, 1996.
[21] M. Schuette, J. P. Shen, "Processor Control Flow Monitoring Using Signatured Instruction Streams", IEEE Transactions on Computers, Vol. 36, No. 3, March 1987.
[22] Z. Segall, T. Lin, "FIAT: Fault Injection Based Automated Testing Environment", Proc. 18th Int. Symp. on Fault-Tolerant Computing, June 1988, pp. 102-107.
[23] D. P. Siewiorek and Robert S. Swarz, "The Theory and Practice of Reliable System Design", Digital Press, Educational Services, Digital Equipment Corporation, Bedford, Massachusetts, 1984.
[24] "THOR Datasheet", Saab Ericsson Space, 1994.
[25] L. T. Young, R. K. Iyer, K. K. Goswami, and C. Alonso, "A Hybrid Monitor Assisted Fault Injection Environment", 3rd IFIP Working Conference on Dependable Computing for Critical Applications, Sicily, Italy, September 1992.
