A scalability test for parallel code


SOFTWARE-PRACTICE AND EXPERIENCE, VOL. 25(12), 1299-1314 (DECEMBER 1995)

A Scalability Test for Parallel Code

GORDON LYON*, RAGHU KACKER† AND ARNAUD LINZ‡§

National Institute of Standards and Technology, Gaithersburg, MD 20899, U.S.A. (emails: lyon@nist.gov, kacker@cam.nist.gov)

* Advanced Systems Division, B364/223, (301) 975-5679, FAX (301) 926-9675.
† Statistical Engineering Division, A337/101.
‡ Advanced Systems Division, B364/223.
§ Visited NIST in 1994 from Institut National des Télécommunications, Evry, France.

SUMMARY

Code scalability, crucial on any parallel system, determines how well parallel code avoids becoming a bottleneck as its host computer is made larger. The scalability of computer code can be estimated by statistically designed experiments that empirically approximate a multivariate Taylor expansion of the code's execution response function. Each suspected code bottleneck corresponds to a first-order term in the expansion, the coefficient for that term indicating how sensitive execution is to changes in the suspect location. However, it is the expansion coefficients for second-order interactions between code segments and the number of processors that are fundamental to discovering which program elements impede parallel speedup. A new, unified view of these second-order coefficients yields an informal relative scalability test of high utility in code development. Discussion proceeds through actual examples, including a straightforward illustration of the test applied to SLALOM, a complex, multiphase benchmark. A quick graphical shortcut makes the scalability test readily accessible.

KEY WORDS: designed experiments; MIMD scalability; parallel processing; performance evaluation; sensitivity analysis

CCC 0038-0644/95/121299-16 © 1995 by John Wiley & Sons, Ltd.
Received 1 August 1994; Revised 6 April 1995

1. BACKGROUND

From its earliest days computing has employed performance statistics to guide the design, writing and tuning of code. The most popular statistics by far are counts and timings that profile the computing demands of pieces of code. Gathered profiles on a classical serial-architecture machine reveal clearly where a program consumes available computing cycles.1 Profiles become more confused on multiple-instruction, multiple-data (MIMD) stream parallel systems, since the straightforward linear execution of serial code is replaced by many concurrent threads. While a single executing thread defines a clear-cut state (such as executing some routine), with concurrency and multiple threads this 'state' becomes but one component of a product state defined over all processors. Unfortunately, such a state space grows combinatorially with the size of the host: finding a suitable view of the space is a major challenge.2 Another profiling aggravation arises from interactions of parallel code. Since parallel processes within a job are co-operating together, code interaction is deliberate. Synchronizations are examples. However, profile information on interactions is not easily captured via conventional event instrumentation. Because interactions can grow exponentially with the number of co-operating processes, there can be a tremendous number of recorded events. For example, n processes have 2^n - n - 1 potential process interactions, ranging from an unlikely nth-order interaction down to the quite common two-way. Thus, applied to scalable parallel programs, conventional execution statistics and their interpretation metrics are often stressed badly. As parallel systems scale to hundreds of nodes and beyond, execution profile analysis becomes problematic using conventional means.

1.1. An alternative

A frequently overlooked alternative to code tuning is the statistically designed experiment (DEX), which handles program and system together as an amorphous entity. This approach does not claim that the entity has no inherent structure but rather, that much detail remains hidden from the investigator. For example, a complete analytic model of a MIMD computer system and program will probably be too expensive or time-consuming to justify. DEX's empirical entity has controllable input settings to factors, which are structural components chosen for interest. Factors can be code segments (basic blocks) suspected of being bottlenecks; these factors are set by making benign changes to their code. Parallel system size is another factor; it is determined by the number of processors, s, assigned to a program. The complete setup of factors for a run constitutes a treatment. A statistically designed experiment correlates runs of its treatments via a response measured for each. Overall program run time is a commonly chosen response. MIMD responses have uncertainty, noise. Causes of this variation are many: caches reach slightly different states; clocks give limited resolution; interconnection traffic flows in various orders; minor system services cause processor interruptions; and algorithms become non-deterministic. The powerful DEX macromodeling or curve fitting perspective estimates the degree to which factors and interactions among factors influence performance. As will be seen in the scalability test, characterizing factor interactions is important. State in the DEX approach shifts from the executing program, where the state space is huge and time is an element, to a collection of treatments (factor settings) for the test program. Considerably smaller and static, this new space defined upon input draws upon a wealth of DEX methods to simplify its handling and interpretation.
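To make the vocabulary concrete, here is a minimal Python sketch (not from the paper; the factor names are illustrative) that enumerates the treatments of a two-level full factorial design, each treatment being one complete setup of factor settings for a run:

```python
# Illustrative sketch: enumerate two-level treatments for a small designed
# experiment.  Each software factor is a code segment whose 'high' (+1)
# setting means an inserted synthetic delay; factor "s" is system size.
from itertools import product

factors = ["F1", "F2", "F4", "s"]   # hypothetical factor names

def treatments(names):
    """Yield every treatment as a dict mapping factor -> -1 (low) or +1 (high)."""
    for settings in product((-1, +1), repeat=len(names)):
        yield dict(zip(names, settings))

for t in treatments(factors):
    print(t)   # one program run per treatment; its response (e.g. run time) is recorded
```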

1.2. Scalability

Code scalability, always crucial on any parallel system, determines how well a section of code avoids becoming a bottleneck as its host computer is made larger. A recent note by Snelick et al. presents an interesting method of determining the scalability prospects of parallel code.5 Their approach employs DEX and synthetic perturbation (SP),4 the latter arising because applying DEX to test software performance requires economical ways of modifying code efficiencies to achieve treatment settings. Naturally occurring, adjustable parameters are unlikely in arbitrarily chosen code segments. Segments lacking natural parameters can be recoded, but these recodings must be checked for correctness, an impracticality when the set of segments is large. Synthetic perturbation, SP, is an inexpensive method of code modification that employs artificial means.4 The SP simulates local, easily adjusted efficiency changes, which in turn drive DEX sensitivity analyses. High setup efficiencies are inherent in SP. For our purposes here, each SP will be extra code that causes delay. This can be smoothly inserted or removed without interfering with actual computations. An inserted delay must be large enough to affect performance at levels above the background noise of an experiment. At the same time, the delay should not significantly slow or distort execution (cf. the discussion in Reference 4, Section 5). Context is important for the delay and the efficiency change it actually induces. Testing a short segment of code typically takes a delay of one to ten dummy floating point multiplications. The entry point of a heavily computational subroutine demands a more substantial delay commensurate with the routine's prolonged execution; otherwise no change is apparent. Similarly, a delay in a tight loop clearly affects program run time far more than the same delay in a one-shot initialization. Placement and size of the delay implicitly frame a question about the sensitivity of the specimen program to that specific change in code efficiency. This question should make sense; otherwise, subsequent analyses are misleading. DEX analysis yields an approximate multivariate Taylor expansion of the joint program-system execution response. Details of this correspondence are elaborated at appropriate points in the text. Loosely stated, a response is represented as the sum of a constant plus weighted terms. The weight coefficients are effects. Each suspected code bottleneck corresponds to a first-order term in the expansion, the effect for that term indicating how sensitive execution is to changes in the suspect factor. Effects for second-order interactions between code segments and the number of processors express which segment improvements best promote parallel speedup. A new, unified view of second-order terms yields a simple scaling test that can be calculated or graphed. The test has high utility in focusing parallel code improvement efforts. However, our discussion begins with the scalability test already mentioned.
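A synthetic delay can be as small as the handful of dummy floating-point multiplications mentioned above. The sketch below is illustrative only (the paper perturbs compiled MIMD code, not Python); the iteration count stands in for the calibration the text describes:

```python
# Illustrative sketch of a synthetic-perturbation (SP) delay: dummy
# floating-point work that is executed when the factor is set 'high' (+1)
# and skipped when the factor is 'low' (-1).  The size of the delay must
# clear the experiment's noise without distorting execution.
def sp_delay(setting, multiplications=10):
    if setting != +1:
        return                      # original efficiency: no perturbation
    x = 1.0
    for _ in range(multiplications):
        x *= 1.0000001              # dummy multiplications; the result is discarded
```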

1.3. Comparing two tables

Snelick et al. base their scalability test for parallel code upon orderings of DEX effects.5 In their approach, the number of processors s is handled differently from other factors. There is a separate preliminary stage for each of two distinct settings of s. Each preliminary stage applies DEX to the program to establish effects for selected segments of code, which are the factors. This initially involves running treatment setups of the program and recording response measurements. A relationship between treatment settings and observed data is then modeled by statistical effects. To this end, a contrast is calculated for each factor i: the mean response for all runs with factor i set 'low' is subtracted from the average for all runs with factor i set 'high'. The final result is an effect, βi, illustrated by entries in the third column of Table I. A later section (2.3) shows sample calculations for βi. A second stage now compares the two first-stage tables of effects. The following example, taken and enlarged slightly from Reference 5, compares the effects of code segments from a 3500-line image understanding benchmark (IUB) against the size of the host system. Tables I and II show rankings of the IUB software sensitivity effects for s = 8 and s = 24 processors, respectively. Informally comparing rank-orderings of effects in the two tables yields qualitative estimates about scalabilities of segments of code. As a factor gains in rank, it is increasingly worth inspecting. For instance, factors F2, F1, and F4 migrate to higher rankings in Table II, drawing attention to their parent routine, Connected Components. This comparison is not as simple as it might be, since it involves orderings from two tables. However, it is not difficult to devise a new test that has only one table.
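In code, each effect of this kind is a difference of two mean responses over the recorded runs. A sketch (the division by two follows the per-unit-input convention used later in Section 2.3; the IUB tables may scale their contrasts differently):

```python
# Sketch: estimate one factor's effect as a contrast between mean responses.
# 'runs' pairs each treatment (dict of settings) with its measured response;
# the data below reuse the four responses of Table IV for illustration.
def effect(runs, factor):
    high = [r for settings, r in runs if settings[factor] == +1]
    low = [r for settings, r in runs if settings[factor] == -1]
    contrast = sum(high) / len(high) - sum(low) / len(low)
    return contrast / 2          # per unit input, as in Section 2.3

runs = [({"cd": -1, "s": -1}, 40.0), ({"cd": +1, "s": -1}, 44.0),
        ({"cd": -1, "s": +1}, 24.0), ({"cd": +1, "s": +1}, 29.0)]
print(effect(runs, "cd"))        # 2.25, the beta_cd of the later worked example
```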


Table I. IUB effects, eight processors.

    Rank  Factor i  Effect βi (A)†  Routine    Const.
    1     F17       6.03            Grad Magn  for
    2     F26       5.46            Med Filt   while
    3     •F2       5.26            Conn Comp  for
    4     •F1       3.94            Conn Comp  funct
    5     •F4       3.84            Conn Comp  while
    6     F25       2.01            Med Filt   for
    7     F20       1.67            Match      funct
    8     F29       1.36            Probe      for

† Standard Error (Uncertainty): ±0.04. • Factor of special interest; see text. Source: R. Snelick.5

2. NEW TESTS

Major simplifications can be applied to the above comparison test. Discussion focuses first upon host system scale (processor count), s, a factor that can be promoted to regular status with the same standing as any software factor. This change yields one unified view rather than two distinct tables. Consequently, interpretation is more straightforward.

Table II. IUB effects, 24 processors.

    Rank  Factor i  Effect βi (B)†  Routine    Const.
    1     •F2       5.43            Conn Comp  for
    2     •F1       3.95            Conn Comp  funct
    3     •F4       3.93            Conn Comp  while
    4     F17       2.14            Grad Magn  for
    5     F26       2.03            Med Filt   while
    6     F20       1.55            Match      funct
    7     F29       0.95            Probe      for
    8     F25       0.75            Med Filt   for

† Standard Error (Uncertainty): ±0.12. • Factor of special interest; see text. Source: R. Snelick.5

2.1. Using only one table

Imagine for the previous example (Tables I and II) a new midpoint, Table III, for s = (8 + 24)/2 = 16 processors. Linearly interpolate effects from Table I, column A and Table II, column B effect-by-effect as (A + B)/2. To see why this works, consider a factor x appearing in both tables. Each table's effects are built from its own set of response measurements {τ8} or {τ24} of n distinct treatments on 8 or 24 processors, respectively. Now, effect βx for x depends (within a multiplicative constant) upon a contrast between the average measured response for x set 'high' (occurs in n/2 treatments) and the average response for x set 'low' (occurs in the remaining n/2 treatments). For example, with Table I:

    βx(8) ∝ avg{τ8: x = +1} - avg{τ8: x = -1}

Disjoint response sets {τ8} and {τ24} can also be treated as 'low' and 'high' data subsets of a larger set {τ8∪24} = {τ8} ∪ {τ24}. Thus combined, settings s = 8 and s = 24 define an additional factor, system size, in a larger experiment combining all 2n treatments. A contrast for factor x in these larger circumstances is then

    βx(8∪24) ∝ avg{τ8∪24: x = +1} - avg{τ8∪24: x = -1}

Expanding the above via equalities such as

    avg{τ8∪24: x = +1} = ( avg{τ8: x = +1} + avg{τ24: x = +1} ) / 2

and rearranging terms yields:

    βx(8∪24) = ( βx(8) + βx(24) ) / 2 = (A + B)/2

This derived expression, the average of contrasts from each original response set, substantiates the (A + B)/2 interpolation rule for effects in Table III. The second numerical column in Table III expresses second-order interactions (B - A)/2; these predict changes in effects vis-à-vis scaling factors. While effects in Table III calculated via the interpolation rule express current influences of factor settings, it is the interactions βi,s that predict further scaling success. They do this explicitly and quantitatively. A more negative interaction βi,s in Table III implies a more scalable factor. For instance, the scaling interaction βF2,s = +0.09 for factor F2 is greater than zero. F2 does not scale, since its effect increases as processors are added. Such results confirm more precisely the conclusions reached earlier in the two-table comparisons.

Table III. Scalabilities via second-order terms.

    i    Effect βi (A+B)/2 †  Interaction βi,s (B-A)/2 †  Prospects for scaling
    F2   5.35                 +0.09                       oo  poor
    F17  4.09                 -1.95                       ++  best
    F1   3.95                 +0.01                       oo  poor
    F4   3.89                 +0.05                       oo  poor
    F26  3.75                 -1.72                       ++  best
    F20  1.61                 -0.06                       o   mediocre
    F25  1.37                 -0.64                       +   better
    F29  1.16                 -0.21                       o   mediocre

† Standard Error (Uncertainty): ±0.07. Source: Tables I and II.
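Mechanically, the one-table view is two arithmetic means per factor. The sketch below applies the two rules to three of the IUB factors, using the effects of Tables I and II (a fuller version would also compare each interaction against the ±2SE noise band of Section 2.4):

```python
# Sketch: fold two per-scale effect tables into the single-table view of
# Table III.  Values are the Table I (s=8) and Table II (s=24) effects.
effects_8 = {"F2": 5.26, "F17": 6.03, "F26": 5.46}    # column A
effects_24 = {"F2": 5.43, "F17": 2.14, "F26": 2.03}   # column B

for factor, A in effects_8.items():
    B = effects_24[factor]
    midpoint = (A + B) / 2        # interpolated effect at s = 16
    interaction = (B - A) / 2     # second-order code x scale interaction
    prospect = "scalable" if interaction < 0 else "not scalable"
    print(f"{factor}: effect {midpoint:.2f}, interaction {interaction:+.2f} ({prospect})")
```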


Table IV. Scaling experiment.

    Xcd   Xs    Icd,s   Response, R†
    -1    -1    +1      a  (40)
    +1    -1    -1      b  (44)
    -1    +1    -1      c  (24)
    +1    +1    +1      d  (29)

† Measured in seconds; actual example values in parentheses.

2.2. Isolated segments

While a single table is an improvement over employing two tables to test scalability, the refinement represented in Table III must still display numerous code segments together. This requires many runs of the program. The next improvement yields a scaling test suitable for single, isolated segments. Only four runs are needed. To achieve this economy, the test cannot rely upon comparisons with other code factors, although the interpolation mechanism for scale s remains fundamentally as already introduced in the single-table improvement. Imagine a scaling assessment for a routine cd() that is a phase of a hypothetical parallel program. By substitution, steps for this example generalize to other specimen segments and programs. The evaluation assesses the degree to which changes in code cd() and scale s affect performance, here measured as overall program run time. Responses other than run time can be substituted. A later example employs the resolution of a solution that is computed in a fixed period. Memory demand is another measurable response.4 The test factor cd() is represented in Table IV with a setting Xcd = -1 for its original code and Xcd = +1 for another copy cd'() modified with added delay. For consistency, one should adhere to this convention. Of course, if a less efficient version of cd() is available, it can be used in lieu of cd'(). Xs = -1 denotes a host system configuration of s processors, while Xs = +1 indicates a larger configuration of s' > s processors. Again for consistency of interpretation, this matching of the small host to -1 and the larger host to +1 should be maintained. The interaction term Icd,s = Xcd·Xs is +1 when Xcd and Xs are set alike and -1 when they are dissimilar. Each row of Table IV records a measured response that arises from the program running with the specified treatment (row setup). Responses are labeled a-d. Observe that response R is overall run time and not time in segment cd(). This is sometimes misunderstood. It is a great advantage that the scaling test need employ only easily-obtained data. The underlying test model views the program as a transfer function with two parameters, viz. R(Xcd, Xs). Table IV has the information needed to flesh out an approximate Taylor expansion of function R(). Effects βcd, βs, and βcd,s in this expansion reveal execution sensitivities of the specimen program to changes in code efficiency (Xcd), size of host (Xs), or both factors acting in concert (Icd,s). Let μ be a constant term. The model is thus:

    βcd·Xcd + βs·Xs + βcd,s·Icd,s + μ = R    (i)

Equation (i) defines a response surface which is linear in term coefficients (effects). This linearity is adequate for small changes in cd and s or for coarser developmental evaluations. Since code is being changed constantly during improvement, highly accurate predictions actually waste effort. (Curvilinear circumstances are often skirted by suitable transformation. For example, y = ax^b becomes Y = A + BX via a logarithmic application to both sides.3) Inserting Table IV values for Xcd, Xs, and R into equation (i) yields four simple equations:

    -βcd - βs + βcd,s + μ = a
    +βcd - βs - βcd,s + μ = b
    -βcd + βs - βcd,s + μ = c
    +βcd + βs + βcd,s + μ = d

Solving directly:

    μ     = (a + b + c + d)/4
    βcd   = (-a + b - c + d)/4
    βs    = (-a - b + c + d)/4
    βcd,s = (a - b - c + d)/4

2.3. With actual values

Table IV has some concrete values for responses a-d that help illustrate the expansion of R. Constant μ = 34.25 is response R averaged over all runs of the experiment. Effect βcd is a slope, the rate of change of R as Xcd varies, ΔR/ΔXcd. Imagine the average response when Xcd is set +1 minus the average response when Xcd is set -1; the net result is then divided by two because the input domain is 1 - (-1), a distance of two. Rewriting with values of R from Table IV shows this:

    βcd = [ (44 + 29)/2 - (40 + 24)/2 ] / 2 = 2.25

Coefficient βs is another slope, the overall change in response per additional processor. Filling in concrete values, βs = (-40 - 44 + 24 + 29)/4 = -7.75. The valuation of effect βcd,s for the interaction term Icd,s is central to the scalability test. Expressing the effect of an effect, βcd,s is the difference per unit Xs of the difference in R per unit Xcd:

    βcd,s = Δ²R / (ΔXcd ΔXs) = [ (d - c)/2 - (b - a)/2 ] / 2

Here, βcd,s = 0.25. The interpretations reinforce earlier claims that DEX approximates a Taylor expansion of R(Xcd, Xs). There are two first-order coefficients (βcd, βs) and one second-order coefficient (βcd,s). These are also least squares estimates of the effect of code, of processors, and of their joint interaction. Furthermore, constant μ, the average response over all runs, can be expressed more explicitly. A solution of the equations is centered at point (Xcd, Xs) = (0, 0), an average treatment setting over all runs; it follows that R(0, 0) ≈ μ. Substituting R(0, 0) into equation (i) and reordering:

    R(Xcd, Xs) = R(0, 0) + βcd·Xcd + βs·Xs + βcd,s·Icd,s    (i*)


Recall from elementary calculus that this approximates Maclaurin's series expansion about zero of R(). Taylor's expansion, which admits centerings other than zero, generalizes the Maclaurin.
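The entire fit is four signed averages. A sketch that reproduces the worked numbers from Table IV:

```python
# Sketch: the four expansion terms of the 2x2 scaling experiment, computed
# from the Table IV responses (seconds).
a, b, c, d = 40.0, 44.0, 24.0, 29.0

mu       = ( a + b + c + d) / 4   # 34.25  average response
beta_cd  = (-a + b - c + d) / 4   #  2.25  sensitivity to the cd() delay
beta_s   = (-a - b + c + d) / 4   # -7.75  sensitivity to the larger host
beta_cds = ( a - b - c + d) / 4   #  0.25  code x scale interaction

print(mu, beta_cd, beta_s, beta_cds)
```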

2.4. Noise, error and significance

The interpretation of the results depends upon further information on experimental error. It is assumed that there is uncertainty attending any measurement. Standard error, SE, characterizes experiment noise. Noise-generated effects, essentially worthless coefficients in the expansion of R, are samples from a normal distribution centered at zero with standard deviation estimated by SE. Any effect can be real or it can be from sampling noise. There is always a risk that chance variations in the measurements do not cancel out, but instead, give a false indication. By setting the noise band as 0 ± 2SE, an approximate 95 per cent confidence interval is established, assuming errors are normally distributed. Significant effects lie outside this band. If SE is known from previous experience, it can be used. (The example for cd() assumes this circumstance.) Otherwise, replications of trials should be run. Let the repeated measurements be a', b', c' and d'. Standard error is expressed as:

    SE = (1/4) √[ (a - a')² + (b - b')² + (c - c')² + (d - d')² ]
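One replication of the four runs is enough to estimate the noise. A sketch, assuming the SE expression as reconstructed above (the exact normalising constant in the original print is partly illegible) and with replicate values that are purely hypothetical:

```python
# Sketch: estimate SE from one replication of the four runs; effects are then
# screened against the approximate 95 per cent band of +/-2 SE.  The 1/4
# factor follows the formula as reconstructed above; replicate values a'..d'
# here are hypothetical.
from math import sqrt

def standard_error(first, repeat):
    return 0.25 * sqrt(sum((x - y) ** 2 for x, y in zip(first, repeat)))

se = standard_error((40.0, 44.0, 24.0, 29.0), (40.1, 43.9, 24.2, 28.9))
print(se, 2 * se)    # SE and the half-width of the noise band
```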

Calculate coefficients as before, only insert response averages, e.g., ā = (a + a')/2, where single measurements a-d now appear. Repeating the example coefficients:

    μ = 34.25    βcd = 2.25    βs = -7.75    βcd,s = 0.25

These coefficients are relative to the factor settings discussed for Table IV. From prior experience, the standard error is known to be SE = 0.10. All three effects (the β terms) are significant at an approximate 95 per cent confidence interval of ±2SE = ±0.20. Average response μ = 34.25 establishes a context for evaluating magnitudes of effects βi. The scaling sensitivity of the overall system is expressed by effect βs = -7.75. Had βs ≥ -0.20 = -2SE, there would be little point in adding further processors; at best, they would not slow computation. In this example, good scaling yields are still available. Next, consider effect βcd = 2.25, which expresses the sensitivity of the system response to changes in cd(). Since βcd is well outside the ±2SE noise band, efficiency changes in cd as the result of delays are significant. But the main question is whether βcd grows or diminishes with s, the scaling factor. To answer this, look at the interaction effect βcd,s = 0.25 > -0.20 = -2SE, which indicates a failure in scaling for βcd. The latter effect grows mildly with increasing scale s, so that R is increasingly sensitive to changes in cd(). Whenever both βs < -2SE and βcd,s < -2SE, the scalability assessment of cd() is not immediately obvious through inspection. A scaling proportionality is one useful criterion that can be applied. For the criterion to make sense, all terms μ, βs, βcd, and βcd,s must be significant. The idea is to restrict acceptance of the estimated sensitivity of cd() under scaling to at most its current relative importance βcd/μ. Thus, if through speedup μ becomes half of its present value, then the corresponding βcd must be at most half of its current value. The scaled value of βcd is predicted via slope βcd,s, so this slope must descend at least in proportion to βs, which predicts scaling change in μ. Expressing this in terms of the known values:

    βcd,s ≤ (βcd/μ)·βs < 0    ⇒    cd() scales
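As code, the criterion for a diminishing response such as run time is a single comparison; the values are those of the cd() example:

```python
# Sketch: relative scalability test for a run-time (diminishing) response.
# cd() scales only if its interaction slope falls at least as fast as its
# current relative importance times the whole-program scaling slope.
mu, beta_cd, beta_s, beta_cds = 34.25, 2.25, -7.75, 0.25

bound = (beta_cd / mu) * beta_s              # roughly -0.51 here
if beta_cds <= bound < 0:
    print("cd() scales")
else:
    print("cd() does not scale")             # 0.25 > -0.51, so cd() fails
```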

2.5. DEX and Taylor's expansion

The latter test of relative scalability is built upon four Taylor expansion terms. A fuller sketch clarifies the relationship between designed experiments, DEX, and this expansion of response R. Box, Hunter and Hunter have remarked that 'the main effects and interactions [in DEX] can be associated with the terms of a Taylor series expansion of a response function. Ignoring, say, three-factor interactions corresponds to ignoring terms of third order in the Taylor expansion' (Reference 7, p. 374). A Taylor series expansion of a function f() is a power series (Reference 10, pp. 415-416). If f^(i)() is the ith derivative of f() ≡ f^(0)(), then f(x) can be expressed in terms of f and its derivatives evaluated at some other point, say a. Given (x - a) as the displacement from point a,

    f(x) = Σ_{i=0}^{∞} [ f^(i)(a) / i! ] (x - a)^i

A more complex but essentially identical expansion holds for cases when f has multiple arguments. Let

    f() ≡ R(x1, x2, x3, ..., xk)

be a multivariate response function with k input variables corresponding to k factors of interest. Let the Taylor expansion of R be centered at a = 0, so that (xj - a) = xj. As mentioned earlier, this special case is a Maclaurin series: it also centers the expansion amid settings of xj = ±1, which, looking ahead, corresponds to DEX convention. R is thus expressed as:

    R(x1, x2, ..., xk) = R(0) + Σ_{j=1}^{k} xj [∂R/∂xj]_0 + (1/2) Σ_{j=1}^{k} xj^2 [∂²R/∂xj²]_0 + Σ_{i=1}^{k-1} Σ_{j=i+1}^{k} xi·xj [∂²R/∂xi∂xj]_0 + higher-order terms    (ii)

For many experiments including the scaling test, a two-level setting of factors is usually adequate; this implies that any second- or higher-order derivatives in the same variable are insignificant relative to the first-order derivatives. Here, they are assumed to be identically zero, e.g., ∂²R/∂xj² = 0 in equation (ii). The average DEX setting of the inputs Xi is zero; this stems from a deliberate design in DEX treatment patterns. The change of variable Xi ↔ xi is consequently straightforward. Then as observed earlier, term R(0) in (ii) corresponds in the DEX formulation to μ, the average response over all settings. Derivatives that are not identically zero correspond to DEX β coefficients. Thus, ∂R/∂xj = βj and similarly ∂²R/∂xi∂xj = βi,j. The product xi·xj in equation (ii) should look familiar, for it is the second-order interaction Ii,j = Xi·Xj as defined in DEX usage. Applying all these substitutions in equation (ii) produces a result that is a standard DEX formulation, the first example of which was equation (i):

    R = μ + Σ_{j=1}^{k} βj·Xj + Σ_{i=1}^{k-1} Σ_{j=i+1}^{k} βi,j·Ii,j    (iii)

Equation (iii) demonstrates that DEX analysis does indeed capture expansion terms of a system’s response function, as asserted by Box, Hunter and Hunter. The result is a ‘one size fits all’ model that can support surprisingly insightful investigations without demanding the measurement of atomistic details.

3. IN PRACTICE

An actual benchmark best illustrates the test and its necessary variants. A version of SLALOM, the Scalable Language-independent Ames Laboratory One-minute Measurement, has been converted to a 128-node iPSC/860 hypercube parallel system. SLALOM solves a complete, real problem in optical radiosity within a box.8 The benchmark is designed for a fixed run time; the measured response is the resolution achieved in the solution within the allotted time. This scalability makes SLALOM suitable for assessing a broad range of computers, from small to massive. However, SLALOM can also be run with the workload fixed and time as the response. The example begins in this latter mode, but both are demonstrated. The first tests execute SLALOM as a conventional fixed-size benchmark. The measured response is run time. While this is certainly not the designers' intent for SLALOM, it exhibits the scaling test as just described. The results are reasonable. Numerous phases, denoted by x, of the first, fixed-resolution computations are shown in Table V along with their algorithmic orders and scaling results. The scaling test inequality has been slightly recast: since βs is negative,

    βx,s / βs  ≥  βx / μ    ⇒    phase x scales

Table V. SLALOM code on fixed-size computation†

    Phase x   Order, O   Score§   Evaluation‡
    region    n          1.21     pass
    setup1    n²         1.43     pass
    setup2    n²         1.30     pass
    solver    n³         1.15     pass
    storer    n          0.37     fail

† s: 4 and 32 processors. ‡ Score less than unity indicates poor scaling. § Differing SP delays render scores incomparable.

Most of SLALOM's phases are satisfactory relative to overall scalability of the program (more on this later). Order relates amount of computation to problem size. Notice in Table V that the algorithmic order of a phase does not accurately predict its scalability. Storer is only O(n) but it nevertheless scales poorly. This code writes results; inspection shows that it is serial code, an unvarnished parallel bottleneck.

A second set of scaling experiments is performed on an improved version of the iPSC/860 SLALOM test code. Run time is fixed. The measured response is the resolution achieved in the solution, the number of 'patches'. While the fundamental approach remains unchanged, the test must be recast to account for a response that now grows larger with scaling. As always, signs of effects must be handled carefully in building a test inequality. For example, with solver, μ > 0, βsolver < 0, βs > 0 and βsolver,s < 0. Overall performance is a positive quantity and delays in solver diminish the 'patch' count. Scaling in s boosts the count. The effect of delay in solver worsens as the system scale increases. It is a question of whether this deterioration outpaces the general program improvement. Following earlier arguments, the test restricts the effect of solver under scaling in proportion to its current relative importance |βsolver/μ|. If speedup makes μ twice its present value, then the corresponding effect βsolver must be equal to or larger than twice its current negative value. The scaled value of βsolver is predicted via slope βsolver,s, so this slope must descend no steeper than a proportion |βsolver/μ| of -βs. The scaling criterion for this case then becomes:

    βsolver,s  ≥  (βsolver/μ)·βs    ⇒    solver scales

Table VI. SLALOM code with fixed-time computation†

    Phase x   Order, O   Score§   Evaluation
    setup1    n²         2.53     pass
    solver    n³         1.01     pass
    storer               —‡       fail

† s: 4 and 32 processors. ‡ βstorer,s falls into the noise band of ±2SE. § Differing SP delays render scores incomparable.

It would be better if the proportionality were simply one expression, but this is not possible. Varying signs on coefficients mean that in each circumstance the rule must be checked. Not doing so can yield reversed indications of scalability. Table VII gives a summary of conditions. Phases setup1, solver and storer are shown in Table VI. Storer, its effect falling into the noise band, fails once again. Occasionally an effect lies in noise because the chosen perturbation is too small. A larger delay then corrects the deficiency. Here, an increased delay yields no such salvation; storer's failure is probably real. Phase solver is the truly dominant phase of SLALOM. Its test indication of 1.01 (pass) thus makes sense, since the test compares a code segment against the average scaling performance of the entire program, βs.

Table VII. Conditions for scalability test on x

    Expected R    βs¹        βx²        βx,s       Proportionality⁵
    diminishes    < -2SE     > 2SE      < -2SE³    βx,s ≤ (βx/μ)·βs
    increases     > 2SE      < -2SE     < -2SE⁴    βx,s ≥ (βx/μ)·βs

1 Otherwise, condition conflicts with expected response R. 2 On the opposite side of the noise band, a delay improves performance. 3 Whenever βx,s ≥ -2SE, x does not scale. 4 Whenever βx,s ≥ -2SE, x scales. 5 As always, μ > 2SE.
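Table VII's case analysis collapses into one routine once the response direction is known. A sketch (it presumes all four terms have already passed the ±2SE screening the footnotes require):

```python
# Sketch of the Table VII conditions.  `increasing` states whether the chosen
# response grows with scale (SLALOM patches) or shrinks (run time).  All four
# terms are assumed significant relative to +/-2 SE.
def segment_scales(mu, beta_x, beta_s, beta_xs, increasing):
    bound = (beta_x / mu) * beta_s
    if increasing:
        return beta_xs >= bound     # delay's harm must not outgrow overall gain
    return beta_xs <= bound         # delay's overhead must shrink fast enough

print(segment_scales(34.25, 2.25, -7.75, 0.25, increasing=False))   # False: cd() fails
```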

3.1. Results in context

The acceptability of βs is an issue which is often crucial to the overall application of test results. Up to this point, it has been deemed satisfactory that βs not be in the wrong direction or lie in the noise band. But many test users will need a more precise evaluation of βs. Fortunately, there is sufficient information in the collected data and calculated coefficients to do this. For example, the actual response expansion model (see equation (i*)) for the most recent test of solver is

    R = 699.8 - 218.3 Xsolver + 204.8 Xs - 63.2 Isolver,s

This model is centered at s = 18 processors, halfway between s = 4 and s = 32. Each ±1 change in Xs represents 14 processors (e.g., 18 + 14 = 32). The coefficient βs = 204.8 can be interpreted as 204.8/14 = 14.6 'patches' per additional processor. Now, measurement from the experiment for the delay-free case at s = 4 (treatment [-1, -1], response a) indicates 650 'patches' per run. This is 650/4 = 162.5 'patches' per processor per run. So at s = 18, the estimated gain per processor is nine per cent, 14.6/162.5 = 0.09, of individual processor performance at s = 4. While diminished returns are not unexpected, broader circumstances will dictate whether a low return is worthwhile. Certainly, if marginal gain becomes lost in the noise band, host scaling can stop.
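The fitted surface can be queried directly to reproduce this arithmetic; a sketch using the solver coefficients quoted above (14 processors per unit of Xs):

```python
# Sketch: query the fitted fixed-time solver model and recompute the
# marginal gain per processor discussed above.
def patches(x_solver, x_s):
    return 699.8 - 218.3 * x_solver + 204.8 * x_s - 63.2 * (x_solver * x_s)

per_processor_gain = 204.8 / 14          # about 14.6 patches per extra node
baseline = patches(-1, -1) / 4           # roughly 650/4 = 162.5 patches per node at s = 4
print(patches(-1, -1))                   # delay-free run at s = 4, about 650
print(per_processor_gain / baseline)     # about 0.09, the nine per cent figure
```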

3.2. Conditions of applicability

The scalability test and its calculations apply in a general sense to any segment of parallel code running on any system. Having said this, limitations do apply on use. Test results must corroborate, or at least not contradict, prior experience. Parallel code improvement can be subtle; it cannot proceed as a mechanical application of statistical estimation rules. As with any estimation, a scalability result must correspond with what is being observed and what is known a priori. Any extra knowledge an investigator has about a specimen program should be brought to bear. This information can be empirical, e.g., performance expectations from past experience. Alternatively, it can be salient analytical features that are expected to dominate performance. The design objective of the scalability test is to estimate whether a segment is or is not a problem. Such a screening strategy is appropriate for code undergoing improvement.4 The formulation rests upon measurement of runs a, b, c and d in noise to establish four estimated effects: μ, βcd, βs and βcd,s. These effects must be significant in relation to the system noise. A common criterion is a 95 per cent confidence level (±2SE). This level of risk has been chosen for convenience; designed to assist programmers writing scalable software, the test will be administered repeatedly as code is changed. Speed of testing is central. If a scaling result seems odd, it can be repeated. The reliability of the process can be improved via r replications of runs; averaging r runs helps by diminishing SE by √r. However, replication slows testing. Another possibility involves readjusting the noise band. Raising the acceptance level of effects from 2SE to 3SE gives a 99+ per cent confidence in each, thereby improving overall confidence in the set of four. But more effects fall into this broader noise band, since it is 50 per cent wider. This then requires larger perturbations to overcome noise. Large perturbations are not especially good for the test. The scalability test works best with small perturbations in code efficiency (delays) and scale (extra processors). To see this, recall that test validity hinges upon the fidelity of linear interpolations between performance measurements. For smaller changes in factor settings, curvature in the response is of little consequence. In practice, perturbations cannot be shrunk arbitrarily small, for all effects will disappear into the noise of the experiment. Test results are centered among the co-ordinates of the measurement points. Response R should be 'well-behaved' between these treatment settings. Not all specimen programs will qualify. A typical instance of unsuitable R involves code with message congestion, failures and retransmissions on a parallel system's interconnection network; here, R is unpredictable with even minor factor perturbations. One quick pre-test for unfamiliar code is to vary an important parameter while watching response R.4 Should R behave in some unexpected manner, the parallel program must be approached with caution. Of course, this same warning holds for the qualitative scaling test from Snelick et al.5 In summary, there are several limitations: the response R should be 'well-behaved' between treatment setups; estimated terms must be significant; results of scalability testing should not contradict what is known; and, perturbations for setups should be within reasonable limits.

Table VIII. Summary of test steps

    Step   Action                             Notes
    1      Select response of interest        for whole program
    2      Select test segment in program
    3      Determine delay                    not huge, not tiny
    4      Set host sizes                     close together is best
    5      Run treatments, measure            get a, b, c, and d
    6      Plot, or calculate if SE known     stop
    7      Replicate step 5                   get a', b', c' and d'
    8      Calculate SE and results           stop
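The eight steps of Table VIII string together into a short driver. A sketch, assuming a hypothetical user-supplied run(delayed, processors) callable that executes the instrumented program and returns the chosen response:

```python
# Sketch of the Table VIII procedure.  `run(delayed, processors)` is a
# hypothetical callable supplied by the user; it executes the program with or
# without the synthetic delay on the given host size and returns the response.
def scalability_test(run, s_small, s_large, increasing=False):
    a = run(False, s_small)          # step 5: the four treatments of Table IV
    b = run(True, s_small)
    c = run(False, s_large)
    d = run(True, s_large)
    mu      = ( a + b + c + d) / 4
    beta_x  = (-a + b - c + d) / 4
    beta_s  = (-a - b + c + d) / 4
    beta_xs = ( a - b - c + d) / 4
    # Steps 7-8 (replication, SE, significance screening) are omitted here.
    bound = (beta_x / mu) * beta_s
    scales = beta_xs >= bound if increasing else beta_xs <= bound
    return scales, (mu, beta_x, beta_s, beta_xs)
```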

3.3. Graphical shortcuts

Scalability is an interaction between a specimen code segment and system scale. Quick graphical techniques can be adapted to evaluate such interactions (see Reference 6, p. 62). The resulting plots rapidly identify good or questionable scalability circumstances. Because they are simplifications, graphical scalability tests differ slightly from their previously mentioned calculated counterparts. The test graphs work directly on measured data for ease of application. This yields slightly different estimations of scalability, apparent in cases that are marginally acceptable. Fortunately, such minor variations matter little. Another consequence of simplification is diminished precision. Every term in a calculated test involves all measurements (a, b, ...); this promotes the averaging out of noise. The graphs cannot employ this feature and at the same time be easily drawn. The choice was for the latter, ease of application. There are two types of graph. The first handles responses that diminish with added parallel resources. Run time is an example. One expects a parallel code to run faster with more processors. The second graph type involves a response that grows with additional processors; SLALOM's increased resolution exemplifies this circumstance. Whichever the case, the graphical test is summarized in Table VIII, steps one through six. The information layout, seen in Table IX, has treatments labeled by delay settings and numbers of processors. Response measurements a, b, c and d constitute a third column. Scale s defines the X plot axis and response, the Y plot axis. Each row of data in Table IX defines a labeled point. Points with the same label define a line. Table IX thus provides two labeled lines, ac and bd. The orientation of these two lines with respect to each other determines test outcomes.

Table IX. Full information for graph test

    Treatment settings (point label)   Scale (X axis)   Measured response (Y axis)
    s1, no delay                       s1               a
    s1, delayed                        s1               b
    s2, no delay                       s2               c
    s2, delayed                        s2               d

The first case covers diminishing responses; it is illustrated in Figure 1. Lines ac and bd both descend as the scale increases. If projections to the right of these two lines intersect on or above the X axis, the test segment scales relative to the rest of the program. The stipulation ensures that b/a ≥ d/c, so that the proportional influence of the test segment is not increasing. Intuitively, the intersection of ac and bd shows that the program with delay will eventually perform as well as its undelayed version, all other things remaining equal. However, this estimation holds only in the region between s1 and s2, regardless of where the intersection occurs. Conversely, should bd and ac lie parallel or diverge to the right, the code segment does not scale.

[Figure 1. Decreasing response. Axes: Scale (s1 to s2) vs. Response; lines ac and bd both descend.]

[Figure 2. Increasing response. Axes: Scale vs. Response; line ac climbs above line bd.]

Figure 2 shows an increasing response, such as transactions per minute. Line ac climbs as processors are added. Line bd is below ac because perturbation in the test segment cuts output response. The delayed-code program (line bd) realizes b/a of undelayed performance at s1. This fraction should be no smaller at s2. Such a condition can be tested graphically via projections to the left of both lines ac and bd. If the projections do not intersect above the X axis, the condition b/a ≤ d/c holds and the test specimen scales. Figure 2 as drawn depicts an acceptable outcome. Observe that parallel lines, ac ∥ bd, also meet the scalability criterion for this case. While obvious intersections and placements present no problems, graphs cannot be interpreted with much confidence in marginal cases. Also, other dispositions may arise that do not fit either Figure 1 or Figure 2. Since graphing obeys the same conditions of limitation as calculated results, some parallel programs are not going to be suitable. Their graphs, which fail to make sense, signal a need for further information.
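The geometry reduces to comparing the two ratios the text names, so the graphical verdict can also be read off numerically. A sketch covering both figure types:

```python
# Sketch of the graphical criterion: the delayed runs (b, d) must not lose
# ground, proportionally, to the undelayed runs (a, c) as scale grows.
def scales_graphically(a, b, c, d, increasing):
    if increasing:
        return d / c >= b / a     # Figure 2: delayed fraction of output must not shrink
    return d / c <= b / a         # Figure 1: delayed overhead must not grow

print(scales_graphically(40.0, 44.0, 24.0, 29.0, increasing=False))  # Table IV data: False
```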

4. CONCLUSIONS

Strong, general mathematical formulations from DEX serve well in the design of a scalability test for parallel code. Inserted synthetic delays further remove any need to modify the original coding; this greatly expedites test setups and avoids recoding errors. Benchmarks SLALOM and IUB have served as discussion vehicles. The scalability of the larger IUB is reported in the literature;5 the approach here compresses those published results into a more accessible, quantitative format. Sun and Rover have studied SLALOM scalability on an nCUBE system. Because SLALOM's heterogeneous code structure tends to thwart analytic metrics, they have employed performance instrumentation and visualization tools.9 The scalability test provides a different SLALOM perspective that requires neither special analysis nor data-capture instrumentation. Graphical visualization is easy. Yet there is a great amount of flexibility available to the investigator. For example, the choice of dependent variable, the measured response, allows an easy shift from the usual scaling test for a fixed-size problem to SLALOM's fixed-time constraint. Other tests or scalability criteria are possible.


The new scalability test is holistic and is dependent upon measuring external responses rather than recording complex details within a program and host system. Analyses correlate gross behavior that has been observed numerous times. As a result, the new test evades the abstraction errors that so often plague more analytic approaches. Of course, there is a price. Several extra runs are necessary to establish effects or to draw graphs. However, these runs are not burdensome considering the estimates they provide. No special measurement features are needed. Independent of system, programming language, and software specimen, the scalability test provides a portable, quick and easy assay that can guide deeper investigations into causes.

ACKNOWLEDGMENTS

R. Snelick shared additional data on the IUB code and T. Wheatley read and criticized an early draft. Our use of an iPSC/860 hypercube system is courtesy of the Computational Science and Engineering Section, The National Institutes of Health, Bethesda, Maryland. Parts of the text derive from a paper by Lyon and Kacker for the 2nd IEEE Symp. on Software Metrics (October 1994, London). NIST Task 40131 and ARPA Task 7066 provide major support for this work. Contributions of NIST are not subject to U.S. copyright.

REFERENCES

1. S. Graham, P. Kessler and M. McKusick, 'Gprof: a call graph execution profiler', Proc. ACM SIGPLAN Symp. on Compiler Construction, June 1982, 120-126.

2. A. L. Couch, 'Categories and context in scalable execution visualization', Journal of Parallel and Distributed Computing, 18, 195-204 (1993).
3. R. Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons, New York, 1991.
4. G. Lyon, R. Snelick and R. Kacker, 'Synthetic-perturbation tuning of MIMD programs', The Journal of Supercomputing, 8, 5-27 (1994).
5. R. Snelick, J. JáJá, R. Kacker and G. Lyon, 'Synthetic perturbation techniques for screening shared-memory programs', Software-Practice and Experience, 24, 679-701 (1994).
6. T. B. Barker, Quality by Experimental Design, 2nd edn, Marcel Dekker, New York, 1994.
7. G. E. P. Box, W. G. Hunter and J. S. Hunter, Statistics for Experimenters, John Wiley & Sons, New York, 1978.
8. J. Gustafson, D. Rover, M. Carter and S. Elbert, 'slalom.c source text file', Ames Laboratory, Ames, Iowa, 1990.
9. X.-H. Sun and D. T. Rover, 'Scalability of parallel algorithm-machine combinations', IEEE Trans. on Parallel and Distributed Systems, 5, 599-613 (1994).
10. G. James and R. C. James, Mathematics Dictionary, 5th edn, Chapman & Hall, New York, 1992.
