Mapping structured grid three-dimensional CFD codes onto parallel architectures


S. P. Johnson and M. Cross
Centre for Numerical Modelling and Process Analysis, Thames Polytechnic, London, UK

A strategy for mapping three-dimensional CFD codes with structured meshes onto highly parallel computer architectures is described. The strategy is tested on the HARWELL-FLOW3D code using parallel systems based upon transputer technology. Problems of up to 40,000 nodes are solved on parallel systems comprising up to 50 processors. Depending on the physical nature of the problem, the combination of linear solvers used, the mesh size, and the number of processors employed, speedup efficiencies of up to 85% can be achieved.

Keywords: CFD, parallel architectures, parallel algorithms

Address reprint requests to Professor Cross at the Centre for Numerical Modelling and Process Analysis, Thames Polytechnic, London SE18 6PF, UK. Received 5 October 1990; accepted 13 February 1991.

Introduction

Numerical models developed in the context of computational fluid dynamics (CFD) software are now used in almost every major industry sector worldwide. The development and marketing of a range of commercial software products attests to the importance of CFD-based modelling to industry.1-5 Although CFD software is becoming more sophisticated as a modelling framework, one of the main limitations to its utility is the extent to which a geometric mesh may be refined while simulations can still be performed in a practical time scale. A decade ago, problems with 10,000 nodes were considered large scale. Recently, problems with 1,000,000 nodes have been solved on supercomputers,6 while those with 50,000 nodes are routinely solved on the more powerful workstations. However, if CFD-based models and associated software are to be used routinely in many industry sectors, then it will be important to solve problems of the order of a million nodes on small, inexpensive workstations in less than an hour. Such a challenge can only be addressed effectively by utilizing computing architectures with vector and/or parallel processing capabilities.

Much of the practical work on CFD codes has focused upon mapping algorithms onto vector architectures, as reflected in supercomputers such as the CRAY and CDC machines. Work on linear solvers in the context of linear problems demonstrates that considerable speedups can be obtained when the solvers are appropriately vectorized.

For example, Ierotheou7 has demonstrated that vectorized solutions of the potential problem using Jacobi, red-black SOR, and Jacobi preconditioned conjugate gradient solvers all yield speedups of 60-90 over their scalar counterparts. Unfortunately, this speedup does not appear to be realized in the context of CFD codes. Most workers8-10 have reported speedups of only 3-5 on vectorization. Superficially, this is a disappointing result; however, when it is realized that about 10% of the computational effort is essentially nonvectorizable, it is not a surprising one.7,11,12 A nonvectorizable fraction of that size caps the overall gain at roughly a factor of 10, however fast the vector unit.

Work on parallelizing solvers and CFD codes has also progressed over the past few years.13-15 For a small number of processors, say, up to eight, mapping efficiencies of 50-90% have been measured. However, experiences with massively parallel machines have not been so positive. In the last few years the INMOS Transputer has grown in popularity as the core unit of parallel processing architectures.17 Recently, the authors have explored the extent to which three-dimensional (3-D) transient solidification-by-conduction-only problems could be mapped onto parallel architectures using transputer technology.18,19 The results of this exercise, which indicated speedup efficiencies of about 95%, were encouraging and stimulated further work on CFD algorithms.

The main results of the work on CFD algorithms are reported in this paper, which concerns the development of a strategy to map SIMPLE-type pressure correction algorithms for structured grids onto local memory parallel systems. The strategy is then tested practically in the mapping of a widely used commercial code, HARWELL-FLOW3D,4 onto a transputer-based system.


CFD algorithms

The essential equations that describe the movement of a fluid may all be expressed in the form20

\frac{\partial}{\partial t}(\rho\Phi) + \mathrm{div}\,(\rho\mathbf{V}\Phi - \Gamma_\Phi\,\mathrm{grad}\,\Phi) = S_\Phi    (1)

where Φ is the variable value at time t and position r. For a fluid moving in three dimensions, subject to enthalpy transport, and for which the k-ε turbulence model is adequate, the variables and associated expressions for the set of equations to be solved are summarized in Table 1. The domain of integration is then divided into a regular (or structured) grid distribution, and the variables are solved for at each node and/or cell face center. In essence the generic transport equation (1) above is replaced by a set of algebraic equations,

a_P\,\phi_P = \sum_{nb} a_{nb}\,\phi_{nb} + b_P    (2)

representing the relation of φ at the node/cell P to its neighbor nodes/cells. The coefficients, a_nb, and sources, b_P, are all functions of the variables, φ_i, at the node/cell P, and thus a solution is obtained through iteration. The most popular iterative scheme is based on the SIMPLE algorithm of Patankar and Spalding21 and involves the solution of a pressure correction equation (derived from continuity). The essential steps of this algorithm are as follows:

1. Begin with guesses for u, v, w, p and k, ε, h, etc. everywhere.
2. Solve the discretized momentum equations for u, v, and w.
3. Solve the pressure correction equation, p'.
4. Update u, v, w, and p using p'.
5. If included, solve the turbulence model equations for k and ε (or other turbulence model quantities).
6. If included, solve the enthalpy transport equation for h.
7. Return to step 2 if not converged.
8. If included, update variables for the next time step and go to step 2.

Steps 2, 3, 5, and 6 all require a set of coefficients to be generated and a set of linear equations to be solved. Over the past two decades a vast array of variants of the above algorithm have been published and tested. While such algorithm variants may differ in how variable values are distributed (cell centered, cell faces, etc.), in the form of the variables solved for (e.g., velocity components or velocity correction components), or in the terms included in the algebraic equations (e.g., SIMPLEC), the essential tasks outlined above are common to all derivations of the SIMPLE family of algorithms. Recent work on vectorizing SIMPLE algorithms has revealed that it is the generation of the linear equation coefficients, a_nb and b_P, which is not easily vectorizable. Although it is possible to reconfigure the algorithms to take advantage of the vector architecture to a limited extent, a significant proportion of the computational effort required by SIMPLE and its variants, in the range 10-20%,7,18,19 cannot be vectorized. Hence if scalar SIMPLE-based CFD algorithms are to be speeded up by a factor in excess of 10, then some form of parallelization must be utilized.
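To make the structure of equation (2) concrete, the following is a minimal Fortran sketch of a single Gauss-Seidel-style sweep over a structured grid using the seven-point stencil. The coefficient arrays (ae, aw, an, as, at, ab, ap) and the source b are hypothetical names for quantities assumed to have been assembled beforehand; the sketch illustrates the algebraic form of equation (2), not FLOW3D's own solver code.

    ! Sketch: one Gauss-Seidel sweep applying equation (2) on a structured
    ! nx x ny x nz grid.  The seven-point coefficients and the source are
    ! assumed to have been assembled already; boundary cells are held fixed.
    subroutine sweep(phi, ae, aw, an, as, at, ab, ap, b, nx, ny, nz)
      implicit none
      integer, intent(in) :: nx, ny, nz
      double precision, intent(in) :: ae(nx,ny,nz), aw(nx,ny,nz), an(nx,ny,nz), &
                                      as(nx,ny,nz), at(nx,ny,nz), ab(nx,ny,nz), &
                                      ap(nx,ny,nz), b(nx,ny,nz)
      double precision, intent(inout) :: phi(nx,ny,nz)
      integer :: i, j, k

      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            ! phi_P = (sum over neighbours of a_nb*phi_nb + b_P) / a_P
            phi(i,j,k) = (ae(i,j,k)*phi(i+1,j,k) + aw(i,j,k)*phi(i-1,j,k) &
                        + an(i,j,k)*phi(i,j+1,k) + as(i,j,k)*phi(i,j-1,k) &
                        + at(i,j,k)*phi(i,j,k+1) + ab(i,j,k)*phi(i,j,k-1) &
                        + b(i,j,k)) / ap(i,j,k)
          end do
        end do
      end do
    end subroutine sweep

In a SIMPLE iteration, sweeps of this kind are applied in turn to the momentum, pressure correction, turbulence, and enthalpy equations (steps 2, 3, 5, and 6 above).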

Table 1. Diffusion coefficients and source terms of the one-phase conservation equations. For each variable Φ in equation (1) the table gives the diffusion coefficient Γ_Φ and the source S_Φ: continuity (Φ = 1, Γ_Φ = 0, S_Φ = 0); the velocity components u, v, w (Γ_Φ = μ_eff); the enthalpy H; the turbulence kinetic energy k (Γ_Φ = μ_eff/Pr_k, S_Φ = G_K − ρε + G_B); and the dissipation rate ε (Γ_Φ = μ_eff/Pr_ε, S_Φ = (ε/k)[C1(G_K + G_B) − C2 ρε]), where the shear production term is

G_K = μ_t { 2[(∂u/∂x)² + (∂v/∂y)² + (∂w/∂z)²] + [(∂u/∂y) + (∂v/∂x)]² + [(∂u/∂z) + (∂w/∂x)]² + [(∂v/∂z) + (∂w/∂y)]² }

Transputer-based architectures

A cost-effective route to high-speed parallel processing systems involves the use of transputers as the individual processing element (PE). The T800 series transputer16 is a chip with a small on-board cache memory, a processing unit, a direct link to external memory (expandable to 32 megabytes (Mb)), and four links to neighboring PE's (see Figure 1). The operational speed of the T800 is 0.4-1 Mflop, depending on the speed of access to the external memory. Transputers may be assembled in a wide variety of ways.


Most of the work at Thames has been carried out on a system of 19 transputers (comprising a board with a single T800 and 4-Mb external memory plus two boards each with 9 T800's and 1-Mb external memory per chip). Thus the raw power of this particular system is about 11 Mflops with 22 Mb of on-board memory. This performance was obtained using a parallel FORTRAN compiler provided by 3L Ltd.22 The compiler is ANSI standard FORTRAN 77 together with a series of calls which permit a range of processor control and data exchange commands, thus facilitating the concurrent operation of a range of transputers working collaboratively on a single task. Some of the runs were performed on the Edinburgh transputer parallel system, which has up to 400 PE's, each with 4 Mb of external memory. In the application to be described below the PE with 4 Mb of memory is used as the MASTER (see Figure 2).

Since transputers are local memory systems and the data transfer rate between processors is relatively slow (2 Mb/sec), the challenge to the algorithm designer is a more significant one than mapping onto global memory systems, whose bandwidth can approach 500 Mb/sec. This challenge is particularly acute in mapping CFD algorithms, because they tend to process virtually the whole problem data base time and again throughout the iterative procedure. Using a simple farming strategy, as may be appropriate for global memory systems, on transputer-based parallel architectures would result in the effective processing being brought to a halt by the data transfer traffic. Even for a model problem of, say, 10,000 cells the number of transported variables to be solved for may well be about 10, and the auxiliary information at each cell will involve about 20 separate data items. Thus in each iteration of the solution procedure, 3 × 10^6 data items are operated on a number of times. If the problem takes 500 iterations to converge and/or is transient, the dominant role of data transfer traffic is apparent. As such, it would appear that local memory parallel systems could have a great deal to offer in hosting CFD applications, provided the issues of problem partition, data exchange, and processor load balancing can be effectively addressed.

Figure 1. Block diagram of the T800-20 transputer

Figure 2. Simple chain configuration of transputer PE's used in the CFD mapping procedure

A strategy for mapping CFD algorithms with structured grids onto local memory parallel systems

To utilize local memory parallel processing systems effectively from the end users' viewpoint, three criteria must be satisfied:

- efficient use of processing power (i.e., good speedups over scalar performance)
- full use of all available memory (i.e., accommodate problems in direct proportion to the total system memory available)
- invisibility of the "parallel" activities to the user, except for the facilities of faster speed and greater memory available


If a "scalar" code is to operate efficiently when mapped onto local memory parallel systems, then a basic requirement is that the operations in large time-consuming loops should be performed in parallel without destroying the original program semantics. An effective way of achieving this is to distribute the iterations involved in each loop over the number of PE's available and allow them to operate concurrently. This may be achieved on local memory PE's by partitioning the program data evenly across the number available and only performing operations that relate to the program data stored on the PE concerned. Data partitioning will only be effective if most of the data assigned to a particular PE can remain there throughout the program's execution, without the requirement for excessive communications to send or receive data from other PE's. Thus it is important to organize the data base and its partition so that most of the data required on a specific PE are located in its own memory and that data exchanged with other PE's are kept to a minimum.

Ideally, such exchanges should be confined to nearest neighbors; otherwise the communications overhead can quickly rise to unacceptable levels. Another requirement for a successful data partition is that the computation time between data exchanges should be similar (i.e., balanced), avoiding excessive idling time while some PE's have to wait for others. In CFD algorithms it transpires that the computational effort at each node/cell is reasonably similar for a given problem, and so the partition strategy should involve the assigning of an equal number of nodes/cells to each PE. Note that algorithms involving geometric meshes where the number of evaluations at a node may depend critically on local conditions may not be as amenable as CFD algorithms to data partition upon mesh decomposition.

Typically, in CFD algorithms based upon the SIMPLE family of procedures the calculations only involve data associated with near neighbors in the geometric mesh. Certainly, in the case of first-order approximations to terms in the differential equations, only nearest neighbors are involved, and for upwind schemes the data access pattern is very straightforward (see Figure 3). The evaluation of variables held at the control volume P is strictly independent of all but its six nearest neighbors, and this therefore suggests substantial potential for parallelism, since the majority of processing could be performed simultaneously.

Assuming the mesh has a regular structure, then topologically it is cuboidal in shape with nx x ny x nz cells. Three partitioning strategies arise naturally (see Figure 4). In the first, a cuboidal subset of cells is assigned to a separate PE. In the second strategy the cells in a long rectangular box are allocated to each PE. In the third, planes of cells are allocated to a PE. The first two strategies require that disjoint sets of data be exchanged. Such data exchange involves either extra individual communications or bulk data sent by gather and scatter operations. Both strategies involve increases in interprocessor communications overheads. The planar partition can be constructed so that all "bulk" data exchanges involve communication only with nearest neighbors. Aside from the advantage of minimizing interprocessor communication, the planar partition also requires much less restructuring than the other approaches. The disadvantage of this approach is that the number of transputers used in a problem is limited by the number of control volume planes in the problem domain's mesh; i.e., there is a limitation on the scalability of a problem for systems with very large numbers of PE's.

If it is assumed that the solution algorithm only uses the standard seven-point P-N-S-E-W-T-B stencil and the mapping or domain decomposition is based on x-y planes, then a successful solution in the P-N-S-E-W plane only requires information from the neighboring planes containing the T and B cells. The single top and bottom planes of cells located on a PE are overlapped with neighboring PE's. The values assigned to problem variables at each cell in these overlapping planes are supplied by the solvers on adjacent PE's.
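The load-balance requirement just described amounts to giving each PE an equal share of the x-y planes, with any remainder spread over the first few PE's. The following is a small illustrative Fortran sketch of that bookkeeping; the variable names and the 50-plane, four-PE example are hypothetical rather than taken from FLOW3D.

    ! Sketch: divide nz x-y planes of a structured mesh as evenly as
    ! possible over npes processing elements.  PE "pe" is assigned slabs
    ! nlow..nhigh; any remainder planes go to the first few PE's.
    program planar_partition
      implicit none
      integer :: nz, npes, pe, nlow, nhigh, base, extra
      nz   = 50
      npes = 4
      base  = nz / npes
      extra = mod(nz, npes)
      do pe = 0, npes - 1
        nlow  = pe*base + min(pe, extra) + 1
        nhigh = nlow + base - 1
        if (pe < extra) nhigh = nhigh + 1
        print '(a,i3,a,i4,a,i4)', ' PE', pe, ' : slabs', nlow, ' to', nhigh
      end do
    end program planar_partition

For 50 planes on four PE's this assigns 13, 13, 12, and 12 slabs, which is as close to balanced as a planar partition allows.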

Figure 3. Computational stencil used in structured grid CFD codes

Figure 4. Options available for partitioning the mesh and mapping it onto transputer-based parallel systems: (A) three-dimensional, (B) two-dimensional, and (C) one-dimensional partitions


Figure 5. Example showing the partitioning of a structured mesh onto four processors

A diagrammatic representation of the planar partition is shown in Figure 4(c). Note that all nodes/cells are stored in one PE only, and all overlap areas obtain their information from a neighboring PE. Hence, mapping onto a four-processor system would yield the partition shown in Figure 5.
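In storage terms, the partition of Figure 5 means that each PE declares its field arrays over its assigned planes plus a single overlap plane on either side, and restricts its loops to the assigned planes, as described under Code alterations below. The following is a hedged sketch of that layout; the array and bound names are illustrative, not the FLOW3D declarations.

    ! Sketch: storage and loop limits on one PE under the planar partition.
    ! The PE owns x-y planes nlow..nhigh and keeps one overlap plane on
    ! each side, filled with values supplied by the neighbouring PE's.
    program overlap_layout
      implicit none
      integer, parameter :: nx = 20, ny = 20
      integer :: nlow, nhigh, i, j, k
      double precision, allocatable :: phi(:,:,:)

      nlow  = 14                               ! assigned slab range (illustrative)
      nhigh = 26
      allocate(phi(nx, ny, nlow-1:nhigh+1))    ! assignment area plus overlap planes
      phi = 0.0d0

      ! computation touches only the assigned planes; planes nlow-1 and
      ! nhigh+1 are read-only copies owned by the neighbouring PE's
      do k = nlow, nhigh
        do j = 1, ny
          do i = 1, nx
            phi(i,j,k) = phi(i,j,k) + 1.0d0
          end do
        end do
      end do
    end program overlap_layout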

The task of mapping HARWELL-FLOW3D code onto a parallel local memory processing system

The practical task addressed in this work was the mapping of the CFD code HARWELL-FLOW3D4 onto general transputer networks. FLOW3D is a general purpose code developed by AEA Technology which offers a range of capabilities, including the following:

- steady or transient behavior
- one-, two-, or three-dimensional problem domains using either regular or body-fitted meshes
- laminar or turbulent flow (the k-ε and other models are built in)
- enthalpy and other concentrations to be transported

The code has a range of solver options available but is based around the SIMPLE pressure correction approach with structured grids. The particular solution approach utilizes the Rhie and Chow23 procedure, which avoids the necessity for staggered grids, and all variables are defined at the cell center.

Code alterations

The most straightforward mapping of a scalar code onto a local memory parallel processing system would be where each PE executes the scalar code, but only on the set of data assigned to it, together with a minimum of interprocessor data interactions to preserve the scalar code dependencies. To achieve this kind of mapping, the majority of the code alterations involve array declarations and loop limit alterations. The array bounds are adjusted to include only the assignment area and overlap area (where appropriate) of each variable on a particular PE, while the loop limits are adjusted to execute only those iterations that assign values in this PE's assignment area.

Some routines, such as the final maximum residual in the linear equation solvers, necessitate the evaluation of a scalar value on each PE. This is simply implemented in parallel by evaluating the prescribed scalar value (PSV) on each PE relevant to its assigned data set, then passing these values to the MASTER PE in the chain, which evaluates the PSV for the whole data set. This value is then dispatched back to each PE, where relevant.

The vast majority of the communication time required in the parallel code occurs when the data relevant to the overlap areas on each PE are updated. This has to be done whenever the cell/node variables in an overlap area of a PE are updated as one of the assigned cell/node variables on a neighboring PE and before the values in the overlap area are used in any computation. Since the overlap areas hold a large quantity of data, which require frequent updating for many different variables, it is essential that an efficient parallel routine be used. When an overlap area requires updating, each PE (except the first and last in the chain) must receive data relating to a variable on both the upper and lower overlap areas. It must also send assigned values of the variable concerned to neighboring PE's for use in their solution algorithms. Thus four bulk communications are required on every PE, each one concerned with an x-y plane of data. These communications may be performed in parallel using three subsidiary "threads" on each PE (i.e., routines that run concurrently with the main code, sharing the processing effort among them).
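The two communication patterns just described, the bulk overlap update and the MASTER-coordinated evaluation of a prescribed scalar value, are sketched below for a single variable. The sketch uses present-day MPI calls purely for illustration; the original work used the 3L Parallel FORTRAN channel routines, whose interface is not reproduced here, and a collective reduction stands in for the explicit gather-to-MASTER and redistribution described in the text.

    ! Sketch: update of the two overlap x-y planes on each PE and evaluation
    ! of a global maximum residual, written with MPI in place of the 3L calls.
    program overlap_update
      use mpi
      implicit none
      integer, parameter :: nx = 20, ny = 20, nslab = 10
      ! slabs 1..nslab are assigned; planes 0 and nslab+1 are the overlap areas
      double precision :: phi(nx, ny, 0:nslab+1), res_local, res_global
      integer :: ierr, rank, nprocs, up, down, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      phi = dble(rank)

      up   = rank + 1
      down = rank - 1
      if (up   >= nprocs) up   = MPI_PROC_NULL   ! ends of the chain have one neighbour
      if (down <  0)      down = MPI_PROC_NULL

      ! send the top assigned plane up the chain, receive the lower overlap plane
      call MPI_SENDRECV(phi(1,1,nslab), nx*ny, MPI_DOUBLE_PRECISION, up,   1, &
                        phi(1,1,0),     nx*ny, MPI_DOUBLE_PRECISION, down, 1, &
                        MPI_COMM_WORLD, status, ierr)
      ! send the bottom assigned plane down the chain, receive the upper overlap plane
      call MPI_SENDRECV(phi(1,1,1),       nx*ny, MPI_DOUBLE_PRECISION, down, 2, &
                        phi(1,1,nslab+1), nx*ny, MPI_DOUBLE_PRECISION, up,   2, &
                        MPI_COMM_WORLD, status, ierr)

      ! global scalar value (here a maximum residual) combined across all PE's
      res_local = maxval(abs(phi(:,:,1:nslab)))
      call MPI_ALLREDUCE(res_local, res_global, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                         MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
    end program overlap_update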

Parallelization of linear equation solvers

The entire solution procedure has been parallelized using the planar partition described above. Such a partition causes little problem for the explicit operations, such as setting up the coefficients for the discretized PDE's and the correction of pressure and velocities. However, the linear equation algorithms have to solve a set of interconnected equations; many of these algorithms are implicit, involving recurrences, and hence require rather more attention to be efficiently parallelized. Three of the linear equation solvers provided in FLOW3D have been examined here:


- Stone's implicit procedure (SIP)
- line relaxation solver (LSOR)
- conjugate gradient with diagonal preconditioning (PCCG)

These solvers represent the whole range in the way in which they operate. The CG procedure involves only explicit calculations and can be mapped in the same way as the coefficient evaluation. As such, apart from some communication overheads, the calculations are essentially the same in both scalar and parallel form, and it is expected that the convergence behavior will be identical.

The Stone's solver uses recurrences to form its iteration matrix as well as to perform both the forward elimination and the backward substitution. In each case, three nested loops are used; the inner loops pass over the nodes along the matrix diagonals, while the outer loop is inherently serial and passes over the x-y slabs in the z-direction. To exploit parallelism in the SIP solver for the slab-based partition, a reordering of the nested loops is required. The outer slab loop is exchanged with the inner loops to become the inner loop (see Figure 6). This then permits the parallelism in the now outer loops to be exploited while the inner slab loop operates in a pipeline. As shown in Figure 7, the semantics of the scalar code are preserved (i.e., each cell has the same set of previously updated neighbor values), but the calculations are performed in a different cell order.

The scalar line solver is an SOR algorithm, where all the iterative updates of course use the latest values available. However, to map such an algorithm onto a parallel architecture means that either (1) idle time is introduced while the required values are evaluated on other processors or (2) the use of old values is permitted for a certain minimum number of iterative updates, thus allowing all processors to start simultaneously.

The first option yields a strategy similar to that employed in mapping the SIP procedure. Hence the second was used to demonstrate the performance of this kind of procedure. Of course, the success of this latter strategy depends on the increase in the number of iterations required to generate a solution of the linear system that has the same accuracy as the scalar SOR. In the tests shown in the next section, results are presented for the SIP and LSOR solvers used for all variables and for SIP-CG and LSOR-CG combinations where the CG solver is used for the pressure correction equation only. This is because the CG solver can only cope with symmetric matrices.

Serial code:

    DO ISLAB = 1, NSLAB
      DO J = ...
        DO K = ...
          ...
        END DO
      END DO
    END DO

Parallel code:

    DO J = ...
      DO K = ...
        Receive value(s)
        DO ISLAB = NLOW, NHIGH
          ...
        END DO
        Send value(s)
      END DO
    END DO

Figure 6. Example of some of the typical code modifications to FLOW3D in mapping onto the parallel architecture

Figure 7. Original serial cell ordering and the parallel reordering of the structured mesh for mapping onto the transputer system
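The reordered loops of Figure 6 turn the z-direction recurrence into a pipeline along the chain of PE's: for each (J, K) column a PE receives the running value from the PE below, sweeps its own slabs, and passes the result on. The following stripped-down sketch shows that pattern with a simple first-order recurrence standing in for the SIP forward elimination and MPI point-to-point messages standing in for the 3L channel calls; it illustrates the pipeline only and is not the FLOW3D solver.

    ! Sketch: pipelined z-direction recurrence under the planar partition.
    ! Each PE owns nslab slabs.  For every (i,j) column it receives the
    ! running value from the PE below, updates its own slabs, and sends the
    ! result to the PE above -- the pipeline produced by the loop reordering.
    program pipelined_sweep
      use mpi
      implicit none
      integer, parameter :: nx = 8, ny = 8, nslab = 5
      double precision :: x(nx, ny, nslab), xprev
      integer :: ierr, rank, nprocs, i, j, k, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      do j = 1, ny
        do i = 1, nx
          if (rank > 0) then
            call MPI_RECV(xprev, 1, MPI_DOUBLE_PRECISION, rank-1, j*nx+i, &
                          MPI_COMM_WORLD, status, ierr)
          else
            xprev = 0.0d0                     ! boundary value at the first slab
          end if
          do k = 1, nslab                     ! inner loop over this PE's slabs
            x(i,j,k) = 0.5d0*xprev + 1.0d0    ! placeholder recurrence, not SIP
            xprev = x(i,j,k)
          end do
          if (rank < nprocs-1) then
            call MPI_SEND(xprev, 1, MPI_DOUBLE_PRECISION, rank+1, j*nx+i, &
                          MPI_COMM_WORLD, ierr)
          end if
        end do
      end do
      call MPI_FINALIZE(ierr)
    end program pipelined_sweep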

Performance of the parallelized version of HARWELL-FLOW3D

Two fairly standard problems were selected to test the effectiveness of the parallel mapping of FLOW3D. The problems, a backward facing step (B-step) and a moving lid in a cavity, are illustrated in Figures 8 and 9.


Figure 8. Illustration of the geometry of the backward facing step problem and boundary conditions

Figure 9. Illustration of the geometry and boundary conditions of the moving lid cavity problems. Three-dimensionality is induced by the boundary layer effects at the walls


In each case, only laminar flow has been considered, and the fluid is assumed isothermal and incompressible. The problems have fairly simple three-dimensional geometry, but no assumptions have been made to exploit this simple geometry, and so the software should still be able to handle arbitrarily complex geometries with the same parallel efficiencies as those reported below.

In vectorizing SIMPLE-based CFD codes in two dimensions it was clear that the evaluation of the pressure correction coefficients and the solution of the resulting linear system were the dominant components of the procedure in scalar mode. However, it is clear from Table 2 that although the pressure correction coefficient evaluation and solution constitute the single largest component in the solution of both problems, no matter what linear solver is used, it is not as dominant as in the two-dimensional problems.

Figures 10-14 show the speedup versus the number of processors for the four pressure correction linear solvers on the B-step problem. The general trends of the results are as expected; that is, larger problems are more efficient with respect to speedup. The line solver is the least efficient in parallel, as expected, since as the procedure is made explicit to assist parallelization, the convergence performance degrades relative to scalar. In the B-step problem on 50 parallel processors the Stone's solver is about 65% efficient, while the line SOR solver is about 50% efficient. However, using a conjugate gradient solver for the pressure correction equation improves matters by about 15% for the line SOR solver and about 20% for the Stone's solver.

One reason for the relatively poor performance of the parallel line SOR scheme as opposed to its scalar version in the B-step problem is that the flow is essentially parabolic (i.e., predominantly unidirectional). In scalar mode this is exploited by sweeping in the predominant flow direction, propagating the information from inlet to outlet in one iteration. However, in parallel, each sweep is disjoint between processors, and it therefore requires a number of iterations equal to the number of processors before the influence of the inlet is felt at the outlet. This is reinforced by Figure 14, which shows the increase in total iterations required by the parallel line-based solvers in comparison with their scalar counterparts. Note that although the LSOR when used everywhere requires about 10% more total iterations, when combined with the CG solver for the pressure correction this rises to more than 30%. It also induces a rise in the number of CG iterations needed to achieve convergence.

As such, it is no surprise that the speedups shown in Figure 13 are poor compared with those in Figure 12. Nevertheless, efficiencies of up to 70-75% may be observed from the timings made.

The moving lid cavity problem was only run on up to 18 parallel processors (see Figures 15-19). For the larger problems here the efficiency approaches 85%. Notice that the line SOR solver here is the most effective; this is not surprising, because the elliptic nature of the flow does not have a predominant flow direction to be exploited by the scalar line SOR. The increase in iterations using the disjoint parallel scheme over the scalar scheme is much less than on the B-step problem. This is again confirmed by the percentage increase in solver iterations as plotted in Figure 19. Although the increase in LSOR iterations on its own is similar to the B-step problem, when combined with the CG solver the degradation in performance becomes marginal. Perhaps oddly, if the pressure correction equation is solved using a conjugate gradient algorithm, the performance actually degrades a little. In fact, it is not always easy to obtain convergence on the solution of the pressure correction equation using conjugate gradient solvers. The performance of CG solvers on SIMPLE-based CFD procedures is rather erratic, so the result is not as surprising as it may superficially appear. Note too that the Stone's solver is less effective on this problem, with a speedup efficiency of around 70%, which is improved slightly to about 85% on the larger problems when the pressure correction equation is solved using a CG procedure.

A word of caution before proceeding further. In the B-step problem the 28,900-node specification could not be run on anything less than a 10-transputer system (because of memory requirements). The 40,000-node problem could not be run on less than a 20-processor system. Similar constraints applied to the larger nodal density moving lid cavity problems. So to obtain a speedup against the scalar performance, the efficiency was plotted for the larger runs and extrapolated backward. Since the efficiency plots were reasonably linear (see Figures 20 and 21), the data presented should be reasonably consistent. Of course, the degradation in performance of the line-based solvers on the B-step problem using 20+ PE's merely indicates that the convergence performance has degraded more than indicated in Figure 14 above for the smaller problem.
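The backward extrapolation just described rests on the usual definitions of speedup and parallel efficiency, which the paper uses but does not write out. Stated explicitly,

    S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}

so where the scalar time T_1 could not be measured directly (the larger meshes do not fit on a single PE), an efficiency E_p read from the near-linear plots of Figures 20 and 21 yields the quoted speedup as S_p = p E_p.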

Table 2. Distribution of percent computational effort in each stage of the solution process for the model problems considered where a range of linear solvers is used

                                  B-step problem (8640 nodes)    Moving lid cavity problem (7056 nodes)
Item                              SIP    LSOR   S-CG   L-CG      SIP    LSOR   S-CG   L-CG
Coefficient and source u,v,w      17.3   16.8   19.8   24        21.3   16.1   17.4   18.6
Solution u,v,w                    30.5   16.1   35     23.1      37.1   20     30.4   23
Coefficient and source p'         10      9.8   11.5   14        12.5    9.4   10.2   10.8
Solution p'                       33.1   48.7   23.4   26.5      18.3   46.3   33.1   38
Update u,v,w,p                     8.8    8.6   10     12.3      10.8    8.2    8.9    9.4

Figure 10. Speedup results from a range of simulations using FLOW3D on the B-step problem, using SIP as the solver for all variables

Figure 13. Speedup results for FLOW3D on the B-step problem using CG for the pressure correction and LSOR for all other variables


Figure 11. Speedup results for FLOW3D on the B-step problem using CG on the pressure correction and SIP for all other variables

Figure 12. Speedup results for FLOW3D on the B-step problem using the LSOR solver for all variables


Figure 14. Increased computational load on the parallel implementation of the line and L-CG solvers on the B-step problem with 8640 cells

Figure 15. Speedup results for FLOW3D on the moving lid cavity problem using SIP for all variables



Figure 16. Speedup results for FLOW3D on the moving lid cavity problem using CG on the pressure correction and SIP for all other variables

Figure 19. Increase in the computational load on the parallel implementation of the line and L-CG solvers on the moving lid cavity problems with 7056 cells


Figure 17. Speedup results of FLOW3D on the moving lid cavity problem using LSOR for all the variables


Figure 20. Example of the efficiency of the mapping of FLOW3D using various linear solvers on the backward step problem with 17,640 cells


Figure 18. Speedup results for FLOW3D on the moving lid cavity problem using CG on the pressure correction and LSOR for all other variables


Figure 21. Example of the efficiency of the mapping of FLOW3D using various linear solvers on the moving lid cavity problem with 17,424 cells

Figure 22. The computational effort involved in communication, idle time, and other factors per processor on the B-step problem with 17,640 nodes when SIP is used on all variables

Figure 23. Percentage time spent in all but the computation tasks for the B-step problem using SIP on all variables

Figure 24. Breakdown of the effort involved in the solution of the B-step problem with 17,640 nodes where SIP is used on all variables

At this stage it is worthwhile asking where the loss of performance occurs. In assessing performance degradation there are three factors to take into consideration:

- the communication of information between processors at various stages of the iterative solution
- the imbalance of the mapping and therefore the idle time of processors during iterations
- the time spent in duplicate calculations or in the processing of additional instructions

Figure 22 shows a typical example of how these factors vary with the size of the grid for the Stone's solver on the B-step problem. Because of a limitation of the compiler we could only record the combined communications and processor idle time. Note that the overhead degradation in performance remains remarkably constant above a certain number of processors (the actual number being dependent on the problem, of course). Although the interprocessor overheads remain constant in absolute terms as the number of processors increases, they obviously increase as a percentage of the total elapsed computational task. This is illustrated in Figures 23 and 24 for the B-step problem.
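A compact way to read Figures 22-24, consistent with the description above although not written out in the paper, is to decompose the elapsed time per iteration on p processors as

    T_p \approx \frac{T_{calc}}{p} + T_{comm+idle} + T_{other}

with the communication-plus-idle term roughly independent of p beyond a certain point. The computation term shrinks as 1/p, so the overhead fraction, and hence the loss of efficiency, necessarily grows with the number of processors even though the absolute overhead does not.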

Discussion and conclusion

In this contribution we have attempted to show how structured mesh 3-D CFD codes can be effectively mapped onto highly parallel computer architectures.

It turns out that the simplest approach one might take, that of mapping two-dimensional layers of the mesh onto processor elements in a simple chain configuration and organizing the calculation so that it communicates only with neighboring elements, appears to be very effective. Our practical experience in mapping HARWELL-FLOW3D onto transputer-based systems demonstrates quite clearly that parallel efficiencies of up to 85% can be achieved. In this paper we have described the mapping of a nearest-neighbor seven-point stencil scheme with Cartesian geometry. In fact, the whole procedure as described carries over directly to cylindrical or spherical polar and more general body-fitted coordinate systems for seven-point stencil schemes. For numerical schemes that involve higher-order approximations the procedure is a little more complicated, with the overlap x-y (or more generally, i-j) planes being 2 or 3 deep. However, the strategy used remains identical to that described in the body of this paper.

As the scalar processor power of workstations approaches 20 Mflops (see, for example, the IBM RS6000 series), it is likely that transputer-based parallel systems, where even a 40-processor configuration gives only 16 Mflops of raw power, will not prove cost effective. However, since even 20-Mflop scalar workstations will be inadequate for the million-node problem, the need for parallel systems where the processing power of each element is 20 Mflops is evident. Indeed it could be argued that the optimal solution for CFD codes is a computer architecture that consists of a number of processing elements where each has a vector capability. With the emergence of the INTEL i860 chip, with performance characteristics of 8 Mflops (scalar) and 40 Mflops (vector), it would appear that combined parallel/vector systems will shortly be available as super workstations. Since SIMPLE-based CFD solution procedures can be made to yield speedups of 5 on vectorization, it may turn out that systems embedding the i860 (and other similar) chips will prove to be the ideal solution for CFD practitioners.

Acknowledgments

The authors wish to thank AEA Technology, Harwell, for making available the HARWELL-FLOW3D code and the Edinburgh Parallel Computing Centre for making their system available. They also wish to acknowledge the contributions of Peter Leggett, Cos Ierotheou, Ian Jones, and Suzanne Simcox in sorting out the inevitable problems of a task of this kind. S. Johnson wishes to acknowledge SERC for a research studentship.

References

1. PHOENICS, CHAM Ltd, London, UK
2. FLUENT, Flow Simulation Ltd, Sheffield, UK
3. FIDAP, FDI Inc., Evanston, IL
4. HARWELL-FLOW3D, AEA Technology, Harwell, Oxon, UK
5. ASTEC, AEA Technology, Harwell, Oxon, UK
6. Rizzi, A. and Purcell, C. J. Inviscid vortex-stretched turbulent shear-layer flow computed around a cranked delta wing. Commun. Appl. Numer. Methods 1986, 2, 139
7. Ierotheou, C. S. The simulation of fluid flow processes using vector processors. Ph.D. Thesis, Thames Polytechnic, 1990
8. Kordulla, W. Vectorisation of algorithms in CFD on the CRAY-1 vector computer. Vectorization of Computer Programs with Applications to CFD, ed. W. Gentzsch, Vieweg Publishers, 1984, pp. 157-171
9. Barrel, M. et al. Implementation of 3D explicit Euler codes on a CRAY 1-S vector computer. The Efficient Use of Vector Computers with Emphasis on CFD, eds. W. Schonauer and W. Gentzsch, Vieweg Publishers, 1985, pp. 47-65
10. Koppenol, J. Simulation of 3D Euler flows on a CYBER 205 vector computer. The Efficient Use of Vector Computers with Emphasis on CFD, eds. W. Schonauer and W. Gentzsch, Vieweg Publishers, 1985, pp. 71-92
11. Ierotheou, C. S., Richards, C. W., and Cross, M. Vectorisation of the SIMPLE procedure for CFD problems. Part I: A basic assessment. Appl. Math. Modelling 1989, 13, 524-529
12. Ierotheou, C. S., Richards, C. W., and Cross, M. Vectorisation of the SIMPLE procedure for CFD problems. Part II: The impact of using a multigrid method. Appl. Math. Modelling 1989, 13, 530-536
13. Rodrigue, G., ed. Parallel Processing for Scientific Computing. Society for Industrial and Applied Mathematics, 1989
14. Brebbia, C. A. and Peters, eds. Applications of Supercomputers in Engineering: Fluid Flow and Stress Analysis Applications. Elsevier, New York, 1989
15. Carey, G. F., ed. Parallel Supercomputing: Methods, Algorithms and Applications. Wiley, New York, 1989
16. Patel, M. R. et al. A parallel numerical simulation for supersonic flows using zonal overlapped grids and local time steps for common and distributed memory multiprocessors. Applications of Supercomputers in Engineering: Fluid Flow and Stress Analysis Applications, eds. C. A. Brebbia and Peters, Elsevier, New York, 1989, pp. 89-104
17. The Transputer, INMOS Ltd, Bristol, UK
18. Cross, M., Johnson, S., and Chow, P. Mapping enthalpy-based solidification algorithms onto vector and parallel architectures. Appl. Math. Modelling 1989, 13, 702-709
19. Johnson, S., Cross, M., and Leggett, P. Casting simulation on highly parallel computer architectures. Proceedings of the International Conference on Modelling of Casting, Welding and Advanced Solidification Processes, Davos, Switzerland, Sept. 1990
20. Patankar, S. V. Numerical Heat Transfer and Fluid Flow. Hemisphere, 1980
21. Patankar, S. V. and Spalding, D. B. A calculation procedure for heat, mass and momentum transfer in three-dimensional parabolic flows. Int. J. Heat Mass Transfer 1972, 15, 1787-1806
22. Parallel FORTRAN, 3L Ltd, Livingstone, Scotland
23. Rhie, C. M. and Chow, W. L. Numerical study of the turbulent flow past an airfoil with trailing edge separation. AIAA J. 1983, 21, 1525-1532
