A new finite element concurrent computer program architecture


INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING, VOL. 24, 1771-1792 (1987)

A NEW FINITE ELEMENT CONCURRENT COMPUTER PROGRAM ARCHITECTURE

CHARBEL FARHAT

Department of Aerospace Engineering and Center for Space Structures and Controls, University of Colorado at Boulder, Boulder, CO 80309-0429, U.S.A.

AND

EDWARD WILSON

Department of Civil Engineering, Division of Structural Engineering and Structural Mechanics, University of California at Berkeley, Berkeley, CA 94720, U.S.A.

Received 2 December 1986; revised 8 January 1987

SUMMARY

A new computer program architecture for the solution of finite element systems using concurrent processing is presented. The basic approach involves the automatic splitting of an arbitrary spatial domain. Processors are dynamically reassigned during the several phases of an analysis. Direct and iterative solution strategies are considered. Computational algorithms for finite element dynamic analysis of large-scale structural problems that exploit concurrent features of MIMD computers are implemented in modules around the basic architecture. Also, problems with localized non-linearities are treated. A first implementation on a 32-processor hypercube achieves efficiency rates of up to 90 per cent.

1. INTRODUCTION

Recently, a respectable body of theoretical literature has emerged on 'parallelizing' the finite element method. Most of the research community appears to be focused on parallel equation solvers (Jacobi-like, SOR-like, conjugate gradient, ...) and/or extraction of eigenvalues.1-6 Here we treat the problem as a whole rather than the sum of disconnected parts. We define a new concurrent computer program architecture capable of handling linear and non-linear, static and dynamic analyses. We design it with enough flexibility to be implemented on the whole class of local memory multiple instruction multiple data stream (MIMD) machines. The new program includes several special purpose modules which constitute optimal algorithms for solving specific subproblems such as domain decomposition, direct and iterative solution of systems of equations, solution of transient dynamics and adaptive mesh refinement. These modules are designed to be consistent with the topology and data structure of the overall program architecture so that a maximum overall efficiency can be achieved during a complete finite element analysis. Figure 1 summarizes this approach.

2. ARCHITECTURE OF A CONCURRENT PROCESSING PROGRAM

In structural analysis the technique of dividing a large system into a system of substructures is very old and is still used extensively. For large aerospace structures, its use is often motivated by the fact that different components are designed concurrently by different groups or companies. Therefore only the basic static and dynamic properties of each substructure need to be communicated between groups.



[Figure 1. Overall organization: mesh generation and automatic domain decomposition feed a distributed architecture in which geometrical data, basic static and dynamic element properties and control parameters are mapped onto new data structures; a concurrent problem set-up is followed by static analysis modules (formation of the subdomain stiffness and load arrays, static solution with look-ahead) and dynamic analysis modules (formation of the subdomain mass and damping arrays, concurrent dynamic analysis). No interprocessor communication is required up to this point, and data structure compatibility keeps the overhead to a minimum.]

This approach has also resulted in computational savings. If the design of one component is changed, only that substructure needs to be re-analysed and the global system of substructures re-solved. In the case of limited non-linear systems, only the substructures which are non-linear need be studied incrementally with time. It is clear that this traditional substructuring approach can be used with concurrent processors if the complete finite element system is subdivided so that each group of elements within a small domain is assigned to one processor. It is also apparent that the data structure for such an approach is very simple, since only the node geometry and element properties within the subdomain need be stored within the RAM of the processor assigned to that subdomain.


In addition, concurrent formation and reduction of the stiffness matrix for that region require no interprocessor communication. After the displacements are found, the post-processing of subdomain stresses can be done concurrently.

3. AUTOMATIC DOMAIN SUBDIVISION

The first step involved in this approach is to automatically subdivide an arbitrary finite element domain into the same number of subdomains as there are available processors. The objective of such partitioning is to create subdomains which require approximately equal time to form, reduce and post-process, in order to balance the computational load among the processors.

Figure 2. Finite element mesh

Figure 3. Automatic domain subdivision: (a) 2 processors; (b) 4 processors; (c) 6 processors


The new algorithm which accomplishes this has been presented in Reference 7. It is summarized in box 1 (see Appendix I) and applied to the planar wing shown in Figure 2. Triangular elements with 3 and 6 nodes as well as quadrilateral elements with 4, 5 and 8 nodes are used to discretize the domain. The resulting subdomains when 2, 4 and 6 processors are activated are shown in Figure 3.
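As a purely illustrative aside (not the algorithm of Reference 7, which also enforces subdomain connectivity), the load-balancing objective behind the subdivision can be sketched as a greedy assignment of elements to processors, where the cost of an element is estimated, for example, from its number of nodes. The element list and costs below are hypothetical.

```python
# Illustrative sketch only: a greedy balancing of element work across N_p
# processors. This is NOT the subdivision algorithm of Reference 7, which also
# enforces subdomain connectivity; it merely shows the load-balancing objective.

def greedy_partition(element_costs, n_procs):
    """Assign each element to the currently lightest processor.

    element_costs : list of (element_id, cost) pairs, where cost estimates the
                    form/reduce/post-process work (e.g. the node count).
    Returns (assignment, loads): processor -> element ids, and per-processor work.
    """
    loads = [0.0] * n_procs
    assignment = {p: [] for p in range(n_procs)}
    # Place the most expensive elements first for a tighter balance.
    for elem, cost in sorted(element_costs, key=lambda ec: -ec[1]):
        p = min(range(n_procs), key=lambda q: loads[q])
        assignment[p].append(elem)
        loads[p] += cost
    return assignment, loads

if __name__ == "__main__":
    # Hypothetical mesh mixing 3/6-node triangles and 4/5/8-node quadrilaterals,
    # with the cost taken as the node count.
    costs = list(enumerate([3, 6, 4, 8, 5, 3, 6, 8, 4, 4, 5, 3]))
    parts, loads = greedy_partition(costs, n_procs=4)
    print(loads)   # per-processor estimated work
```

In the actual program the subdivision additionally produces geometrically connected subdomains, and the interface nodes between them are identified automatically.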

4. BASIC EQUATIONS FOR THE COMPLETE FINITE ELEMENT DOMAIN

In Figure 4 an arbitrary finite element domain is shown automatically subdivided into a system of subdomains D_j. The mesh nodes which are common to the subdomain interfaces define a unique global interface denoted D_I. They are automatically identified by the program. The nodal point unknowns within the subdomains are numbered first, and the interface nodes are numbered last. For a problem in static structural mechanics the resulting stiffness matrix K will have an 'arrow' pattern. Each diagonal submatrix K_ss(j) represents the local stiffness of a subdomain D_j. An off-diagonal submatrix K_si(j) denotes the coupling stiffness between D_j and the interface D_I.

Figure 4. Automatic splitting of an arbitrary FEM domain

[Figure 5. Mapping of the processors for static solution: the symmetric 'arrow' pattern of K, with diagonal blocks K_ss(1), ..., K_ss(N_p), coupling blocks K_si(j) and the interface block K_II; row i of K_II is stored in processor mod(i - 1, N_p).]


Block K_II is the stiffness associated with D_I. Consequently, the vector of unknown responses U is partitioned into subvectors U_j which correspond to the degrees of freedom lying in region D_j; similarly, F_j denotes the part of the loading vector F associated with D_j. The diagonal blocks K_ss(j) are uncoupled and can be factored concurrently. Each processor p_j is mapped onto a subdomain D_j. It forms, assembles and stores one diagonal block K_ss(j), its corresponding coupling term K_si(j) and the loading array F_j. Note that this partitioning/mapping scheme, summarized in Figure 5, is such that for any arbitrary finite element domain the communication structure is always predetermined (and actually invariant).

5. DATA STRUCTURE

Since all the subdomains D_j are separated by the interface, they constitute a set of disconnected meshes that can be numbered independently. The selected data structure for each diagonal block K_ss(j) is the profile type. An internal renumbering scheme is applied separately to each subdomain in order to minimize the storage. Moreover, the pattern of subdivision discussed previously is such that all sub-meshes can be renumbered concurrently without any message transfer between processors. This saves a considerable amount of time in the preprocessing phase. The coupling blocks K_si(j) are usually very sparse: only the non-zero values are stored. In addition to K_ss(j), K_si(j) and F_j, each processor holds a set of rows of K_II, which is the stiffness associated with the interface D_I.
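As a minimal sketch of what 'profile type' storage means for one diagonal block K_ss(j): only the entries of each column from its first non-zero down to the diagonal are kept in a packed one-dimensional array, together with a pointer to each diagonal term. The small matrix and the access helper below are hypothetical and serve only to illustrate the storage scheme; the renumbering that minimizes the column heights is not shown.

```python
import numpy as np

# Minimal sketch of profile ("skyline") storage for a symmetric matrix:
# for each column j, only the entries from the first non-zero row down to the
# diagonal are kept, in one packed 1-D array plus a diagonal pointer array.

def to_skyline(K):
    n = K.shape[0]
    values, diag_ptr = [], []
    for j in range(n):
        nz = np.nonzero(K[: j + 1, j])[0]
        first = nz[0] if nz.size else j          # start of the column height
        values.extend(K[first: j + 1, j])        # packed column segment
        diag_ptr.append(len(values) - 1)         # index of K[j, j] in 'values'
    return np.array(values), np.array(diag_ptr)

def skyline_get(values, diag_ptr, i, j):
    """Return K[i, j] (symmetric access) from the packed arrays."""
    i, j = min(i, j), max(i, j)
    height = diag_ptr[j] - (diag_ptr[j - 1] + 1 if j else 0) + 1
    first = j - height + 1
    return values[diag_ptr[j] - (j - i)] if i >= first else 0.0

if __name__ == "__main__":
    K = np.array([[4., 1., 0., 0.],
                  [1., 5., 2., 0.],
                  [0., 2., 6., 3.],
                  [0., 0., 3., 7.]])
    vals, dptr = to_skyline(K)
    assert skyline_get(vals, dptr, 3, 2) == 3.0 and skyline_get(vals, dptr, 0, 3) == 0.0
```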

'l'he algorithm for the solution of the complctc h i t c element system is given in box 2 (see Appendix 11). After each processor has completed the formation of K g ( j ) and F,*(,j), these arrays are concurrently redistributed antong the processors such that row i of K , , is stored in processor # mod(i - 1, N,,).') The global set of equations which must be solved in phase 2 can be very large. Therefore, a spccial parailel equation solver has been developed in order to minimize solution time for this phase; it featurcs a concurrcnt LDC' factorization as well as parallel forward and backward substitutions which will be extensively used in dynamic analysis modul~ s. '. ~ Finally, after each processor pj has back substituted for its displaccmcnts, it continues with a trivial concurrent stress evaluation free from any interprocessor communication. So far, the reader is given the impression that this substructuring approach has been tailored t o suit dirccl solutions. Tn the next section, we show how a general and efficient iterative scheme is easily built around this architecture and benefits from the uncoupling between the K q F ( jblocks. )

7. A CONCURRENT ITERATIVE STRATEGY FOR VERY LARGE SYSTEMS

It is widely accepted that for very large three dimensional finite element systems with, say, over 10 000 equations and large bandwidth, iterative solution techniques compete with direct schemes from the computational time aspect. When these systems arise, their sparsity, and hence their low storage requirements, constitute an additional motivation for iterative solutions. Today, as commercial multiprocessors are becoming available, the scientific community is focusing on 'parallelizing' these schemes for faster solutions.10-13 In Reference 11, the iterative scheme is not guaranteed to converge since it is a Jacobi type iteration. Hence, it cannot be implemented in a general purpose finite element concurrent program. In References 12 and 13, a multicolour SOR iteration is described. It allows a Gauss-Seidel type of iteration to be carried out concurrently in the same fashion as Jacobi iteration. Convergence is guaranteed, but the scheme is applicable


only to problems where the geometrical domain is a rectangle or some other regular two or three dimensional region, and provided that the same discretization pattern is repeated at each grid point. Unfortunately, most analysed structures do not fit within this category of rectangular or regular domains (nozzles, shells, free-form surfaces and volumes, etc.). Also, it is common practice in finite element modelling to use several types of elements with varying numbers of nodes (e.g. for adaptive mesh refinement, composite structures, skew plates, zones of stress concentration), so that the pattern of discretization is seldom repeated at all mesh points. Here we propose an alternative to the multicolouring technique that allows almost parallel SOR iterations, without any constraint on the geometrical domain to be analysed and without any restriction on the pattern of discretization. Moreover, the proposed stratagem derives naturally from the general finite element concurrent program architecture which has been described so far. While the multicolouring approach results in local uncoupling, the substructuring approach yields a global uncoupling: the diagonal blocks K_ss(j) are mutually uncoupled. Let each processor hold the pair {K_ss(j), K_si(j)} as well as F_j and U_j. Then it is clear that the updates {U_j^(k+1), j = 1, 2, ..., N_p} can all be performed concurrently by solving in parallel the subsystems

K_{ss}(j) U_j^{(k+1)} = F_j - K_{si}(j) U_I^{(k)},   j = 1, 2, ..., N_p

The components of U_j^(k+1) are updated locally in each processor using sequential SOR iterations. It remains to update the unknowns associated with block K_II, which is coupled to all of the K_ss(j) terms. D_I is resubdivided in the same fashion as D into a set of disconnected subdomains D_j(2) tied together with a common interface D_I(2), where the subscript 2 indicates a second pass of subdivision. Recursively, the same pattern of partitioning is carried out until whichever happens first: (1) there no longer exists a level of subdivision where at least two subdomains are disjoint, or (2) the size of the current interface (the number of degrees of freedom attached to it) is less than a critical value specified by the user. The algorithm that executes this recursive substructuring is an extension of the one presented in box 1. A map colouring technique is added so that no two subdomains at the same level are connected. The final pattern of the stiffness matrix and the processor assignment are shown in Figure 6.
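A serial sketch of the resulting block iteration is given below: the subdomain solves are mutually independent and could proceed on separate processors, after which the interface unknowns are relaxed. For simplicity the interface block is relaxed here with a plain SOR sweep rather than the recursive resubdivision described above, and the relaxation factor and data are hypothetical.

```python
import numpy as np

# Single-process sketch of the block iteration: the subdomain updates are
# mutually independent (they could run on separate processors); the interface
# block is then relaxed. The recursive resubdivision of D_I described in the
# text is not reproduced here; a plain SOR sweep on K_II is used instead.

def block_sor(subdomains, K_II, F_I, omega=1.3, sweeps=50):
    U = [np.zeros_like(sd["F"]) for sd in subdomains]
    U_I = np.zeros_like(F_I)
    for _ in range(sweeps):
        # (1) concurrent subdomain solves: Kss(j) U_j = F_j - Ksi(j) U_I
        U = [np.linalg.solve(sd["Kss"], sd["F"] - sd["Ksi"] @ U_I)
             for sd in subdomains]
        # (2) interface update: SOR on K_II U_I = F_I - sum_j Ksi(j)^T U_j
        rhs = F_I - sum(sd["Ksi"].T @ u for sd, u in zip(subdomains, U))
        for i in range(len(U_I)):
            sigma = K_II[i] @ U_I - K_II[i, i] * U_I[i]
            U_I[i] += omega * ((rhs[i] - sigma) / K_II[i, i] - U_I[i])
    return U, U_I

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def spd(n):
        A = rng.standard_normal((n, n)); return A @ A.T + n * np.eye(n)
    # Weak coupling (0.1 factor) keeps this little demonstration convergent.
    subs = [{"Kss": spd(4), "Ksi": 0.1 * rng.standard_normal((4, 2)),
             "F": rng.standard_normal(4)} for _ in range(3)]
    U, U_I = block_sor(subs, spd(2), rng.standard_normal(2))
```

The point of the sketch is only the global uncoupling: step (1) involves the K_ss(j) blocks alone and requires no exchange of information between subdomains within a sweep.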

[Figure 6. Pattern of the stiffness matrix after N_s recursive subdivisions.]


Figure 7. Automatic recursive subdivision for iterative solution

Let us consider the wing of Figure 2, where 3 and 6 node triangular elements as well as 4, 5 and 8 node quadrilateral elements are used to discretize the domain. The total number of degrees of freedom is 218. The resulting partitionings when 2, 4 and 6 processors are activated are shown in Figure 7. For the two processor case, after two levels of subdivision D_I(2) contains only one node. Thus, only 2 out of the 218 equations are serialized at each global iteration. When 4 processors are mapped on the mesh, three levels of subdivision are carried out and D_I(3) is empty. In this case, no serialization occurs and the scheme becomes perfectly parallel. The detailed implementation of the concurrent block SOR iterations has been presented in Reference 14.

8. CONCURRENT DYNAMIC ANALYSIS

A method of dynamic analysis for systems subjected to a fixed spatial distribution of the dynamic load was introduced by Wilson et al.15 as an economic alternative to classical mode superposition. It is based on the superposition of a special class of Ritz vectors generated from the spatial distribution of the dynamic load. These vectors are evaluated by a simple recurrence algorithm at a fraction of the computational effort required for the calculation of exact mode shapes. They are applied to transform the dynamic equilibrium equations of the FEM formulation

M\ddot{U} + C\dot{U} + KU = F(s, t)

to a reduced system of equations that can be written as

M^*\ddot{Y} + C^*\dot{Y} + K^*Y = F^*(s, t)

where

U = XY;   M^* = X^T M X;   C^* = X^T C X;   K^* = X^T K X;   F^* = X^T F

and M*, C*, K* are square matrices much smaller in size than the original square matrices


M, C, K. By exploiting the similarity between the vector generation algorithm used in this Ritz type reduction method and the Lanczos method,16 it becomes possible to form the reduced tridiagonal matrix T_r directly from the orthonormalization coefficients calculated while generating the vector basis. The main advantages of this strategy are to avoid the explicit calculation of the transformation X^T K X and to further minimize the bandwidth and storage requirements of the generalized co-ordinate system. This variation applied to the original formulation of Reference 15 is presented in Box 3 (see Appendix III). A concurrent implementation of this variation on MIMD machines that has been detailed in References 17 and 18 is summarized in the following section.
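A serial sketch of this Lanczos-like generation of the load-dependent (WYD) vectors is given below: the first vector is the static response to the spatial load pattern, each subsequent vector is obtained from the recurrence K xbar = M x_{i-1}, and the orthonormalization coefficients alpha_i, beta_i fill the tridiagonal T_r directly, so that X^T K X is never formed explicitly. Dense solves and a diagonal (lumped) mass matrix are assumed for brevity; this is a textbook-style reconstruction, not the Box 3 algorithm itself.

```python
import numpy as np

# Serial sketch of load-dependent Ritz (WYD) vector generation in its
# Lanczos-like form: vectors are generated by the recurrence K xbar = M x_{i-1}
# and M-orthogonalized; the coefficients alpha, beta fill the tridiagonal T_r
# directly. Dense solves and a lumped (diagonal) mass matrix are assumptions.

def wyd_lanczos(K, M_diag, F, r):
    n = len(F)
    M = np.diag(M_diag)
    X = np.zeros((n, r))
    alpha, beta = np.zeros(r), np.zeros(r)      # beta[0] is unused in T_r
    w = np.linalg.solve(K, F)                   # static response to the load pattern
    X[:, 0] = w / np.sqrt(w @ M @ w)            # M-normalize the first vector
    for i in range(r):
        w = np.linalg.solve(K, M @ X[:, i])     # recurrence K xbar = M x_i
        alpha[i] = X[:, i] @ M @ w
        w -= alpha[i] * X[:, i]
        if i > 0:
            w -= beta[i] * X[:, i - 1]
        if i + 1 < r:
            beta[i + 1] = np.sqrt(w @ M @ w)
            X[:, i + 1] = w / beta[i + 1]
    # Tridiagonal T_r; its eigenvalues theta approximate 1/omega^2.
    T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
    return X, T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 6)); K = A @ A.T + 6 * np.eye(6)
    M_diag = rng.uniform(1.0, 2.0, 6)
    X, T = wyd_lanczos(K, M_diag, rng.standard_normal(6), r=4)
    theta = np.linalg.eigvalsh(T)
    omega2 = 1.0 / theta                        # approximate squared frequencies
```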



8.1. A concurrent implementation of the modified WYD technique

The algorithm shown in Box 3 consists essentially of four phases: (1) triangularization of the stiffness matrix K; (2) generation of the WYD vectors X and construction of T_r; (3) solution of the reduced eigenvalue problem

T_r Z = Z Ω

(4) computation of the final Ritz vectors and of the corresponding frequencies

X_f = X Z,   [ω_i^2] = [1/Ω]

For linear analyses, phase 1 is done only once. The algorithm (CP) described in Section 6 is used for this purpose. Once phase 1 is completed, all processors are ready for the next task. For maximum efficiency, the host (or Manager) is used as an additional processor in the execution of the remaining phases. It is mapped onto the 'interface' subdomain D_I. If the analytical model uses a lumped mass matrix M, then each processor p_j stores, in addition to the arrays mentioned above, a vector M_j corresponding to the point masses attached to the FEM nodes of subdomain D_j; the Manager holds M_I. If a consistent mass matrix is used to model the structure, then M has the same pattern as K and each processor is assigned a pair of blocks M_ss(j), M_si(j); the Manager holds M_II. However, most structural finite element dynamic analyses assume a lumped mass matrix. For the sake of simplicity, we will assume that M is diagonal. Figure 8 shows the mode of storage of the relevant arrays for the dynamic analysis. The first step in phase 2 consists of the solution

[Figure 8. Mode of storage of the different arrays: row i of each globally shared array is stored in processor mod(i - 1, N_p).]


of the triangular system K X^(i) = M X^(i-1). This is parallelized as shown in box 4 (see Appendix IV). The other steps of this phase are purely vector manipulations consisting of inner products and scaling. On multiprocessors with local memory, an inner product is carried out concurrently as follows. Recall that each processor p_j stores in its local memory a part of X, namely X_j, of length l_j. First, a local dot product is performed for those parts of the vectors in each processor. Then, the accumulation of these local dot products is done in log2(N_p) stages, following a binary tree. As noted earlier, the T_r matrix is tridiagonal symmetric of size r x r, where r is the selected number of WYD vectors to be generated. Further details on the choice or evaluation of r are given in References 19 and 20. Only the fact that r is usually very small is relevant to this implementation. Hence, T_r can be duplicated concurrently with a very low storage cost in each processor, as a set of two one dimensional arrays storing respectively the main and the upper diagonals. The solution of the reduced eigenvalue problem T_r Z = Z Ω is the object of phase 3 of our stratagem. This phase of the overall analysis does not involve a significant amount of computational effort since the system is tridiagonal. However, it can be further reduced by the use of concurrent processing. The reader should recall that, at this stage, each processor holds in its local memory a copy of the tridiagonal symmetric matrix T_r. A simple algorithm for the concurrent extraction of all of the eigenvalues of T_r has been described in Reference 17. Basically, the spectrum of eigenvalues is divided into N_p + 1 subintervals [s_p, e_p] of equal length. The task of each processor p is to compute the eigenvalues of T_r which lie within its assigned interval [s_p, e_p], as the roots of the polynomial P(λ) = det(T_r - λI). A sequential search method (regula falsi) together with the Sturm Sequence Property is used for this purpose. The corresponding eigenvectors are obtained via Inverse Shifted Iteration. This procedure has two advantages:

(1) no overlapping between the frequency subdomains can occur, and hence none of the eigenvalues is computed twice by two different processors; (2) all processors perform their task concurrently without any communication. However, if this algorithm wins on simplicity, it suffers from two major inefficiencies:


(1) The subdivision of the spectrum of eigenvalues into N_p + 1 subintervals of equal length does not always guarantee a load balance among all the processors. In particular, some of the frequency subdomains that are defined may be empty and hence may cause the corresponding processor to remain idle. (2) As stated, the algorithm computes all the eigenvalues of T_r, while only those eigenvalues of T_r which have converged to the eigenvalues of the original system are really needed. To remove these two inefficiencies, a different implementation was developed and described in Reference 18. It is based on the mathematical work of Reference 20, where a subprogram called ANALZT is designed to be implemented in the inner loop of the Lanczos program. At each step of the Lanczos algorithm, a certain data structure associated with T_j (j <= r) is updated. It includes a few eigenvalues of T_j at the end of the spectrum, and with each eigenvalue is associated an error bound. The number of eigenvalues at the end of the spectrum is variable and depends on the error bound in a very complicated way. Roughly speaking, the eigenvalues whose bounds indicate that they are likely to stagnate in the next two or three steps are included. Since ANALZT delivers intervals where one and only one eigenvalue lies, a parallel search becomes possible. Each delivered search interval is subdivided into N_p + 1 subintervals. There are then N_p interior points at which the values of P(λ) can be calculated in parallel. Out of the N_p + 1 intervals, a particular interval I_k = [λ_k, λ_k+1] is sought for which P(λ_k) P(λ_k+1) < 0. At the end of the phase of search for eigenvalues, each processor stores in its local memory all the eigenvalues that have converged to the eigenvalues of the original system.
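The interval-based search can be illustrated with the Sturm sequence property on the tridiagonal T_r: the number of negative pivots of T_r - λI equals the number of eigenvalues smaller than λ, so disjoint subintervals can be processed by different processors with no eigenvalue computed twice. The serial sketch below uses bisection instead of the regula falsi search of Reference 17 and does not reproduce the ANALZT bounds of Reference 20; the small matrix is hypothetical.

```python
import numpy as np

# Sturm-count bisection on a symmetric tridiagonal matrix T (diagonal a,
# off-diagonal b): count(lmbda) returns the number of eigenvalues of T smaller
# than lmbda, so any subinterval [lo, hi] can be searched independently by one
# processor. Serial illustration only.

def sturm_count(a, b, lmbda):
    count, d = 0, 1.0
    for i in range(len(a)):
        off = b[i - 1] ** 2 if i else 0.0
        d = (a[i] - lmbda) - off / (d if abs(d) > 1e-300 else 1e-300)
        count += d < 0.0
    return count

def bisect_eigenvalue(a, b, k, lo, hi, tol=1e-12):
    """Return the k-th (0-based, ascending) eigenvalue of T, assumed in [lo, hi]."""
    while hi - lo > tol * max(1.0, abs(lo) + abs(hi)):
        mid = 0.5 * (lo + hi)
        if sturm_count(a, b, mid) > k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    a = np.array([2.0, 3.0, 4.0, 5.0])          # hypothetical T_r
    b = np.array([1.0, 0.5, 0.25])
    lo, hi = 0.0, 7.0                           # bracket from Gershgorin discs
    eigs = [bisect_eigenvalue(a, b, k, lo, hi) for k in range(4)]
    T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
    assert np.allclose(eigs, np.linalg.eigvalsh(T))
```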


If n_c is the total number of these converged eigenvalues, then each processor is instructed to handle n_c/N_p of them and to compute their corresponding eigenvectors via Inverse Shifted Iteration. The evaluation of these eigenvectors is done in parallel without any interprocessor communication. The technicalities of the parallelization of phase 4 may be found in Reference 17.

8.2. Parallelizing the solution of the reduced system of dynamic equations of equilibrium

8.2.1. Systems with proportional damping: exact solution. The reduced system of equations expressed in geometric co-ordinates is transformed to a series of uncoupled generalized SDOF equations expressed in modal co-ordinates. In this section, the damping matrix is not formed explicitly; it is rather assumed that damping is proportional to M and K, so that a modal damping parameter ξ_i can be assigned to each generalized co-ordinate. The resulting uncoupled ith equation of motion in generalized Ritz co-ordinates is given by

\ddot{Y}_i + 2\xi_i \omega_i \dot{Y}_i + \omega_i^2 Y_i = X_{f,i}^T R g(t),   i = 1, 2, ..., r

where R g(t) is the product of a spatial vector and a time function modelling the loading of the structure. Given two initial conditions, an exact solution for Y_i can be carried out. After the Ritz vectors X_f have been generated, each processor p_j stores in its local memory the pairs (X_f,i(j), ω_i), where X_f,i(j) is the part of the ith vector X_f,i corresponding to the subdomain D_j onto which p_j is mapped. The right hand side of the above equation is an inner product that is evaluated concurrently as outlined in Section 8.1. At each time step t_n, each processor p_j first computes the exact solutions for the Y_i corresponding to the modes it is assigned. Then, p_j starts the process of concurrent evaluation of the final solution U = X_f Y = Σ_i X_f,i Y_i(t).17 After the displacements at t_n are computed, each processor p_j (which stores in its local memory all information relevant to D_j) concurrently evaluates the stresses in subdomain D_j before moving to time step t_n+1. Clearly, the system architecture is such that the concurrent dynamic analysis phase integrates well with the other phases of the overall concurrent analysis.
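The two communication patterns used in this phase, the binary-tree accumulation of an inner product over the processors and the concurrent superposition of the locally stored Ritz-vector slices, can be sketched as follows. The 'processors' are simply list entries here, and the data are hypothetical; on the iPSC the pairwise sums of the tree stage would be exchanged by messages.

```python
import numpy as np

# Sketch of the two patterns used in the dynamic phase: (1) an inner product
# accumulated over the processors in log2(Np) binary-tree stages, and (2) the
# superposition U = sum_i X_i Y_i(t) evaluated from locally stored vector slices.

def tree_dot(x_parts, y_parts):
    """Local dot products followed by a binary-tree accumulation."""
    partial = [xp @ yp for xp, yp in zip(x_parts, y_parts)]   # no communication
    while len(partial) > 1:                                   # log2(Np) stages
        if len(partial) % 2:
            partial.append(0.0)
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0]

def superpose(X_parts, Y):
    """Each processor forms its slice of U = X Y from its rows of the Ritz vectors."""
    return [Xp @ Y for Xp in X_parts]            # again fully concurrent

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    Np, n, r = 4, 16, 3
    X = rng.standard_normal((n, r)); R = rng.standard_normal(n)
    X_parts = np.array_split(X, Np); R_parts = np.array_split(R, Np)
    # modal load for mode i: X[:, i]^T R, assembled by the tree reduction
    modal_loads = [tree_dot([Xp[:, i] for Xp in X_parts], R_parts) for i in range(r)]
    assert np.allclose(modal_loads, X.T @ R)
    Y = rng.standard_normal(r)                   # stands in for the SDOF solutions Y_i(t)
    U_parts = superpose(X_parts, Y)
    assert np.allclose(np.concatenate(U_parts), X @ Y)
```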


8.2.2. Systems with non-proportional damping: step by step integration. The non-proportional damping matrix C is determined from the damping properties of the various components of the structure. Its concurrent formation and assembly are similar to those of K. Here we are concerned with a parallel solution of the equilibrium equation

M\ddot{U} + C\dot{U} + KU = F(s, t)

under the restriction that the normal undamped modes, or the mass and stiffness orthogonal WYD vectors, will not diagonalize the system. An efficient procedure consists in first reducing the system of equations using the WYD scheme described earlier, then applying a direct integration scheme such as the central difference method, the Houbolt method, the Wilson θ method or the Newmark method to solve the reduced system of equations. The formation of M*, C*, K* and F* involves only inner products, which are carried out in parallel without any difficulty. Generally, M*, C* and K* are almost full, so that a storage distribution among the N_p processors similar to the one of K_II is adequate: for each of these reduced matrices as well as for F*, the symmetric part of row i is stored in the local memory of processor p = mod(i - 1, N_p) (Figure 9). The Wilson θ scheme has been implemented in the concurrent finite element system. However, any of the other direct integration procedures mentioned above can be implemented in the same way: while the nature of the operations differs from one scheme to the other, the data storage strategy and load distribution among the N_p processors are unchanged. Box 5 (see Appendix V) describes the parallel implementation of the Wilson θ method.


Figure 9. Scattered row data structure for K*, M*, C*

K** is factorized in parallel using the algorithm described in Reference 7. The parallel forward and backward substitutions used here are also the ones described in Reference 7. N_r(j) denotes the total number of rows of K* (and hence of M* and C*) stored in the local memory of processor p_j.
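As an illustration of the direct integration step on the reduced r x r system, the sketch below implements the Newmark constant average acceleration method mentioned in Section 8.2.2; the Wilson θ scheme actually implemented in Box 5 follows the same pattern (one effective stiffness factorized once, then substitutions at every step) with different coefficients. The reduced matrices and load history are hypothetical dense arrays, and the serial np.linalg.solve stands in for the parallel factorization and substitutions of Reference 7.

```python
import numpy as np

# Newmark (constant average acceleration) integration of the reduced system
#   M* Ydd + C* Yd + K* Y = F*(t).
# Same structure as the Wilson theta implementation of Box 5 (one effective
# stiffness, then substitutions each step), with the Newmark coefficients.

def newmark(Mr, Cr, Kr, F_hist, dt, beta=0.25, gamma=0.5):
    r, nsteps = Mr.shape[0], F_hist.shape[1]
    a0, a1 = 1.0 / (beta * dt * dt), gamma / (beta * dt)
    a2, a3 = 1.0 / (beta * dt), 1.0 / (2.0 * beta) - 1.0
    a4, a5 = gamma / beta - 1.0, dt / 2.0 * (gamma / beta - 2.0)
    Keff = Kr + a0 * Mr + a1 * Cr                     # formed (and factorized) once
    Y = np.zeros(r); Yd = np.zeros(r)
    Ydd = np.linalg.solve(Mr, F_hist[:, 0] - Cr @ Yd - Kr @ Y)
    history = [Y.copy()]
    for k in range(1, nsteps):
        rhs = (F_hist[:, k]
               + Mr @ (a0 * Y + a2 * Yd + a3 * Ydd)
               + Cr @ (a1 * Y + a4 * Yd + a5 * Ydd))
        Y_new = np.linalg.solve(Keff, rhs)            # forward/backward substitutions
        Ydd_new = a0 * (Y_new - Y) - a2 * Yd - a3 * Ydd
        Yd_new = Yd + dt * ((1.0 - gamma) * Ydd + gamma * Ydd_new)
        Y, Yd, Ydd = Y_new, Yd_new, Ydd_new
        history.append(Y.copy())
    return np.array(history)

if __name__ == "__main__":
    r = 3
    Mr, Cr, Kr = np.eye(r), 0.1 * np.eye(r), np.diag([4.0, 9.0, 16.0])
    F_hist = np.tile(np.array([1.0, 0.0, 0.5])[:, None], (1, 200))   # step load
    Y_hist = newmark(Mr, Cr, Kr, F_hist, dt=0.05)
```

With beta = 1/4 and gamma = 1/2 the scheme is unconditionally stable for linear systems, which is why a single factorization of the effective stiffness suffices for the whole response history.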

8.3. A concurrent subspace iteration

The Ritz type load dependent solution strategy described above and implemented in a form suitable for MIMD machines applies to the restricted class of dynamic response problems where the load pattern varies only in amplitude. For the solution of more general types of dynamic problem, the method of subspace iteration is more appropriate and has been used in several general purpose structural analysis programs. Clearly, the computations involved in this method (factorization of K, solution of triangular systems, inner products, ...) are similar to those of the WYD technique. Hence, they can be 'parallelized' in the same manner by taking advantage of the mode of data storage implied by our program architecture. The only difference with the previous strategy is in the form of the reduced eigenvalue problem, which is no longer tridiagonal. A parallel Jacobi procedure can be applied to solve it. Basically, the set of rotations associated with a single sweep is partitioned into rotation subsets so that the rotations in a given rotation subset can be carried out concurrently.

9. TREATMENT OF A PRIORI NON-LINEAR PROBLEMS

In many material non-linear problems the non-linearity is localized, and separating the linear and non-linear portions of a structure in the analysis may be advantageous. Here, it becomes essential to distinguish between a substructure and a subdomain. By subdomain, we mean a collection of elements of the global domain that is assigned to a corresponding processor. A subdomain may hence represent anything from a part of a physical substructure to several substructures. Let D be a given arbitrary finite element domain, and let D^(n) be the known subset of D where non-linearities are localized. Note that with these definitions, D^(n) may be a set of regions separated by linear substructures, and hence does not have to be a single physical substructure. We shall assume that D^(n) has been somehow marked during the mesh generation phase, so that all its elements can be identified at any time by the software. In our program, these elements are marked with the material type number. We define also D^(l) as being the collection of


[Figure 10. Linear-non-linear splitting of the domain (regions of Material 1 and Material 2).]

elements of D which are known to undergo only linear deformations. Hence, we can write

D = D^(l) ∪ D^(n);   D^(l) ∩ D^(n) = ∅   (*)

(Figure 10)

Note that relation (*) has to be considered element-wise and not node-wise. The automatic subdivision algorithm is extended to perform two passes on D. In the first pass, the marked elements are skipped and D^(l) is split into N_p domains. The second pass decomposes D^(n). As a result, D is subdivided into two sets of N_p subdomains each, namely (1) linear ones: D_j^(l), j = 1, 2, ..., N_p; and (2) non-linear ones: D_k^(n), k = N_p + 1, N_p + 2, ..., 2N_p. Renumbering is done as explained in Section 5, the linear subdomains being treated first. Each processor p_j is assigned the pair of subdomains D_j^(l), D_(j+Np)^(n), so that load balancing is achieved. The resulting pattern of the stiffness matrix for the complete domain D is shown with the processor assignment in Figure 11.


Figure 11. Data structure and processor assignment for linear-non-linear problems

K_ll and K_nn are respectively the linear and non-linear stiffnesses associated with D^(l) and D^(n); K_ln is the coupling stiffness between these two subsets of D. The submatrices appearing in each of K_ll and K_nn have the same physical meaning as those of Section 4, the first relative to subset D^(l), the others relative to D^(n). K_ln(j) is the coupling stiffness between linear subdomain D_j^(l) and non-linear subdomain D_j^(n). K_ln^(I) is the coupling stiffness between the linear interface D_I^(l) binding all D^(l) subdomains and the non-linear interface D_I^(n) binding all D^(n) subdomains.

9.1. Formation of the stiffness matrices: data structure

After D is split, the host feeds each processor with all information relative to its assigned linear and non-linear subdomains (static, dynamic, linear and non-linear element properties). K_ll and K_nn are formed and assembled concurrently. Basically, each processor p_j is assigned the submatrices K_ss^(l)(j), K_si^(l)(j), K_ss^(n)(j), K_si^(n)(j) and rows r_j of K_II^(l) and K_II^(n), where r_j is such that p_j = mod(r_j - 1, N_p). While K_ll's storage is identical to K's storage for linear analysis, K_nn's storage is similar to K_II's. The reason is that K_nn will have its arrow pattern destroyed after the linear components of the structure are condensed out. Hence, while processor p_j assembles the blocks K_ss^(n)(j), K_si^(n)(j), it sends each row i to processor p = mod(i - 1, N_p), if p_j ≠ p. K_ln is formed and assembled concurrently by the N_p processors, processor p_j handling its assigned blocks K_ln(q), q = 1 + N_p, ..., 2N_p, together with K_ln^(I). It is stored in sparse compacted form, using linked lists. At each step of the non-linear analysis, each processor updates concurrently its assigned parts of each of K_nn and K_ln, without any communication with the other processors.

(1) The density of the block matrix K_ln varies with the geometrical position of the non-linear subdomains within the total structure. In some cases, one, two or several sub-blocks K_ln(q) may be identically null. However, this does not affect significantly the overall load balancing and


even distribution of data storage, because each processor is still handling an almost equal number of elements. (2) Only the non-zero values of K_ln are stored, so that an eventual worst case sparsity resulting from the domain decomposition algorithm, together with the geometrical position of the localized non-linearities in the total structure, does not affect the cost of storage and the computational requirements.

9.2. Solution algorithm

The overall solution algorithm is depicted in box 6 (see Appendix VI). F_l, U_l, F_n and U_n denote respectively the linear and non-linear parts of the force and displacement arrays. K_ll is factorized once. At each step of the analysis, K_nn and K_ln are re-updated, so that the triangular systems in K_ll T_l = K_ln have to be re-solved and the multiplications K_ln^T T_l and K_ln^T T_f have to be re-done. Consequently, the reduced system K*_nn U_n = F*_n is also re-updated and re-solved at each iteration. The following details the operations done by each of the N_p processors. If A is an array, A(j) denotes the part of A that is stored:

(1) in processor p_j (or p_(j-Np)), if j ≠ N_p + 1 and j ≠ 2N_p + 1; (2) among all the N_p processors, row i in processor mod(i - 1, N_p), if j = N_p + 1 or j = 2N_p + 1.

In the initial calculation phase, processor p_j forms and assembles its assigned loading pair (F_l(j), F_n(j)), then factorizes K_ss^(l)(j). The forward and backward substitutions of algorithm CP, as described in Section 6, are used to carry out at each step the solution of the triangular systems in K_ll T_l = K_ln. This is possible because this system has exactly the same data structure as the system KU = F. The construction of K*_nn(j) is a more complex operation. The term [K_ln^T T_l](j) is obtained as a summation of block products distributed over the processors.

The mode of storage implies that each processor performs concurrently each of the products {[K_ln^T T_l](q), q = N_p + 1, ..., 2N_p + 1} and sends it to processor # q - N_p (except for the case q = 2N_p + 1, where the congruence mapping is used). The product K_ln^T T_f is parallelized in the same way. Since K*_nn is given the same data structure as K_nn, the parallel equation solver described in Reference 7 is applied to obtain the U_n responses.
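A serial, dense sketch of one pass of this condensation is shown below: K_ll is factorized once, the current K_ln and K_nn are condensed onto the non-linear unknowns, the reduced system is solved for U_n, and the linear unknowns are recovered by back-substitution. Dense Cholesky solves stand in for the distributed profile factorizations of box 6, and the test data are randomly generated.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Serial sketch of the linear/non-linear condensation step of Section 9.2:
# K_ll is factorized once; at each iteration the current K_nn and K_ln are
# condensed onto the non-linear unknowns, the reduced system is solved for U_n,
# and U_l is recovered by back-substitution.

def condensation_step(Kll_fact, Kln, Knn, Fl, Fn):
    Tl = cho_solve(Kll_fact, Kln)          # triangular solves: K_ll T_l = K_ln
    Tf = cho_solve(Kll_fact, Fl)           # and                K_ll T_f = F_l
    Knn_star = Knn - Kln.T @ Tl            # reduced (condensed) non-linear stiffness
    Fn_star = Fn - Kln.T @ Tf
    Un = np.linalg.solve(Knn_star, Fn_star)
    Ul = Tf - Tl @ Un                      # back-substitution for the linear unknowns
    return Ul, Un

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    nl, nn = 6, 3
    A = rng.standard_normal((nl + nn, nl + nn)); K = A @ A.T + (nl + nn) * np.eye(nl + nn)
    Kll, Kln, Knn = K[:nl, :nl], K[:nl, nl:], K[nl:, nl:]
    Fl, Fn = rng.standard_normal(nl), rng.standard_normal(nn)
    Kll_fact = cho_factor(Kll)             # done once, outside the non-linear loop
    Ul, Un = condensation_step(Kll_fact, Kln, Knn, Fl, Fn)
    assert np.allclose(K @ np.concatenate([Ul, Un]), np.concatenate([Fl, Fn]))
```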


9.3. State determination

After the system K*_nn U_n = F*_n has been solved, the components of U_n which are not available in the local memory of processor p_j are sent to it by the other processors. Hence, after receipt of the missing non-linear displacements, each processor concurrently back substitutes for its assigned linear displacements by solving the triangular systems in

K_ll U_l = F_l - K_ln U_n

Once processor p_j has evaluated the displacements associated with its assigned linear and non-linear subdomains, it computes the stresses in D_j^(l) and D_j^(n), checks for convergence and notifies the host about the result. These operations are all done without any interprocessor communication. The host collects all local convergence flags and issues a global flag of convergence if all processors have 'converged'. Otherwise, the process is repeated.
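The convergence handshake amounts to a logical AND over the local flags. The sketch below shows the host-side gather and an equivalent pairwise (binary-tree) combination, mirroring the log2(N_p) accumulation used for the dot products of Section 8.1; message passing is mimicked with plain Python lists.

```python
# Sketch of the convergence handshake: each processor tests its own subdomains
# and the host issues a global flag only if every local flag is True.

def host_converged(local_flags):
    """Host-side gather: global convergence = logical AND of all local flags."""
    return all(local_flags)

def tree_converged(local_flags):
    """Equivalent pairwise (binary-tree) combination of the flags."""
    flags = list(local_flags)
    while len(flags) > 1:
        if len(flags) % 2:
            flags.append(True)
        flags = [flags[i] and flags[i + 1] for i in range(0, len(flags), 2)]
    return flags[0]

if __name__ == "__main__":
    assert host_converged([True] * 8) and not tree_converged([True, False, True, True])
```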


10. PERFORMANCE OF THE PROPOSED CONCURRENT COMPUTER PROGRAM

In order to demonstrate the validity and assess the efficiency of the ideas developed and summarized in this paper, a sample concurrent computer program was written and implemented on Intel's iPSC (Intel Personal SuperComputer), a hypercube multiprocessor.

10.1. The iPSC system

Intel's iPSC is a 32, 64 or 128 processor parallel computer. Each processor is an Intel 80286 CPU* with an Intel 80287 floating point accelerator, and has 512 Kbytes of local (non-shared) memory. The interconnection network is a hypercube (or n-dimensional cube, n = 5, 6, 7). Approximately 275 Kbytes of the 512 Kbytes of RAM on each processor are available to the user's application, the rest being consumed by the operating system (which is resident on each of the processors). The latter provides a simple message passing interface between the nodes (processors). The built-in timer in the multicomputer records the running time (computation + communication) of each processor. The parallel computation time is the time it takes for the slowest processor to complete its task. The speed-up is defined as S_Np = T_1/T_Np, where T_1 is the time elapsed when using only one processor for solving the problem, and T_Np the parallel time when assigning N_p processors. The efficiency is the speed-up per processor: E = S_Np/N_p.
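These definitions translate directly into the following bookkeeping (the timings shown are made up):

```python
# Speed-up and efficiency as defined in the text: S_Np = T_1 / T_Np and
# E = S_Np / N_p, with T_Np the time of the slowest of the N_p processors.

def speedup_efficiency(t_serial, per_processor_times):
    n_p = len(per_processor_times)
    t_parallel = max(per_processor_times)      # slowest processor sets the parallel time
    s = t_serial / t_parallel
    return s, s / n_p

if __name__ == "__main__":
    s, e = speedup_efficiency(100.0, [3.1, 3.4, 3.2, 3.5])   # hypothetical timings
    print(f"speed-up = {s:.1f}, efficiency = {e:.0%}")
```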

(1) The communication overhead is concentrated in the parallel equation solver module. If n characterizes the dimension of a subdomain D,(for example its width), the size of K,, grows as Table I. Performance of the (CP) algorithm on a 32-processor hypercube

# of d.0.f.

TKt/Tt,,

1700

78%

2550 4000

70% 397;

Speed up (efficiency)

19.2(hO:/,) 24.0(75%) 28.9(90%)

* A new enhanced member of the iPSC family is the iPSC-VX, where a vector processor is attached to each of the CPUs. However, we had no access to this vector extension of the iPSC at the time this work was developed.


Table II. Performance of the iterative algorithm on a 32-processor hypercube

  # of d.o.f.    Speed-up (efficiency)
  1000           17.7 (55%)
  2000           21.8 (68%)
  4000           24.0 (75%)
  8000           26 (83%)
  16000
Lihat lebih banyak...

Comentários

Copyright © 2017 DADOSPDF Inc.