Somos uma comunidade de intercâmbio. Por favor, ajude-nos com a subida ** 1 ** um novo documento ou um que queremos baixar:

OU DOWNLOAD IMEDIATAMENTE

International Journal of Computer Mathematics Vol. 89, No. 15, October 2012, 2047–2060

A novel parallel approach for 3D seismological problems Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

Santa Agrestea *, Arrigo Casertab , Angela Ricciardelloa and Vittorio Ruggieroc a Department of Mathematics, Messina University, V.le F. Stagno D’Alcontres, 31 98166 Messina, Italy; b INGV, Via di Vigna Murata 605, 00143 Rome, Italy; c CASPUR, Via dei Tizii 6, 00185 Rome, Italy

(Received 27 July 2011; revised version received 24 October 2011; second revision received 12 December 2011; third revision received 2 February 2012; fourth revision received 5 April 2012; accepted 28 May 2012) The large spatial scale associated with the modelling of strong ground motion in three dimensions requires enormous computational resources. For this reason, the simulation of soil shaking requires highperformance computing. The aim of this work is to present a new parallel approach for these kind of problems based on domain decomposition technique. The main idea is to subdivide the original problem into local ones. It allows to investigate large-scale problems that cannot be solved by a serial code. The performance of our parallel algorithm has been examined analysing computational times, speed-up and efficiency. Results of this approach are shown and discussed. Keywords: seismic wave propagation; domain decomposition; finite element method; parallel computing 2010 AMS Subject Classifications: 65Y05; 68W10; 65M55; 86A15

1.

Introduction

During the last 20 years, the interaction between near-surface geological heterogeneities and seismic wave field has been widely studied and a considerable amount of work has been done as far as numerical models of the soil shaking is concerned [3,4]. The large spatial scale of this problem, associated with the complexity of modelling strong ground motion in a three-dimensional (3D) domain, suggests adopting high-performance computing. The use of parallel computing with optimization techniques allows us to make more realistic numerical simulations of the soil shaking features. This cannot be done by using serial computing because gigabytes of memory, performance at gigaflops rate, days of computational time to simulate a minute of soil shaking, to mention a few, are needed. In the literature, most of the authors suggest converting serial codes into parallel one in order to obtain more powerful tools [2,6,15]. On the contrary, in this work, we present an algorithmic approach developed directly as a parallel one, where spatial discretization is based on finite element method (FEM) on non-structured meshes, while Newmark method has been adopted for time integration [7,14]. The main idea of our method is to split the original problem into subproblems with a smaller size, namely local problems. Then, domain decomposition techniques *Corresponding author. Email: [email protected]

ISSN 0020-7160 print/ISSN 1029-0265 online © 2012 Taylor & Francis http://dx.doi.org/10.1080/00207160.2012.700052 http://www.tandfonline.com

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2048

S. Agreste et al.

have been applied, paying attention to both the load balancing problem and nodes shared by multiple processors problems. In particular, the latter are the so-called overlapping nodes. Each processor solves a local problem, and the global solution is obtained combining the local ones at each time step. The program, implemented by Message Passing Interface (MPI) library, runs on the distributed memory cluster named MATRIX at CASPUR computer centre.1 In the algorithm implementation, a significant support is given by two efficient tools available in the literature: LaGrit [13] and MeTiS [12]. The former is to construct the unstructured mesh and the latter is to fix subdomains once the mesh has been already generated. In order to estimate how work load is well distributed, according to the domain decomposition, we have introduced two new parameters, closing nodes and node balancing [1]. Results of a wide experimentation highlight the good quality of load balancing. Moreover, we tried out the performance of our algorithm through the time scheduling, the speed-up and the efficiency. Summarizing, we developed a code that is portable, readable and computationally efficient.

2.

Seismic model

We are interested in numerically modelling and simulating the 3D interaction between the seismic wave field and the near-surface geological structures. This is because such interaction is the main cause of the damage pattern observed during earthquakes. The aforementioned near-surface geological structures can be seen as an elastic homogeneous medium. In this section, we define an initial boundary value problem (IBVP in the following) for such seismological problem. Let R3 be the 3D real space and x = (x, y, z) ∈ R3 a generic vector; let f (x, y) be a regular function in R2 so that a heterogeneous viscoelastic medium ! can be mathematically defined as follows: ! = {(x, y, z) ∈ R3 : z > f (x, y), (x, y) ∈ R2 } whose boundary ∂! = {(x, y, z) ∈ R3 : z = f (x, y), (x, y) ∈ R2 } describes the so-called free surface (the topography of the site under study in seismology) of !. Then, the elastic wave propagation problem through ! is described by the following: ∂ 2u = ∇ · ((λ(x) + 2µ(x))∇u) − ∇ · (µ(x)∇u) + F(x, t) ∀(x, t) ∈ ! × (0, +∞), ∂t 2 u(x, 0) = 0 ∀x ∈ !,

ρ(x)

∂ u(x, 0) = 0 ∂t ∂ σ(x, t) = 0 ∂n

∀x ∈ !,

(1)

∀(x, t) ∈ ∂! × (0, +∞)

where u is the displacement field, the functions ρ(x) (mass density), λ(x) and µ(x) (the Lamé constants) characterize the elastic properties of the medium and F(x, t) is the sum of external forces (the seismic source in seismology). The ∇ · ((λ(x) + 2µ(x))∇u) − ∇ · (µ(x)∇u) represents the inner forces that are the result of strain (ϵ(x) = 21 (∇u + ∇uT )) and stress (σ(x, t) = λ(x)∇ · uI + 2µ(x)ϵ(u)) as it results from Hooke’s law. The afore-described model is a simplification of the one presented in [5]. The system (1) has been numerically integrated by means of FEM on non-structured meshes.

International Journal of Computer Mathematics

2049

The mesh step dh depends on both the maximum frequency fmax of the frequency band belonging to the incoming wave-field and the minimum wave velocity vmin in ! as follows

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

dh =

vmin , nfmax

where n is the number of sampling per wavelength needed to cover the minimum frequency for a non-dispersive propagation [24]. In order to minimize the reflections at the boundary of the model, we introduced the so-called absorbing boundary conditions [8–11,23]. Because our aim is to numerically simulate an infinity half-space, such conditions are designed to avoid spurious reflections from the boundaries of ! that have any physical meaning. In this work, we assumed a generalization of Caserta–Firmani–Ruggiero transparent conditions [19] described in the following. The absorbing boundary conditions we adopted are based on a paraxial approximation. As a consequence, their efficiency depends on the angle of incidence of the incoming wave field on the ‘transparent boundary’. To be more specific, rewriting our IBVP (1) in its general form u¨ −

3 !

h,k=1

E hk u,hk = f,

(2)

the differential operator E in Equation (2) can be written by means of the following matrices: ⎞ ⎛ 2 ⎛ ⎞ vP 0 0 0 vP2 − vS2 0 0 0⎠ , E 11 = ⎝ 0 vS2 0 ⎠ , E 12 = ⎝vP2 − vS2 2 0 0 vS 0 0 0 ⎛ 2 ⎞ ⎛ ⎞ 2 vS 0 0 0 0 vP − vS2 0 0 ⎠, E 22 = ⎝ 0 vP2 0 ⎠ , E 13 = ⎝ 0 0 0 vS2 vP2 − vS2 0 0 ⎛ 2 ⎞ ⎛ ⎞ vS 0 0 0 0 0 0 vP2 − vS2 ⎠ , E 33 = ⎝ 0 vS2 0 ⎠ , E 23 = ⎝0 2 2 2 0 0 vP 0 vP − v S 0 √ where vP = √ (λ + 2µ)/ρ is the velocity of the longitudinal elastic waves (P-wave in seismology) and vS = µ/ρ is the velocity of the shear elastic waves (S-wave in seismology). The operator E in Equation (2) can be approximated by three matrices, A, B and C as follows: & ' ∂ ∂ ∂ 2 A +B +C ≃ E. ∂x ∂y ∂z In order to mathematically define the transparent boundary conditions, at the boundary we can chose A, B and C as follows: E 11 ≃ A2 ,

E 22 ≃ B2 ,

E 12 ≃ AB + BA,

E 23 ≃ BC + CB,

E 13 ≃ AC + CA, E 33 ≃ C 2 .

With the previous choice, the weak formulation with the Galerkin method [17] of the IBVP (2) is d2 d U + M1 U + M0 U = f, (3) dt 2 dt where M2 , M1 and M0 are sparse square matrices whose order is three times the number of nodes. Their entries are the integral on ! of a combination of the basis functions N(x) or their derivatives M2

2050

S. Agreste et al.

(see [16,18] for details). As far as the time discretization is concerned, we applied the Newmark method with a time step equal to δt ≤ dh/vmax in order to preserve the causality principle so that the discretized form of Equation (3) becomes n+1

¨ AU

= bn+1

˙ n+1 = U ¨ n+1 ) ˙ n + δt((1 − γ2 )U ¨ n + γ2 U U

(4)

2

with δt 2 A = M2 + γ2 δtM1 + γ1 M0 , 2 ( ) δt 2 n+1 n+1 ¨ n − (M1 + δtM0 )U ˙ n − M 0 Un . b =f − (1 − γ2 )M1 + (1 − γ1 )M0 U 2 A is a sparse diagonally dominant matrix almost everywhere and its pattern is symmetric with entries like 3 × 3 blocks. For the sake of readability and clarity, we simulate the propagation of an elastic wave field in a homogeneous domain: a cube with L = 1000 m. The spatial component of the forcing term is located in a point that is in the middle of the domain so that it is represented by a delta function δ(x, y, z). On the contrary, the temporal component of the forcing term is represented by the so-called Gabor impulse, G(t) (Figure 1). Its mathematical form is * & ' + ωp (t − ts ) 2 G(t) = exp − cos[ωp (t − ts ) + ψ], (5) γ 0.2

0.15

0.1

0.05 Amplitude

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

˙ n + δt ((1 − γ1 )U ¨ n + γ1 U ¨ n+1 ) Un+1 = Un + δt U 2

0

−0.05

−0.1

−0.15

−0.2

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 1. The Gabor impulse with parameters fp = 0.45, ψ = (π/2), γ = 0.066, ωp = 2πfp and ts = 0.45(γ /fp ).

2051

International Journal of Computer Mathematics −3

x 10

2

1.5

1

Acceleration

0

−0.5

−1

−1.5

−2

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 2. The acceleration trend recorded in the source point P(500 500 500), in a homogeneous domain, that is a cube with L = 1000 m, vP = 2100 m/s, vS = 1200 m/s, fmax = 12 Hz and Nnode = 64,000.

−5

6

x 10

4

2

0

Velocity

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

0.5

−2

−4

−6

−8

−10

−12

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 3. The velocity trend recorded in the source point P(500 500 500), in a homogeneous domain, that is a cube with L = 1000 m, vP = 2100 m/s, vS = 1200 m/s, fmax = 12 Hz and Nnode = 64,000.

2052

S. Agreste et al. −6

8

x 10

6

4

Displacement

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2

0

−2

−4

−6

−8

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 4. The displacement trend recorded in the source point P(500 500 500), in a homogeneous domain, that is a cube with L = 1000 m, vP = 2100 m/s, vS = 1200 m/s, fmax = 12 Hz and Nnode = 64,000.

where fp is the predominant frequency of the impulse, ωp = 2πfp , ts = 0.45γ /fp , and γ and ψ are the parameters that must be chosen according to the width of the frequency band of the incoming wave field. In other words, our forcing term is not a monochromatic wave field but is made by a set of waves belonging to a frequency band. We depict the evolution of acceleration in Figure 2, of velocity in Figure 3 and of displacement in Figure 4, evaluated in the source point P. In this case, the P-wave velocity is vP = 2100 m/s, the S-wave velocity is vS = 1200 m/s, the maximum frequency is fmax = 12 Hz and the number of grid nodes Nnode = 64,000; with these conditions, the minimum edge is edgemin = 25.00 m, whereas the time step is δt = 4.03e − 004 s and the simulated seconds are 1.6 s.

3.

Parallel algorithm

In developing the parallel code, we have adopted two open-source software packages, LaGrit [13] and MeTiS [12]. In detail, the discretization of the physical domain has been obtained using LaGrit, whose mesh generation uses a Delaunay tetrahedralization algorithm [21,22]. LaGrit is a library realized by Los Alamos National Security, LLC at Los Alamos National Laboratory (LANL) with the US Department of Energy (DOE)2 . In order to distribute the domain among processors, we need to partition the mesh, and for that purpose, we used MeTiS. Its advantage is to have a partitioning algorithm able to minimize the connections among processors by minimizing the edge cut. The parallel algorithm has been implemented by means of MPI library, on the CASPUR cluster, MATRIX.

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

International Journal of Computer Mathematics

2053

Figure 5. MeTiS distribution of 91,125 nodes and 453,655 elements among 10 subdomains. Black points represent the overlapping nodes.

3.1

Domain decomposition

To the purpose of realizing a parallel algorithm, a suitable analysis of mesh partitioning and data parcelling has been achieved. 3.1.1 Mesh partitioning Once LaGrit has generated the mesh, MeTiS takes care in partitioning into subdomains the set of the mesh elements. In the case study presented in the previous section, the numerical grid obtained by LaGrit has 91,125 nodes connected by 453,655 tetrahedra. The Delaunay tetrahedralization algorithm has been applied to our domain. The result of the MeTiS partition with 10 subdomains is shown in Figure 5. Even though that the sets of tetrahedra and nodes provided by MeTiS are disjoint, the subdomains share some nodes anyway. Such nodes are called overlapping nodes. It is crucial to be able to identify them, that is why we implemented an ad hoc procedure. In Figure 5, the overlapping nodes are pointed out by black points. The mesh partitioning is designed in order to both minimize the communication time and care about the load balancing. 3.1.2

Parcelling

Once subdomains are realized, the next step is to distribute such subdomains among processors and this is made by one-to-one mapping between subdomains and processors. In such a framework, it is crucial to be able to identify the coupling between each node and its associated processor. This is realized by generating a global-to-local map [2,20]. When subdomains are distributed on np processors, each subdomain can generate its local mass matrix, A(j) , j = 1, . . . , np, that is sparse, with a symmetric pattern. In fact, the symmetric entries correspond to inner nodes and others correspond to nodes lying on the numerical faces, called boundary nodes. For this reason, we tidy up the inner nodes and then, in queue, the boundary ones. As a consequence, we can store only less than half of the entries of the mass matrices, by the Compact Sparse Row (CSR) format optimized for such kind of matrix, that is, sparse, with a symmetric pattern and suitably reordered matrices.

S. Agreste et al.

The assemblage of local mass matrices is achieved in parallel without communications among processors. The advantage of this approach is that the global mass matrix will be never composed in its explicit form. Some results on the load balancing associated with mesh partitioning are presented in [1], where the authors introduce two new parameters, namely closing nodes β and node balancing α. The former is linked with communications among processors that are evaluated in terms of the overlapping nodes of the parameter called closing nodes. The latter gives information on the data-load distribution. The index β represents the ratio between the total number of overlapping nodes NO and the total number of nodes N. By means of this parameter, we can estimate the growth of the matrix size caused by the overlapping nodes. The dimension of the local mass matrix is Nnode loc = (N/np)(1 + β), where np represents the number of processors. In Figure 6, we present the trend of the parameter β versus np, from 4 to 48. In particular, the cases N = 42,875; 54,872; 64,000; 74,088; 91,125 are shown. The number of overlapping nodes increases with the number of processors. However, if N raises then NO is growing steady. By the way of explanation, we set np = 40. If N = 42,875, then the number of overlapping nodes, with their multiplicity, weights upon number of nodes nearly 35% of N. Otherwise, if N = 91,125, then it weights upon nearly 26%. Node balancing α is the ratio between the sum of nodes of each processor Nt and the total number of nodes times the number of processors, N × np. To the purpose of investigating α, we considered N and np as above. In Figure 7, we present the trend of the parameter α. Then, increasing the number of nodes, α approaches to the ideal case (α = 1/np) represented by red star points. It means that work load is well distributed among processors.

0.4 42875 NODES 54872 NODES 64000 NODES 74088 NODES 91125 NODES

0.35

0.3

0.25

β

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2054

0.2

0.15

0.1

0.05

0

5

10

15

20

25

30

35

40

np

Figure 6.

Closing nodes versus number of processors. In the legend is the number of total nodes.

45

50

2055

International Journal of Computer Mathematics 0.35

42875 NODES 54872 NODES 64000 NODES 74088 NODES 91125 NODES

0.3

0.2

α

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

0.25

0.15

0.1

0.05

0

Figure 7.

3.2

0

5

10

15

20

25 np

30

35

40

45

50

Node balancing versus number of processors.

Solver phase

Once local mass matrices are stored, each processor solves its own linear system, that is, Equation (4) reduced to its own subdomain, at each time step n, that is, A(j) U¨ (j) n = bn ,

j = 1, . . . , np.

(6)

To the purpose of increasing the Newmark stability, we adopted γ1 = 41 and γ2 = 21 [7,14]. The obtained mass matrices are sparse, diagonally dominant a.e. and with a symmetric pattern. For this reason, the solver is based on the Gauss–Seidel method, implemented for the optimized CSR matrix storage [16]. Solving the system (6) by the Gauss–Seidel method, at time n, means finding the vector of the sequence indexed by k , , b(j) (i) − li A(j) (i, l)U¨ k−1 (l − 1) U¨ (j)k (i) = A(j) (i, i) ∀j = 1, . . . , np

such that

or rather for fixed tolerances tol1 and tol2 .

and

∀i = 1, . . . , Nnode loc

∥U¨ (j)k − U¨ (j)k−1 ∥ < tol1 ∥U¨ (j)k ∥ ∥A(j) U¨ k − b(j) ∥ < tol2

2056

S. Agreste et al.

Therefore, at each time step n, we have np local solutions U¨ (j) that are Nnode loc size vectors. The local solutions are assembled with the aim of reconstructing the global one by means of the global-to-local map, that is, map ¨ (j) U¨ (j) (ilocal ) U (iglobal ) * −→

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

∀j = 1, . . . , np

and

∀ilocal = 1, . . . , Nnode loc .

Thus, we obtain np vectors with size Nnode , whose non-zeros entries are only the corresponding U¨ (j) (ilocal ) elements, with ilocal = 1, . . . , Nnode loc . Note that on the overlapping nodes, we have as local solutions as their multiplicity, that is, the number of subdomains sharing them. For this reason, we applied a weighted mean, where the weights r (j) (i) are proportional to the node multiplicity in each subdomains. ¨ U(i) =

np ! j=1

U¨ (j) (i)r (j) (i) ∀i = 1, . . . , Nnode .

Because the local convergence does not ensure the overall one, there is a need to carry out the control of global norm at each time step n, that is, - n -∥U¨ ∥ − max ∥U¨ (j)n ∥- < tol j

with tol formerly fixed. It has been numerically proved that our choice of time step and space step guarantees both local and global convergence for the system (6). To summarize, at each time step, processors swap the results on overlapping nodes to obtain the weighted solution. Its norm is computed once the global solution has been reconstructed. All processors update their solution if the norm is less than a fixed tolerance.

4.

Numerical results

In this section, we present the performance of our algorithm. Numerical tests have been realized on a distributed memory cluster called MATRIX at CASPUR computer centre. It consists of 320 nodes, divided into front-end nodes, free interactive nodes and number-crunching nodes. All of them are two-way quad-core Opteron 2.1 GHz with 16 GB of RAM, except for the numbercrunching nodes which are equipped with 32 GB of RAM. MPI library has been adopted because it accommodates the partitioning of the problem. In order to evaluate the behaviour of our parallelization, we considered the computational times, the speed-up and the efficiency parameters. Computational times. In order to evaluate the performance of our parallel algorithm, we measured and analysed the runtime phase by means of a software tool, namely Scalasca [25]. In total, 10,000 time iterations have been executed with 42,875 nodes and 91,125 nodes by 8, 24, 48 cores. In Table 1, we report the percentage of the time spent in each simulation and distinguish three different categories. Type MPI refers to function calls to the MPI library. User program routines on paths that directly or indirectly call MPI provide valuable context for understanding the communication and synchronization behaviour of the parallel execution and are distinguished with the COM type from other routines that are involved with purely local computation marked USR. In Table 1, we can note that the MPI time increases with a great number of cores. In particular, the MPI time exceeds the 50% of the total time using 48 cores with N = 42,875. An improvement

2057

International Journal of Computer Mathematics Table 1. The percentage of time spent during the simulations when the mesh has 42,875 and 91,125 nodes. Category

8 PROC

24 PROC

48 PROC

42,875

MPI COM USR

4.97 0.77 94.26

20.23 0.82 78.92

51.13 0.80 47.26

91,125

MPI COM USR

6.36 1.04 92.60

13.38 0.90 85.72

32.90 0.71 66.07

of parallel performance is obtained by increasing the number of nodes. This slightly hinder the parallel scalability of the algorithm as widely discussed afterwards. Speed-up. S(n) is a useful index to check any gain in the performance with respect to the serial code. It is defined as the ratio between real time employed by one processor to run the code, T (1), and the real time needed by n processors, T (n), namely S(n) =

T (1) . T (n)

Note that the time T (1) is around 10,120 s with 42,875 nodes and is around 23,626 s with 91,125 nodes. For our modelling, the speed-up scaling is satisfactory up to 40 processors (Figure 8). Beyond this threshold, the saturation process takes place. This means that the communication time is near Speed up 50 Ideal case 233512 elem 300940 elem 345445 elem 413526 elem 453655 elem

45 40 35 30 25

Sp

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

Node number

20 15 10 5 0

Figure 8.

0

5

Speed-up.

10

15

20 25 30 Number of processors

35

40

45

50

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2058

S. Agreste et al.

Figure 9.

Efficiency.

to the computational time needed by each processor to execute its work load. In other words, it means that we lost in communications the time gained increasing the number of processors. This because the speed-up depends both on the number of processor and on the domain decomposition. Efficiency. E(n) is defined by the ratio between the speed-up and the number of processors, namely S(n) E(n) = . n It is a measure of the relevance of the relative utilization of the resources of a parallel computer. The efficiency trend in Figure 9 shows that in our case study 30% of the running time is wasted only in communication and synchronization. During the rest of the time, processors run in parallel. At each time step, the processors swap information on overlapping nodes and this step is fundamental to synchronize the local solutions for obtaining the global one. Analysing results, it is clear that the speed-up saturation starts when the number of processors is near to 40. It means that we lost in communication the time gained increasing the number of processors. Then, the speed-up depends both on the number of processors and on the domain decomposition.

5.

Conclusion

In this work, we studied an IBVP describing the dynamics of interactions between seismic wave field and near-surface geological structures. A novel approach has been presented to solve this problem in parallel. By means of the domain decomposition technique, the original problem has been split into local problems. This subdivision allowed us to investigate those large-scale

International Journal of Computer Mathematics

2059

problems that a serial code is not able to solve. Closing nodes and node balancing have been introduced and analysed in order to stress the performance of our domain decomposition scheme. Speed-up and efficiency support and confirm such performance. Our numerical code is portable, readable and computationally efficient, which is designed and tuned for, but not limited to, seismological applications able to simulate a 3D seismological problem with a large scale.

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

Acknowledgements This work has been supported by GNCS–INdAM and Research Grants of the University of Messina. We are very grateful to Bruno Firmani (University of Trento), Piero Lanucara (CASPUR, Rome) and Luigia Puccio (University of Messina) for their help and criticism on mathematical as well as numerical aspects of the paper.

Notes 1. http://www.caspur.it. 2. http://lagrit.lanl.gov/.

References [1] S. Agreste and A. Ricciardello, Simulation of seismic wave propagation in 3d heterogeneous media: A parallel computing approach, Commun. Appl. Ind. Math. 1 (2011), pp. 1–14, doi:10.1685/journal.caim. [2] H. Bao, J. Bielak, O. Ghattas, L. Kallivokas, D. O’Hallaron, J. Shewchuk, and J. Xu, Earthquake ground motion modeling on parallel computers, Proceedings of Supercomputing ’96, Pittsburgh, PA, USA, 17–22 November, 1996, pp. 1–19. [3] P. Bard, Effects of surface geology on ground motion: Recent results and remaining issues, Proceedings of the 10th European Conference on Earthquake Engineering, Vienna, 1995, pp. 305–324. [4] J. Bielak, O. Ghattas, and H. Bao, Ground motion modeling using 3d finite element methods, Proceedings of the 2nd International Symposium on the Effects of Surface Geology on the Seismic Motion, Yokohama, 1998, pp. 121–133. [5] A. Caserta, A time domain finite-difference technique for oblique incidence of antiplane waves in heterogeneous dissipative media, Ann. Geophys. 41 (1998), pp. 617–631. [6] A. Caserta, V. Ruggiero, M. Busico, and I. Opršal, Parallelisation technique for serial 3d seismic codes: Sms approach, Ann. Geophys. 52 (2009), pp. 503–514. [7] E. Chaljub, D. Komatitsch,Y. Capdeville, J. Vilotte, B. Valette, and G. Festa, Spectral-element analysis in seismology, Adv. Geophys. 48 (2007), pp. 365–419. [8] R. Clayton and B. Engquist, Absorbing boundary conditions for acoustic and elastic wave equations, Bull. Seismol. Soc. Amer. 67 (1977), pp. 1529–1540. [9] R. Clayton and B. Engquist, Absorbing boundary conditions for wave-equation migration, Geophysics 45 (1980), pp. 895–904. [10] B. Engquist and A. Majda, Absorbing boundary conditions for the numerical simulation of waves, Math. Comp. 31 (1977), pp. 629–651. [11] B. Engquist and A. Majda, Radiation boundary conditions for acoustic and elastic wave calculations, Comm. Pure Appl. Math. 32 (1979), pp. 313–357. [12] G. Karypis and V. Kumar, MeTiS, A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices – Version 4.0, Army HPC Research Center, Department of Computer Science, University of Minnesota, 1998. Available at http://www.cs.umn.edu/∼karypis. [13] L.A.N. Laboratory, Los Alamos Grid Toolbox, LaGrit, Los Alamos National Laboratory, 2003. Available at http://lagrit.lanl.gov/. [14] N. Newmark, A method of computation for structural dynamics, ASCE J. Eng. Mech. Div. 85 (1959), pp. 67–94. [15] K. Olsen and R. Archuleta, Three-dimensional simulations of earthquakes on the Los Angeles fault system, Bull. Seismol. Soc. Amer. 86 (1996), pp. 575–596. [16] L. Puccio and A. Ricciardello, Algorithms for large sparse matrices based on finite element method, Commun. SIMAI Congr. 3 (2009), pp. 1–11. [17] A. Quarteroni, A. Tagliani, and E. Zampieri, Generalized Galerkin approximation of elastic waves with absorbing boundary conditions, Comput. Methods Appl. Mech. Eng. 163 (1998), pp. 323–341. [18] A. Ricciardello, Finite element method for seismological applications, Ph.D. thesis, Mathematics Department, Messina University, 2011. [19] V. Ruggiero, P. Lanucara, M.P. Busico, A. Caserta, and B. Firmani, Numerical modelling of the ground motion: A parallel approach for finite element methods, in Applied and Industrial Mathematics in Italy: Proceedings of 7th

2060

[20]

[21]

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

[22] [23] [24] [25]

S. Agreste et al. Conference, Venice, Italy, M. Primicerio, R. Spigler, and V. Valente eds., Chap. 1, World Scientific, Singapore, 2004, pp. 487–495. E. Schwabe, G. Blelloch, A. Feldmann, O. Ghattas, J. Gilbert, G. Miller, D. O’Hallaron, J. Shewchuk, and S. Teng, A separator-based framework for automated partitioning and mapping of parallel algorithms for numerical solution of PDEs, Proceedings of 1992 DAGS/PC Symposium, Department of Mathematics and Computer Science, Dartmouth College, Hanover, NH, 23–27 June, 1992, pp. 48–62. J. Shewchuk, A condition guaranteeing the existence of higher-dimensional constrained Delaunay triangulations, SCG ’98: Proceedings of the Fourteenth Annual Symposium on Computational Geometry, Minneapolis, MN, USA, ACM Press, New York, June 7–10, 1998, pp. 76–85. J. Shewchuk, Tetrahedral mesh generation by Delaunay refinement, SCG ’98: Proceedings of the Fourteenth Annual Symposium on Computational Geometry, Minneapolis, MN, USA, ACM Press, New York, June 7–10, 1998, pp. 86–95. R. Stacey, Improved transparent boundary formulations for the elastic-wave equation, Bull. Seismol. Soc. Amer. 78 (1988), pp. 2089–2097. T. Tu, H. Yu, L. Ramirez-Guzman, J. Bielak, O. Ghattas, K.L. Ma, and D. O’Hallaronh, From mesh generation to scientific visualization an end-to-end approach to parallel supercomputing, SC2006, Tampa, FL, 2006. F. Wolf, Scalasca, Helmholtz-University Young Investigators Group, Tennessee University, 2010. Available at http://www.scalasca.org.

Lihat lebih banyak...
A novel parallel approach for 3D seismological problems Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

Santa Agrestea *, Arrigo Casertab , Angela Ricciardelloa and Vittorio Ruggieroc a Department of Mathematics, Messina University, V.le F. Stagno D’Alcontres, 31 98166 Messina, Italy; b INGV, Via di Vigna Murata 605, 00143 Rome, Italy; c CASPUR, Via dei Tizii 6, 00185 Rome, Italy

(Received 27 July 2011; revised version received 24 October 2011; second revision received 12 December 2011; third revision received 2 February 2012; fourth revision received 5 April 2012; accepted 28 May 2012) The large spatial scale associated with the modelling of strong ground motion in three dimensions requires enormous computational resources. For this reason, the simulation of soil shaking requires highperformance computing. The aim of this work is to present a new parallel approach for these kind of problems based on domain decomposition technique. The main idea is to subdivide the original problem into local ones. It allows to investigate large-scale problems that cannot be solved by a serial code. The performance of our parallel algorithm has been examined analysing computational times, speed-up and efficiency. Results of this approach are shown and discussed. Keywords: seismic wave propagation; domain decomposition; finite element method; parallel computing 2010 AMS Subject Classifications: 65Y05; 68W10; 65M55; 86A15

1.

Introduction

During the last 20 years, the interaction between near-surface geological heterogeneities and seismic wave field has been widely studied and a considerable amount of work has been done as far as numerical models of the soil shaking is concerned [3,4]. The large spatial scale of this problem, associated with the complexity of modelling strong ground motion in a three-dimensional (3D) domain, suggests adopting high-performance computing. The use of parallel computing with optimization techniques allows us to make more realistic numerical simulations of the soil shaking features. This cannot be done by using serial computing because gigabytes of memory, performance at gigaflops rate, days of computational time to simulate a minute of soil shaking, to mention a few, are needed. In the literature, most of the authors suggest converting serial codes into parallel one in order to obtain more powerful tools [2,6,15]. On the contrary, in this work, we present an algorithmic approach developed directly as a parallel one, where spatial discretization is based on finite element method (FEM) on non-structured meshes, while Newmark method has been adopted for time integration [7,14]. The main idea of our method is to split the original problem into subproblems with a smaller size, namely local problems. Then, domain decomposition techniques *Corresponding author. Email: [email protected]

ISSN 0020-7160 print/ISSN 1029-0265 online © 2012 Taylor & Francis http://dx.doi.org/10.1080/00207160.2012.700052 http://www.tandfonline.com

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2048

S. Agreste et al.

have been applied, paying attention to both the load balancing problem and nodes shared by multiple processors problems. In particular, the latter are the so-called overlapping nodes. Each processor solves a local problem, and the global solution is obtained combining the local ones at each time step. The program, implemented by Message Passing Interface (MPI) library, runs on the distributed memory cluster named MATRIX at CASPUR computer centre.1 In the algorithm implementation, a significant support is given by two efficient tools available in the literature: LaGrit [13] and MeTiS [12]. The former is to construct the unstructured mesh and the latter is to fix subdomains once the mesh has been already generated. In order to estimate how work load is well distributed, according to the domain decomposition, we have introduced two new parameters, closing nodes and node balancing [1]. Results of a wide experimentation highlight the good quality of load balancing. Moreover, we tried out the performance of our algorithm through the time scheduling, the speed-up and the efficiency. Summarizing, we developed a code that is portable, readable and computationally efficient.

2.

Seismic model

We are interested in numerically modelling and simulating the 3D interaction between the seismic wave field and the near-surface geological structures. This is because such interaction is the main cause of the damage pattern observed during earthquakes. The aforementioned near-surface geological structures can be seen as an elastic homogeneous medium. In this section, we define an initial boundary value problem (IBVP in the following) for such seismological problem. Let R3 be the 3D real space and x = (x, y, z) ∈ R3 a generic vector; let f (x, y) be a regular function in R2 so that a heterogeneous viscoelastic medium ! can be mathematically defined as follows: ! = {(x, y, z) ∈ R3 : z > f (x, y), (x, y) ∈ R2 } whose boundary ∂! = {(x, y, z) ∈ R3 : z = f (x, y), (x, y) ∈ R2 } describes the so-called free surface (the topography of the site under study in seismology) of !. Then, the elastic wave propagation problem through ! is described by the following: ∂ 2u = ∇ · ((λ(x) + 2µ(x))∇u) − ∇ · (µ(x)∇u) + F(x, t) ∀(x, t) ∈ ! × (0, +∞), ∂t 2 u(x, 0) = 0 ∀x ∈ !,

ρ(x)

∂ u(x, 0) = 0 ∂t ∂ σ(x, t) = 0 ∂n

∀x ∈ !,

(1)

∀(x, t) ∈ ∂! × (0, +∞)

where u is the displacement field, the functions ρ(x) (mass density), λ(x) and µ(x) (the Lamé constants) characterize the elastic properties of the medium and F(x, t) is the sum of external forces (the seismic source in seismology). The ∇ · ((λ(x) + 2µ(x))∇u) − ∇ · (µ(x)∇u) represents the inner forces that are the result of strain (ϵ(x) = 21 (∇u + ∇uT )) and stress (σ(x, t) = λ(x)∇ · uI + 2µ(x)ϵ(u)) as it results from Hooke’s law. The afore-described model is a simplification of the one presented in [5]. The system (1) has been numerically integrated by means of FEM on non-structured meshes.

International Journal of Computer Mathematics

2049

The mesh step dh depends on both the maximum frequency fmax of the frequency band belonging to the incoming wave-field and the minimum wave velocity vmin in ! as follows

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

dh =

vmin , nfmax

where n is the number of sampling per wavelength needed to cover the minimum frequency for a non-dispersive propagation [24]. In order to minimize the reflections at the boundary of the model, we introduced the so-called absorbing boundary conditions [8–11,23]. Because our aim is to numerically simulate an infinity half-space, such conditions are designed to avoid spurious reflections from the boundaries of ! that have any physical meaning. In this work, we assumed a generalization of Caserta–Firmani–Ruggiero transparent conditions [19] described in the following. The absorbing boundary conditions we adopted are based on a paraxial approximation. As a consequence, their efficiency depends on the angle of incidence of the incoming wave field on the ‘transparent boundary’. To be more specific, rewriting our IBVP (1) in its general form u¨ −

3 !

h,k=1

E hk u,hk = f,

(2)

the differential operator E in Equation (2) can be written by means of the following matrices: ⎞ ⎛ 2 ⎛ ⎞ vP 0 0 0 vP2 − vS2 0 0 0⎠ , E 11 = ⎝ 0 vS2 0 ⎠ , E 12 = ⎝vP2 − vS2 2 0 0 vS 0 0 0 ⎛ 2 ⎞ ⎛ ⎞ 2 vS 0 0 0 0 vP − vS2 0 0 ⎠, E 22 = ⎝ 0 vP2 0 ⎠ , E 13 = ⎝ 0 0 0 vS2 vP2 − vS2 0 0 ⎛ 2 ⎞ ⎛ ⎞ vS 0 0 0 0 0 0 vP2 − vS2 ⎠ , E 33 = ⎝ 0 vS2 0 ⎠ , E 23 = ⎝0 2 2 2 0 0 vP 0 vP − v S 0 √ where vP = √ (λ + 2µ)/ρ is the velocity of the longitudinal elastic waves (P-wave in seismology) and vS = µ/ρ is the velocity of the shear elastic waves (S-wave in seismology). The operator E in Equation (2) can be approximated by three matrices, A, B and C as follows: & ' ∂ ∂ ∂ 2 A +B +C ≃ E. ∂x ∂y ∂z In order to mathematically define the transparent boundary conditions, at the boundary we can chose A, B and C as follows: E 11 ≃ A2 ,

E 22 ≃ B2 ,

E 12 ≃ AB + BA,

E 23 ≃ BC + CB,

E 13 ≃ AC + CA, E 33 ≃ C 2 .

With the previous choice, the weak formulation with the Galerkin method [17] of the IBVP (2) is d2 d U + M1 U + M0 U = f, (3) dt 2 dt where M2 , M1 and M0 are sparse square matrices whose order is three times the number of nodes. Their entries are the integral on ! of a combination of the basis functions N(x) or their derivatives M2

2050

S. Agreste et al.

(see [16,18] for details). As far as the time discretization is concerned, we applied the Newmark method with a time step equal to δt ≤ dh/vmax in order to preserve the causality principle so that the discretized form of Equation (3) becomes n+1

¨ AU

= bn+1

˙ n+1 = U ¨ n+1 ) ˙ n + δt((1 − γ2 )U ¨ n + γ2 U U

(4)

2

with δt 2 A = M2 + γ2 δtM1 + γ1 M0 , 2 ( ) δt 2 n+1 n+1 ¨ n − (M1 + δtM0 )U ˙ n − M 0 Un . b =f − (1 − γ2 )M1 + (1 − γ1 )M0 U 2 A is a sparse diagonally dominant matrix almost everywhere and its pattern is symmetric with entries like 3 × 3 blocks. For the sake of readability and clarity, we simulate the propagation of an elastic wave field in a homogeneous domain: a cube with L = 1000 m. The spatial component of the forcing term is located in a point that is in the middle of the domain so that it is represented by a delta function δ(x, y, z). On the contrary, the temporal component of the forcing term is represented by the so-called Gabor impulse, G(t) (Figure 1). Its mathematical form is * & ' + ωp (t − ts ) 2 G(t) = exp − cos[ωp (t − ts ) + ψ], (5) γ 0.2

0.15

0.1

0.05 Amplitude

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

˙ n + δt ((1 − γ1 )U ¨ n + γ1 U ¨ n+1 ) Un+1 = Un + δt U 2

0

−0.05

−0.1

−0.15

−0.2

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 1. The Gabor impulse with parameters fp = 0.45, ψ = (π/2), γ = 0.066, ωp = 2πfp and ts = 0.45(γ /fp ).

2051

International Journal of Computer Mathematics −3

x 10

2

1.5

1

Acceleration

0

−0.5

−1

−1.5

−2

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 2. The acceleration trend recorded in the source point P(500 500 500), in a homogeneous domain, that is a cube with L = 1000 m, vP = 2100 m/s, vS = 1200 m/s, fmax = 12 Hz and Nnode = 64,000.

−5

6

x 10

4

2

0

Velocity

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

0.5

−2

−4

−6

−8

−10

−12

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 3. The velocity trend recorded in the source point P(500 500 500), in a homogeneous domain, that is a cube with L = 1000 m, vP = 2100 m/s, vS = 1200 m/s, fmax = 12 Hz and Nnode = 64,000.

2052

S. Agreste et al. −6

8

x 10

6

4

Displacement

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2

0

−2

−4

−6

−8

0

0.2

0.4

0.6

0.8 Time(s)

1

1.2

1.4

1.6

Figure 4. The displacement trend recorded in the source point P(500 500 500), in a homogeneous domain, that is a cube with L = 1000 m, vP = 2100 m/s, vS = 1200 m/s, fmax = 12 Hz and Nnode = 64,000.

where fp is the predominant frequency of the impulse, ωp = 2πfp , ts = 0.45γ /fp , and γ and ψ are the parameters that must be chosen according to the width of the frequency band of the incoming wave field. In other words, our forcing term is not a monochromatic wave field but is made by a set of waves belonging to a frequency band. We depict the evolution of acceleration in Figure 2, of velocity in Figure 3 and of displacement in Figure 4, evaluated in the source point P. In this case, the P-wave velocity is vP = 2100 m/s, the S-wave velocity is vS = 1200 m/s, the maximum frequency is fmax = 12 Hz and the number of grid nodes Nnode = 64,000; with these conditions, the minimum edge is edgemin = 25.00 m, whereas the time step is δt = 4.03e − 004 s and the simulated seconds are 1.6 s.

3.

Parallel algorithm

In developing the parallel code, we have adopted two open-source software packages, LaGrit [13] and MeTiS [12]. In detail, the discretization of the physical domain has been obtained using LaGrit, whose mesh generation uses a Delaunay tetrahedralization algorithm [21,22]. LaGrit is a library realized by Los Alamos National Security, LLC at Los Alamos National Laboratory (LANL) with the US Department of Energy (DOE)2 . In order to distribute the domain among processors, we need to partition the mesh, and for that purpose, we used MeTiS. Its advantage is to have a partitioning algorithm able to minimize the connections among processors by minimizing the edge cut. The parallel algorithm has been implemented by means of MPI library, on the CASPUR cluster, MATRIX.

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

International Journal of Computer Mathematics

2053

Figure 5. MeTiS distribution of 91,125 nodes and 453,655 elements among 10 subdomains. Black points represent the overlapping nodes.

3.1

Domain decomposition

To the purpose of realizing a parallel algorithm, a suitable analysis of mesh partitioning and data parcelling has been achieved. 3.1.1 Mesh partitioning Once LaGrit has generated the mesh, MeTiS takes care in partitioning into subdomains the set of the mesh elements. In the case study presented in the previous section, the numerical grid obtained by LaGrit has 91,125 nodes connected by 453,655 tetrahedra. The Delaunay tetrahedralization algorithm has been applied to our domain. The result of the MeTiS partition with 10 subdomains is shown in Figure 5. Even though that the sets of tetrahedra and nodes provided by MeTiS are disjoint, the subdomains share some nodes anyway. Such nodes are called overlapping nodes. It is crucial to be able to identify them, that is why we implemented an ad hoc procedure. In Figure 5, the overlapping nodes are pointed out by black points. The mesh partitioning is designed in order to both minimize the communication time and care about the load balancing. 3.1.2

Parcelling

Once subdomains are realized, the next step is to distribute such subdomains among processors and this is made by one-to-one mapping between subdomains and processors. In such a framework, it is crucial to be able to identify the coupling between each node and its associated processor. This is realized by generating a global-to-local map [2,20]. When subdomains are distributed on np processors, each subdomain can generate its local mass matrix, A(j) , j = 1, . . . , np, that is sparse, with a symmetric pattern. In fact, the symmetric entries correspond to inner nodes and others correspond to nodes lying on the numerical faces, called boundary nodes. For this reason, we tidy up the inner nodes and then, in queue, the boundary ones. As a consequence, we can store only less than half of the entries of the mass matrices, by the Compact Sparse Row (CSR) format optimized for such kind of matrix, that is, sparse, with a symmetric pattern and suitably reordered matrices.

S. Agreste et al.

The assemblage of local mass matrices is achieved in parallel without communications among processors. The advantage of this approach is that the global mass matrix will be never composed in its explicit form. Some results on the load balancing associated with mesh partitioning are presented in [1], where the authors introduce two new parameters, namely closing nodes β and node balancing α. The former is linked with communications among processors that are evaluated in terms of the overlapping nodes of the parameter called closing nodes. The latter gives information on the data-load distribution. The index β represents the ratio between the total number of overlapping nodes NO and the total number of nodes N. By means of this parameter, we can estimate the growth of the matrix size caused by the overlapping nodes. The dimension of the local mass matrix is Nnode loc = (N/np)(1 + β), where np represents the number of processors. In Figure 6, we present the trend of the parameter β versus np, from 4 to 48. In particular, the cases N = 42,875; 54,872; 64,000; 74,088; 91,125 are shown. The number of overlapping nodes increases with the number of processors. However, if N raises then NO is growing steady. By the way of explanation, we set np = 40. If N = 42,875, then the number of overlapping nodes, with their multiplicity, weights upon number of nodes nearly 35% of N. Otherwise, if N = 91,125, then it weights upon nearly 26%. Node balancing α is the ratio between the sum of nodes of each processor Nt and the total number of nodes times the number of processors, N × np. To the purpose of investigating α, we considered N and np as above. In Figure 7, we present the trend of the parameter α. Then, increasing the number of nodes, α approaches to the ideal case (α = 1/np) represented by red star points. It means that work load is well distributed among processors.

0.4 42875 NODES 54872 NODES 64000 NODES 74088 NODES 91125 NODES

0.35

0.3

0.25

β

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2054

0.2

0.15

0.1

0.05

0

5

10

15

20

25

30

35

40

np

Figure 6.

Closing nodes versus number of processors. In the legend is the number of total nodes.

45

50

2055

International Journal of Computer Mathematics 0.35

42875 NODES 54872 NODES 64000 NODES 74088 NODES 91125 NODES

0.3

0.2

α

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

0.25

0.15

0.1

0.05

0

Figure 7.

3.2

0

5

10

15

20

25 np

30

35

40

45

50

Node balancing versus number of processors.

Solver phase

Once local mass matrices are stored, each processor solves its own linear system, that is, Equation (4) reduced to its own subdomain, at each time step n, that is, A(j) U¨ (j) n = bn ,

j = 1, . . . , np.

(6)

To the purpose of increasing the Newmark stability, we adopted γ1 = 41 and γ2 = 21 [7,14]. The obtained mass matrices are sparse, diagonally dominant a.e. and with a symmetric pattern. For this reason, the solver is based on the Gauss–Seidel method, implemented for the optimized CSR matrix storage [16]. Solving the system (6) by the Gauss–Seidel method, at time n, means finding the vector of the sequence indexed by k , , b(j) (i) − li A(j) (i, l)U¨ k−1 (l − 1) U¨ (j)k (i) = A(j) (i, i) ∀j = 1, . . . , np

such that

or rather for fixed tolerances tol1 and tol2 .

and

∀i = 1, . . . , Nnode loc

∥U¨ (j)k − U¨ (j)k−1 ∥ < tol1 ∥U¨ (j)k ∥ ∥A(j) U¨ k − b(j) ∥ < tol2

2056

S. Agreste et al.

Therefore, at each time step n, we have np local solutions U¨ (j) that are Nnode loc size vectors. The local solutions are assembled with the aim of reconstructing the global one by means of the global-to-local map, that is, map ¨ (j) U¨ (j) (ilocal ) U (iglobal ) * −→

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

∀j = 1, . . . , np

and

∀ilocal = 1, . . . , Nnode loc .

Thus, we obtain np vectors with size Nnode , whose non-zeros entries are only the corresponding U¨ (j) (ilocal ) elements, with ilocal = 1, . . . , Nnode loc . Note that on the overlapping nodes, we have as local solutions as their multiplicity, that is, the number of subdomains sharing them. For this reason, we applied a weighted mean, where the weights r (j) (i) are proportional to the node multiplicity in each subdomains. ¨ U(i) =

np ! j=1

U¨ (j) (i)r (j) (i) ∀i = 1, . . . , Nnode .

Because the local convergence does not ensure the overall one, there is a need to carry out the control of global norm at each time step n, that is, - n -∥U¨ ∥ − max ∥U¨ (j)n ∥- < tol j

with tol formerly fixed. It has been numerically proved that our choice of time step and space step guarantees both local and global convergence for the system (6). To summarize, at each time step, processors swap the results on overlapping nodes to obtain the weighted solution. Its norm is computed once the global solution has been reconstructed. All processors update their solution if the norm is less than a fixed tolerance.

4.

Numerical results

In this section, we present the performance of our algorithm. Numerical tests have been realized on a distributed memory cluster called MATRIX at CASPUR computer centre. It consists of 320 nodes, divided into front-end nodes, free interactive nodes and number-crunching nodes. All of them are two-way quad-core Opteron 2.1 GHz with 16 GB of RAM, except for the numbercrunching nodes which are equipped with 32 GB of RAM. MPI library has been adopted because it accommodates the partitioning of the problem. In order to evaluate the behaviour of our parallelization, we considered the computational times, the speed-up and the efficiency parameters. Computational times. In order to evaluate the performance of our parallel algorithm, we measured and analysed the runtime phase by means of a software tool, namely Scalasca [25]. In total, 10,000 time iterations have been executed with 42,875 nodes and 91,125 nodes by 8, 24, 48 cores. In Table 1, we report the percentage of the time spent in each simulation and distinguish three different categories. Type MPI refers to function calls to the MPI library. User program routines on paths that directly or indirectly call MPI provide valuable context for understanding the communication and synchronization behaviour of the parallel execution and are distinguished with the COM type from other routines that are involved with purely local computation marked USR. In Table 1, we can note that the MPI time increases with a great number of cores. In particular, the MPI time exceeds the 50% of the total time using 48 cores with N = 42,875. An improvement

2057

International Journal of Computer Mathematics Table 1. The percentage of time spent during the simulations when the mesh has 42,875 and 91,125 nodes. Category

8 PROC

24 PROC

48 PROC

42,875

MPI COM USR

4.97 0.77 94.26

20.23 0.82 78.92

51.13 0.80 47.26

91,125

MPI COM USR

6.36 1.04 92.60

13.38 0.90 85.72

32.90 0.71 66.07

of parallel performance is obtained by increasing the number of nodes. This slightly hinder the parallel scalability of the algorithm as widely discussed afterwards. Speed-up. S(n) is a useful index to check any gain in the performance with respect to the serial code. It is defined as the ratio between real time employed by one processor to run the code, T (1), and the real time needed by n processors, T (n), namely S(n) =

T (1) . T (n)

Note that the time T (1) is around 10,120 s with 42,875 nodes and is around 23,626 s with 91,125 nodes. For our modelling, the speed-up scaling is satisfactory up to 40 processors (Figure 8). Beyond this threshold, the saturation process takes place. This means that the communication time is near Speed up 50 Ideal case 233512 elem 300940 elem 345445 elem 413526 elem 453655 elem

45 40 35 30 25

Sp

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

Node number

20 15 10 5 0

Figure 8.

0

5

Speed-up.

10

15

20 25 30 Number of processors

35

40

45

50

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

2058

S. Agreste et al.

Figure 9.

Efficiency.

to the computational time needed by each processor to execute its work load. In other words, it means that we lost in communications the time gained increasing the number of processors. This because the speed-up depends both on the number of processor and on the domain decomposition. Efficiency. E(n) is defined by the ratio between the speed-up and the number of processors, namely S(n) E(n) = . n It is a measure of the relevance of the relative utilization of the resources of a parallel computer. The efficiency trend in Figure 9 shows that in our case study 30% of the running time is wasted only in communication and synchronization. During the rest of the time, processors run in parallel. At each time step, the processors swap information on overlapping nodes and this step is fundamental to synchronize the local solutions for obtaining the global one. Analysing results, it is clear that the speed-up saturation starts when the number of processors is near to 40. It means that we lost in communication the time gained increasing the number of processors. Then, the speed-up depends both on the number of processors and on the domain decomposition.

5.

Conclusion

In this work, we studied an IBVP describing the dynamics of interactions between seismic wave field and near-surface geological structures. A novel approach has been presented to solve this problem in parallel. By means of the domain decomposition technique, the original problem has been split into local problems. This subdivision allowed us to investigate those large-scale

International Journal of Computer Mathematics

2059

problems that a serial code is not able to solve. Closing nodes and node balancing have been introduced and analysed in order to stress the performance of our domain decomposition scheme. Speed-up and efficiency support and confirm such performance. Our numerical code is portable, readable and computationally efficient, which is designed and tuned for, but not limited to, seismological applications able to simulate a 3D seismological problem with a large scale.

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

Acknowledgements This work has been supported by GNCS–INdAM and Research Grants of the University of Messina. We are very grateful to Bruno Firmani (University of Trento), Piero Lanucara (CASPUR, Rome) and Luigia Puccio (University of Messina) for their help and criticism on mathematical as well as numerical aspects of the paper.

Notes 1. http://www.caspur.it. 2. http://lagrit.lanl.gov/.

References [1] S. Agreste and A. Ricciardello, Simulation of seismic wave propagation in 3d heterogeneous media: A parallel computing approach, Commun. Appl. Ind. Math. 1 (2011), pp. 1–14, doi:10.1685/journal.caim. [2] H. Bao, J. Bielak, O. Ghattas, L. Kallivokas, D. O’Hallaron, J. Shewchuk, and J. Xu, Earthquake ground motion modeling on parallel computers, Proceedings of Supercomputing ’96, Pittsburgh, PA, USA, 17–22 November, 1996, pp. 1–19. [3] P. Bard, Effects of surface geology on ground motion: Recent results and remaining issues, Proceedings of the 10th European Conference on Earthquake Engineering, Vienna, 1995, pp. 305–324. [4] J. Bielak, O. Ghattas, and H. Bao, Ground motion modeling using 3d finite element methods, Proceedings of the 2nd International Symposium on the Effects of Surface Geology on the Seismic Motion, Yokohama, 1998, pp. 121–133. [5] A. Caserta, A time domain finite-difference technique for oblique incidence of antiplane waves in heterogeneous dissipative media, Ann. Geophys. 41 (1998), pp. 617–631. [6] A. Caserta, V. Ruggiero, M. Busico, and I. Opršal, Parallelisation technique for serial 3d seismic codes: Sms approach, Ann. Geophys. 52 (2009), pp. 503–514. [7] E. Chaljub, D. Komatitsch,Y. Capdeville, J. Vilotte, B. Valette, and G. Festa, Spectral-element analysis in seismology, Adv. Geophys. 48 (2007), pp. 365–419. [8] R. Clayton and B. Engquist, Absorbing boundary conditions for acoustic and elastic wave equations, Bull. Seismol. Soc. Amer. 67 (1977), pp. 1529–1540. [9] R. Clayton and B. Engquist, Absorbing boundary conditions for wave-equation migration, Geophysics 45 (1980), pp. 895–904. [10] B. Engquist and A. Majda, Absorbing boundary conditions for the numerical simulation of waves, Math. Comp. 31 (1977), pp. 629–651. [11] B. Engquist and A. Majda, Radiation boundary conditions for acoustic and elastic wave calculations, Comm. Pure Appl. Math. 32 (1979), pp. 313–357. [12] G. Karypis and V. Kumar, MeTiS, A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices – Version 4.0, Army HPC Research Center, Department of Computer Science, University of Minnesota, 1998. Available at http://www.cs.umn.edu/∼karypis. [13] L.A.N. Laboratory, Los Alamos Grid Toolbox, LaGrit, Los Alamos National Laboratory, 2003. Available at http://lagrit.lanl.gov/. [14] N. Newmark, A method of computation for structural dynamics, ASCE J. Eng. Mech. Div. 85 (1959), pp. 67–94. [15] K. Olsen and R. Archuleta, Three-dimensional simulations of earthquakes on the Los Angeles fault system, Bull. Seismol. Soc. Amer. 86 (1996), pp. 575–596. [16] L. Puccio and A. Ricciardello, Algorithms for large sparse matrices based on finite element method, Commun. SIMAI Congr. 3 (2009), pp. 1–11. [17] A. Quarteroni, A. Tagliani, and E. Zampieri, Generalized Galerkin approximation of elastic waves with absorbing boundary conditions, Comput. Methods Appl. Mech. Eng. 163 (1998), pp. 323–341. [18] A. Ricciardello, Finite element method for seismological applications, Ph.D. thesis, Mathematics Department, Messina University, 2011. [19] V. Ruggiero, P. Lanucara, M.P. Busico, A. Caserta, and B. Firmani, Numerical modelling of the ground motion: A parallel approach for finite element methods, in Applied and Industrial Mathematics in Italy: Proceedings of 7th

2060

[20]

[21]

Downloaded by [CIBER Consortia CASPUR - CIBER] at 04:55 15 October 2012

[22] [23] [24] [25]

S. Agreste et al. Conference, Venice, Italy, M. Primicerio, R. Spigler, and V. Valente eds., Chap. 1, World Scientific, Singapore, 2004, pp. 487–495. E. Schwabe, G. Blelloch, A. Feldmann, O. Ghattas, J. Gilbert, G. Miller, D. O’Hallaron, J. Shewchuk, and S. Teng, A separator-based framework for automated partitioning and mapping of parallel algorithms for numerical solution of PDEs, Proceedings of 1992 DAGS/PC Symposium, Department of Mathematics and Computer Science, Dartmouth College, Hanover, NH, 23–27 June, 1992, pp. 48–62. J. Shewchuk, A condition guaranteeing the existence of higher-dimensional constrained Delaunay triangulations, SCG ’98: Proceedings of the Fourteenth Annual Symposium on Computational Geometry, Minneapolis, MN, USA, ACM Press, New York, June 7–10, 1998, pp. 76–85. J. Shewchuk, Tetrahedral mesh generation by Delaunay refinement, SCG ’98: Proceedings of the Fourteenth Annual Symposium on Computational Geometry, Minneapolis, MN, USA, ACM Press, New York, June 7–10, 1998, pp. 86–95. R. Stacey, Improved transparent boundary formulations for the elastic-wave equation, Bull. Seismol. Soc. Amer. 78 (1988), pp. 2089–2097. T. Tu, H. Yu, L. Ramirez-Guzman, J. Bielak, O. Ghattas, K.L. Ma, and D. O’Hallaronh, From mesh generation to scientific visualization an end-to-end approach to parallel supercomputing, SC2006, Tampa, FL, 2006. F. Wolf, Scalasca, Helmholtz-University Young Investigators Group, Tennessee University, 2010. Available at http://www.scalasca.org.

Somos uma comunidade de intercâmbio. Por favor, ajude-nos com a subida ** 1 ** um novo documento ou um que queremos baixar:

OU DOWNLOAD IMEDIATAMENTE