Parallel short-range molecular dynamics using the Ādhāra runtime system


Computer Physics Communications 102 (1997) 28-43

S.G. Srinivasan a, I. Ashok b,1, Hannes Jónsson c,2, Gretchen Kalonji a, John Zahorjan b

a Department of Materials Science and Engineering, University of Washington, Seattle, WA 98195, USA
b Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
c Department of Chemistry, University of Washington, Seattle, WA 98195, USA

1 Present address: Sonitech International, Wellesley, MA, USA.
2 E-mail: [email protected]

Received 12 August 1996

Abstract

We describe a parallel program for simulating molecular dynamics subject to short-ranged molecular interactions. The program uses a runtime system, Ādhāra, for atom definition, partitioning and distribution over the processors, as well as dynamic load balancing between processors. A significant speedup is achieved by eliminating the usual test for the minimum image criterion, implemented with conditional statements, that is applied to pairs of atoms during the force calculation. Instead, coordinates of boundary atoms are adjusted for the periodic boundary condition just before communication. The algorithm uses domain decomposition and the link-cell method as well as a neighbor list within cells. Benchmark simulations of systems with 864 and up to 5 000 000 atoms run on an Intel Paragon computer are described.

PACS: 02.70.-c; 02.70.Ns; 89.90.+h; 89.90.+n
Keywords: Molecular dynamics; Parallel computers; Runtime system; Short range; Benchmark

1. Introduction

It requires no leap of imagination to recognize the potential rewards of exploiting the parallelism [1-3] inherent in the core of molecular dynamics (MD) simulations [4,5]. Parallelism refers to the simultaneous computation on a large number of independent data objects using multiple, interconnected processors working cooperatively. The parallelism of molecular dynamics is in the computationally intensive force evaluation. An immediate and hence tangible outcome of a shift to parallel computers is the speeding up of a given simulation, and the enhancement of our ability

to simulate larger systems and/or longer times. This has been enabled by the advent of parallel computers. Several interesting concepts and innovative approaches for the design of efficient parallel molecular dynamics simulation algorithms have been presented [6-20]. Initial parallel computers were, in many ways, reminiscent of the early days of sequential computing, when mastery of the machine hardware was mandatory to develop good codes and the software was machine specific [9,11]. Several different parallel computer architectures have been proposed, but after nearly a decade of evolution the emphasis has shifted to the multiple-instruction-multiple-data (MIMD) model [6]. However, with the vast array of available processor hardware, an important goal has been to develop methodologies to make parallel codes as portable as possible.


Towards this end, efforts have been directed to create translators, similar to the vectorization compilers, to permit recycling of the vast body of sequential codes onto parallel computers [1]. Another approach advocates building special-purpose computers for specific problems such as MD [7,10,11]. More recently, the focus has shifted to building runtime systems dedicated to a class of specialized applications [21]. These systems are expected to prove advantageous, i.e., to deliver high efficiency, convenience of programming and portability, because the intrinsic parallelism of the problem is included in their basic design. Ādhāra, meaning support in Sanskrit, is a runtime system developed by Ashok and Zahorjan [21,22] to support dynamic space-based simulations (electromagnetic particle-in-cell and rarefied fluid flow problems). We have tailored Ādhāra to support short-range molecular dynamics simulations using a domain decomposition method. Ādhāra supports several basic tasks such as atom definition, partitioning and distribution over processors. In addition, it incorporates mechanisms for sharing the required boundary atoms between processors and smart schemes for dynamic load balancing. These features allow the programmer to express the computation and parallelism in MD naturally. Thus, we are able to produce efficient, scalable and portable codes for execution on distributed-memory multi-processor machines. The term scalable refers to the ability to deliver performance directly proportional to size; in an ideal scaling, doubling the number of processors must cut the job execution time in half. The code is written in a high-level language, ANSI C. In the simulations, the atoms are treated as point masses with short-range interactions. We have incorporated both pairwise (such as the Lennard-Jones [4]) and many-body (such as the embedded atom method [23]) potentials in our code. The short-range interaction computations are spatially local, i.e., particles separated beyond a certain characteristic distance do not interact with one another [4]. Two particles in an interacting pair are called neighbors. We use a combination of the link-cell method [4,24] and neighbor lists [4,25] to efficiently find interacting neighbors. Our simulation space can be two- or three-dimensional, and periodic boundary conditions (PBC) [4] can be imposed. The size and shape of


our simulation box can also be varied using the constant-pressure MD of Parrinello and Rahman [26]; the simulations described in this paper, however, use only the constant volume ensemble.

Our paper is structured as follows. Section 2 reviews the literature on parallel MD and highlights some existing challenges. Section 3 outlines the organization of Ādhāra and the data structures used in its design. In Section 4, we benchmark our molecular dynamics code and compare its performance with that of others [6]. We conclude with a summary and suggestions for improvement.

2. Parallel molecular dynamics

Several critical programming issues arise when parallel computers are used. Formulation of practical and efficient schemes for decomposing, or distributing, the MD computation among all the processors is probably the most important one. Atoms in MD are dynamic and may flow into nonuniform and rapidly changing spatial distributions. In this scenario two key prerequisites must be met to develop efficient MD codes: keeping all the nodes equally busy, which is called load balance, and minimizing the interprocessor communication. In addition, one must strive to minimize sequential sections and maximize the parallel components in the code to enhance its performance. In the following we use the terms node and processor interchangeably.

Any scheme that distributes the computation among processors must ensure a consistent and reasonably equitable decomposition of the work [1,6,9]. Uneven workload distribution causes idling by some processors as they wait for the other, overloaded nodes to finish their task. This is the load-imbalance overhead. Workload may be quantified in terms of the total number of local atoms in each node, the size of the local pairs-list, the time taken by each node to compute a molecular dynamics time step, or some combination of these. The equal-load criterion is met trivially for long-range interactions when an equal number of atoms reside in each node. For short-range interactions, on the other hand, it must be enforced [6,9,13].

The distribution of atoms across processors results in a need to communicate boundary atom information between the nodes during force computation.


Fig. 1. Choices for partitioning a three-dimensional space into eight rectangular domains. The three schemes differ by the total number of dimensions they partition: BLOCK partitions three dimensions, BEAM two dimensions (3 different ways), and SLICE one dimension (3 different ways).

Communication also accrues when the decomposition strategy necessitates the transfer of management of atoms to another processor, as when an atom moves out of the geometrical region managed by one node into another processor's domain. This diverts the CPU to communications and introduces communication overhead. The communication characteristics are very hardware dependent; factors such as the start-up time (latency) for each communication event, the message bandwidth, and the relative rate of computation to communication must therefore be incorporated into code design [16]. In synopsis, an efficient and scalable MD code must simultaneously minimize the communication overhead and maximize the load balance among processors.

A number of papers [6,8-10,12-16,18-20] describe the different strategies used for parallelization of molecular dynamics. We summarize the two main approaches advocated for the division of computational workload among the processors. The first method, called the atom-decomposition or replicated-data method [6,12,13], decomposes the computational objects (particles or atoms) into subsets and permanently assigns the computation of each to a unique processor. Every processor also stores the coordinates of all the system particles, but it is only responsible for the computations involving its resident atoms. To compute forces at each step, each processor must have a complete list of current atomic coordinates from all the other nodes. This is accomplished by a global communications step at a cost that increases with system size and the number of processors used in the simulation.

The large memory consumption and the global communications make this method inappropriate for large-scale MD [6,8,9,12-15,18,20]. The advantage of the replicated-data method is that it is easy to program, the computation can be trivially balanced across the processors, and it is geometry independent.

In the second method, called the domain-decomposition or spatial-decomposition method [6,13], the computational space is partitioned into disjoint subspaces, or domains, each assigned to a unique processor. This assignment may be permanent, or the domains may be resized dynamically to balance the computational load [1,9,27]. Seven possible ways to partition a three-dimensional space are shown in Fig. 1. Most implementations, including ours, require the domain height in each dimension to be at least the force interaction range and confine the communication to nearest-neighbor domains. All the atoms that fall within a domain are called local atoms and they belong to the processor managing that domain. Thus, the atom population in each processor (or domain) changes continually during the simulation, as the ownership of each atom transfers to the node managing the region in which it currently resides. Nodes must share information on atoms lying close to domain boundaries to compute forces. Boundary atoms in node-α for which information is sent to node-β are called nonlocal atoms in node-β. The domain-decomposition method, in contrast to the replicated-data method, uses only local communications for this purpose. Thus, the communication costs scale with the number of particles lying close to the domain boundaries instead of with the system size. The storage requirement is proportional to the number of atoms owned by each node.

A node need not necessarily do all the computation for the atoms it manages; for example, the forces on atoms lying close to an interprocessor boundary may be computed by only one of the nodes. The spatial-decomposition scheme is, therefore, most appropriate for large-scale MD. Complexity of coding, strong dependence of the performance on the problem geometry, and sensitivity to load imbalances, however, are its complications.

To summarize, interprocessor communication is required to keep the nonlocal atom coordinates current. This overhead can be trimmed by exploiting the spatial locality of the short-range forces via a suitable partitioning scheme, as shown in Fig. 1. The load balance is sustained by ensuring that the domains in the chosen partitioning scheme have (nearly) equal work; this is the task of balancing the decomposition. Only domain-decomposition MD adequately resolves both these issues simultaneously and is, hence, the preferred method for large systems. In what follows we deal with this scheme.

3. Ādhāra runtime system

A detailed description of Ādhāra is presented elsewhere [21,22]. In this paper we only outline those features relevant to molecular dynamics. Ādhāra is implemented as a library of functions written in ANSI C, and the communication is based on simple message-passing support that is independent of processor connectivity. It presents a distributed-memory, single-program multiple-data (SPMD) model. The MD code is compiled and linked with the Ādhāra runtime library and the system-dependent message-passing library, as shown in Fig. 2.


Fig. 2. Compiling molecular dynamics code using Ādhāra.

3.1. An overview of Ādhāra methods

Ādhāra is organized into six modules: the phase manager, message handler, operating system interface, atom manager, atom redistributor and load balancer. The functions of these modules are briefly outlined below.

(a) Phase manager
Ādhāra divides the computation into phases and computes on the primary data structure of each phase. Molecular dynamics computes on the ATOM data structure in the compute phase, which includes the coordinate updates, force evaluation and velocity updates carried out for each atom. During each molecular dynamics time step, the phase manager first calls the load balancer and then executes the compute-phase construct. If the load balancing results in a new partition, it invokes the appropriate data redistributor function to preserve spatial locality of the atoms.

(b) Message handler
High-level, machine-independent support to send and receive messages is provided by this module. It buffers messages and combines messages going to the same destination node. This minimizes the number of operating-system-level messages. A message may carry the coordinates of nonlocal atoms received from a neighboring node for the purpose of computing forces, or the returned values of the forces computed for those atoms. The module knows how to combine the received data with existing data. For example, a node adds the force it computed for one of its boundary atoms to the value computed and returned by a neighboring node.


(c) Operating system interface
All the machine-dependent message-passing functions are confined to this module. This ensures easy portability. Many Ādhāra modules invoke operating system interface functions to actually send or receive messages, compute the global sum of a variable, and so on.

(d) Atom manager
This module creates and manages the link-cell framework in each node. The atom manager inserts each local and nonlocal atom into the appropriate slot of the link-cell grid, deletes local atoms that have moved out of its domain, adjusts the coordinates of nonlocal atoms to impose periodic boundary conditions, and generates the link-lists of interacting atom pairs. It maintains separate link-lists for local and nonlocal atoms.

(e) Atom redistributor
The atom redistributor redistributes atoms that have moved to another domain. It also computes the schedules for all boundary atoms that must be communicated to, or received from, neighboring domains for pair-list generation. For example, a send schedule consists of a boundary subregion from which atoms are to be taken, and its destination node.

(f) Load balancer
Computing an ideal partition can be very expensive in molecular dynamics, as the load distribution varies spatially and also with time. Hence, Ādhāra determines a reasonably good decomposition and then rebalances the load only when the load imbalance exceeds a specified threshold value. It employs a distributed scheme to communicate the estimated load and to recompute the partition. The load-balancing (LB) frequency may be specified by the user. The LB costs are amortized over the time steps between each LB event. The details are described elsewhere [27].
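The balancing scheme itself is detailed in [27]. Purely as an illustration of a threshold test of this kind, the C sketch below compares the busiest node's estimated load against the average and reports whether rebalancing is warranted; the max/average imbalance measure, the function name and the threshold semantics are our assumptions, not Ādhāra's actual interface.

```c
/* Illustrative sketch only: a threshold-based load-balance trigger.
 * The max/average imbalance measure and all names are assumptions,
 * not Adhara's actual interface. */
static int needs_rebalance(const double *node_load, int nnodes,
                           double threshold)
{
    double max_load = 0.0, sum = 0.0, avg;
    int i;

    for (i = 0; i < nnodes; i++) {
        sum += node_load[i];
        if (node_load[i] > max_load)
            max_load = node_load[i];
    }
    if (sum <= 0.0)
        return 0;
    avg = sum / nnodes;
    /* Rebalance when the busiest node exceeds the average load by more
     * than the user-specified threshold fraction. */
    return (max_load / avg - 1.0) > threshold;
}
```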

3.2. Basic concepts of molecular dynamics used in Ādhāra

The computation space is a three-dimensional rectangular box, specified by its edges and of fixed volume, that contains the system atoms. It is partitioned by the domain decomposition method into rectangular blocks of reasonably equal load but possibly unequal volumes, as shown in Fig. 1. Rectangular partitions are used because of the relative simplicity of their maintenance and repartitioning. We use a link-cell grid for speeding up the force computations in MD. Hence, basing our partition scheme on this framework should be beneficial. The desired partitioning scheme is specified by the user. This design enables us to exploit the spatial locality of the particle data and balance the workload among the regions. We have implemented two- and three-dimensional MD systems, but use only one- and two-dimensional examples in our illustrations.

(a) Box partitioning and domain formation
The three-dimensional simulation box can be partitioned in the seven different ways shown in Fig. 1. This can be done either by a nonhierarchical or by a hierarchical scheme, as shown in Fig. 3. In a nonhierarchical method the dimensions of the simulation box are partitioned in a random order. A hierarchical scheme [22,27] is better suited for MD as it can easily adapt to preferential atom movement. For example, if all the atoms move along the Y-dimension, only the partitions at level-2 need to be adjusted. Ādhāra uses hierarchical partitions.

(b) Spatial discretization hierarchy

The partitioning into equal-load regions can be done either by directly sorting the atoms by their coordinates or by discretizing the box into fine rectangular cells and estimating the load in each cell [22]. The sorting gives a finer balance but is very expensive. We use a two-level spatial discretization to partition the box and balance the load. At the basic level we discretize the simulation box into subcells, much finer than the force interaction range, using a regular static grid called the CS-Grid (for Computational-Space Grid). At the second level we have the load-discretizing grid, or LD-Grid. We call the cells formed by the CS-Grid and the LD-Grid CS-Cells and LD-Cells, respectively. The entire discretization process is explained below. The essential function of the CS-Grid is to discretize the double precision atom coordinates for load balancing and to represent domain boundaries. Atom coordinates discretized by the CS-Grid have integer values and are called CS-Coords; this discretization is done whenever the link-lists are updated. A rectangular domain block is defined by specifying the integer CS-Coords of its diagonally opposite corners. As an example, a two-dimensional box discretized by an 8 × 8 CS-Grid and partitioned among four nodes is shown in Fig. 4.


Fig. 3. This figure illustrates partitioning of a two-dimensional box into nine domains by (a) nonhierarchical and (b) hierarchical partitioning schemes. In the former, the two dimensions are partitioned in a random order. In the latter, the box is first partitioned into three strips along the X-direction. Each strip is then divided into three pieces, each having three atoms, along the Y-direction. In our hierarchical scheme, the number of partition levels equals the number of dimensions (two).

Fig. 4. Discretization of a two-dimensional box by an 8 × 8 CS-Grid (shown by dashed lines) and its partitioning into four domains D0 to D3, given by the CS-Coords of their diagonally opposite corners: D0: [(0,0),(2,2)], D1: [(0,3),(4,7)], D2: [(3,0),(7,2)], D3: [(5,3),(7,7)]. The partition boundaries are represented by thick lines.

The load-discretizing grid, or LD-Grid, is built from blocks of CS-Cells. It is used to estimate the spatial atom distribution for the load-balancing operation, by counting the number of atoms in each of its cells. Thus, an LD-Cell can be as fine as a CS-Cell or an integer multiple thereof. The important point is that the domain partition boundaries coincide with the LD-Grid and can be moved in increments of the LD-Cell width; these points are illustrated in Fig. 5.
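To make the two-level discretization concrete, the sketch below maps a double precision coordinate onto an integer CS-Coord and then onto an LD-Cell index. The function names, the clamping at the box edges and the assumption that an LD-Cell spans a fixed integer number of CS-Cells are ours; the paper does not give this code.

```c
/* Hypothetical sketch of the two-level discretization; names and the
 * exact rounding convention are illustrative assumptions. */

/* Map a coordinate x in [box_lo, box_hi) to an integer CS-Coord in
 * [0, n_cs_cells). */
static int cs_coord(double x, double box_lo, double box_hi, int n_cs_cells)
{
    double frac = (x - box_lo) / (box_hi - box_lo);
    int c = (int)(frac * n_cs_cells);

    if (c < 0) c = 0;                       /* clamp round-off at the edges */
    if (c >= n_cs_cells) c = n_cs_cells - 1;
    return c;
}

/* An LD-Cell is an integer multiple of CS-Cells wide, so the LD index is
 * an integer division of the CS-Coord. */
static int ld_cell(int cs, int cs_cells_per_ld_cell)
{
    return cs / cs_cells_per_ld_cell;
}
```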

(c) ATOM data structure
All the information for each atom is stored together in a C data structure called ATOM. We use C pointer-based doubly linked lists to associate atoms sharing common attributes, such as lying in the same link-cell or being managed by the same node. Each processor computes link-cell indices and CS-Coords from the atom coordinate values. All the atoms with CS-Coords lying within the boundary of a processor's domain are identified as local atoms of that domain.
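The paper does not list the fields of ATOM; the following is only a plausible layout consistent with the description above (double precision coordinates, velocities and forces, integer CS-Coords, and pointers for the doubly linked lists). All field names, and the fixed three-dimensional layout, are assumptions.

```c
/* Hypothetical layout of the ATOM record; field names are assumptions. */
#define NDIM 3

typedef struct atom {
    double x[NDIM];         /* double precision coordinates                */
    double v[NDIM];         /* velocities                                  */
    double f[NDIM];         /* accumulated forces                          */
    int    cs[NDIM];        /* integer CS-Coords on the CS-Grid            */
    int    tag;             /* global atom identifier                      */
    struct atom *next;      /* doubly linked list used to group atoms that */
    struct atom *prev;      /* share an attribute: same link-cell, same    */
                            /* owning node, local vs. nonlocal, etc.       */
} ATOM;
```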


Fig. 5. One-dimensional discretization and partitioning of a box among two nodes, illustrating the relation between the CS-Grid, the LD-Grid and the partition boundaries. The CS-Grid is shown by dashed lines, the LD-Grid by thin solid lines and partition boundaries by thick lines. The LD-Grid coincides with every fourth CS-Grid boundary, i.e. an LD-Cell is four CS-Cells wide. All boundaries are aligned with the CS-Grid lines. We control the increments of partition movement during load balancing by adjusting the width of the LD-Cell.

Each node identifies the local atoms it is responsible for, obtains memory for them from the free pool and inserts them into the appropriate lists. Replicas of nonlocal atoms, communicated to a node by its neighboring processors, are also stored in local memory although they lie outside the node's domain.

(d) Sharing boundary atom information between processors
The atom redistributor computes the windows of processor border subregions from which atoms must be sent to neighboring nodes prior to force computations. This information is stored in the form of communication schedules. These schedules consist of windows of domain border subregions, each at least as wide as the force interaction range and specified in terms of their CS-Coords, and their destination nodes. These schedules are recomputed whenever the domain boundaries are shifted due to load balancing. The atoms lying within these cells can now be easily determined from the list scheme outlined earlier, and their coordinates can be sent to neighboring nodes for force computation. All the border atoms in a node having the same destination node are buffered and sent together. An atom in a border cell is sent only if its CS-Coords are within the interaction range (plus a small buffer) of the domain boundary. We put the already existing CS-Grid and the atom CS-Coords to the auxiliary use of approximating atom distances to domain boundaries and to other atoms during pair generation.

The integer comparison used here, as a first approximation, speeds up this computation. The distances for the force evaluation are then computed using the double precision atom coordinates. Computation of the communication schedules is illustrated in Fig. 6 using a two-dimensional box divided among four nodes. For a three-dimensional box, we send the atoms in the thirteen border subblocks in thirteen of the twenty-six directions to avoid duplication of atom pairs used in the force calculations. In Fig. 6, note that the boundary atoms are sent only in the negative (X, Y) directions.
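A send schedule, as described above, pairs a border window (given in CS-Coords) with a destination node. The sketch below is a hypothetical rendering of that idea; the structure layout, the containment test and all names are our assumptions.

```c
/* Hypothetical send schedule: a border window in integer CS-Coords
 * together with its destination node. All names are assumptions. */
#define NDIM 3

typedef struct {
    int lo[NDIM];       /* CS-Coords of the window's lower corner   */
    int hi[NDIM];       /* CS-Coords of the window's upper corner   */
    int dest_node;      /* node that receives atoms in this window  */
} SEND_SCHEDULE;

/* An atom is buffered for sending if its CS-Coords fall inside the
 * window; the cheap integer comparison stands in for a double
 * precision distance test at this stage. */
static int in_send_window(const int cs[], const SEND_SCHEDULE *s)
{
    int d;

    for (d = 0; d < NDIM; d++)
        if (cs[d] < s->lo[d] || cs[d] > s->hi[d])
            return 0;
    return 1;
}
```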

(e) Link-cell framework and pairs-list generation
A combination of the standard link-cell method and the neighbor table method is used to speed up the pairs-list generation and the force computation. The link-cells, each at least as wide as the interaction range, are the coarsest slots and can be loosely treated as another level of spatial discretization. To generate pairs, we loop through the coarse link-cells and pair up atoms in each cell with atoms within the same cell and 13 of the 26 neighboring cells. We use the integer CS-Coords to eliminate pairs separated beyond the force interaction range. Once constructed, the pairs list is reused until the next link-list update. In each pair (atom1, atom2), we ensure that atom1 is always a local atom and atom2 is positioned in the spatially positive direction with respect to atom1. Since we move the partition boundaries in increments of the LD-Cell width, a link-cell can be shared among different processors.

Prior to each link-list update, any memory previously allocated to nonlocal atoms in each processor is returned to the free memory pool. Immediately after each link-list update, memory is allocated for each nonlocal atom received. At this time we receive the atom coordinates as well as the CS-Coords of the nonlocal atoms. For the time steps between two successive list updates, we receive only the atom coordinates of each nonlocal atom. These coordinates are inserted into the memory location allocated earlier, using a lookup-table scheme.

The forces are computed by looping over the pairs list. Since there is no duplication of pairs, Newton's third law is used for both local and nonlocal atoms. Forces computed for the nonlocal atoms are sent back to their owner nodes by another communication step. It involves looping through the sorted nonlocal-atom serial list, buffering the forces to be sent and finally sending them.
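The skeleton below sketches one pass of such pair generation between a link-cell and one of its 13 "forward" neighbor cells, with a coarse integer CS-Coord prune; the exact double precision distance test is left to the force loop. The reduced ATOM view, the pair-list container and all names are assumptions made for illustration, not the paper's code.

```c
/* Illustrative skeleton of pair generation between one link-cell and one
 * of its 13 "forward" neighbor cells. The reduced ATOM view, the pair
 * list container and all names are assumptions. */

typedef struct atom {
    double x[3];
    int    cs[3];            /* integer CS-Coords                    */
    struct atom *cell_next;  /* next atom in the same link-cell list */
} ATOM;

typedef struct {
    ATOM **a1, **a2;         /* parallel arrays holding the pairs */
    int    n, cap;
} PAIR_LIST;

static void add_pair(PAIR_LIST *pl, ATOM *a, ATOM *b)
{
    if (pl->n < pl->cap) {   /* growth of the arrays omitted for brevity */
        pl->a1[pl->n] = a;
        pl->a2[pl->n] = b;
        pl->n++;
    }
}

/* Coarse prune: per-dimension CS-Coord separation compared against the
 * cutoff-plus-skin expressed in CS-Cells (cs_cut). */
static int within_cs_range(const ATOM *a, const ATOM *b, int cs_cut)
{
    int d, diff;

    for (d = 0; d < 3; d++) {
        diff = a->cs[d] - b->cs[d];
        if (diff < 0) diff = -diff;
        if (diff > cs_cut) return 0;
    }
    return 1;
}

static void pair_cell_with_neighbor(ATOM *cell, ATOM *nbr_cell,
                                    int cs_cut, PAIR_LIST *pl)
{
    ATOM *a, *b;

    for (a = cell; a != NULL; a = a->cell_next)
        for (b = nbr_cell; b != NULL; b = b->cell_next)
            if (within_cs_range(a, b, cs_cut))
                add_pair(pl, a, b);
}
```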


Fig. 6. This figure illustrates the communication of the domain boundary atoms, lying in the four demarcated windows, for force computations. We use a two-dimensional system partitioned among four nodes for ease of illustration. Windows acdia and abfga are the West and South edges, while windows abjia and hkfgh are the South-West (SW) and South-East (SE) corners of each domain. Directions are established from the center of each domain. Notation: 1W → 3E means the window on the west edge of node-1 is sent to the east edge of node-3. The window mappings are: 0W → 2E, 0S → 1N, 0SW → 3NE, 0SE → 3NW; 1W → 3E, 1S → 0N, 1SW → 2NE, 1SE → 2NW; 2W → 0E, 2S → 3N, 2SW → 1NE, 2SE → 1NW; 3W → 1E, 3S → 2N, 3SW → 0NE, 3SE → 0NW.

(f) Avoiding the minimum image calculation during force computation
Information about atoms lying near the simulation box partition boundaries needs to be communicated to other processors (the "nonlocal" atoms). The destination processors of these atoms, and hence the boundary regions outside the box where these atoms must be copied, are known unambiguously, i.e. the atom information is sent only to the processor, or processors in the case of simulation-box corner atoms, whose atoms might interact with a given boundary atom.

It is therefore possible to avoid the conditional-statement computation involved in enforcing the minimum image convention, which usually is applied to all pairs. Thus, the coordinates of only the boundary atoms are adjusted, by adding or subtracting lattice vectors of the PBC lattice [4], as the information about these atoms is communicated to the relevant processor. This results in a significant saving for large systems, since only few atoms lie close to the box boundaries, whereas the usual minimum image test gets applied to all relevant pairs in the system. This is related to the suggestions by Rapaport [9] and by Grest et al. [28] for introducing "replica" or "ghost" atoms to improve vectorization of MD codes.


In those schemes, however, an additional cost arises because more pairs need to be evaluated for the extended system than for the actual system. In parallel MD implementations such as ours, there is no escape from the communication and replication of the nonlocal atom information. Thus, it comes at no extra cost to eliminate the test for the minimum image criterion for the atom pairs. To the best of our knowledge this has not been done in other parallel MD implementations reported in the literature. As discussed in the next section, we find this speeds up the computation by about 20% for a system with 10^6 atoms.
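To make the contrast explicit, the one-dimensional sketch below shows the conventional per-pair minimum image test next to the one-time coordinate shift applied to a boundary atom's replica just before it is sent. This is a minimal illustration with assumed names, not the code used in the paper.

```c
/* One-dimensional illustration only; names are assumptions. */

/* Conventional approach: minimum image applied to every pair separation. */
static double min_image_dx(double dx, double box_len)
{
    if (dx >  0.5 * box_len) dx -= box_len;
    if (dx < -0.5 * box_len) dx += box_len;
    return dx;
}

/* Approach used here: when a boundary atom's replica is packed for a
 * neighbor that lies across the periodic boundary, shift its coordinate
 * by one box length once, so the receiver uses plain differences with
 * no per-pair test. */
static double shift_for_send(double x, double box_len, int crosses_high_edge)
{
    return crosses_high_edge ? x - box_len : x + box_len;
}
```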

Table 1
Summary of the system parameters used in the benchmarking experiments; * denotes parameters in reduced units.

Simulation parameter                       Parameter value
Dimension of the system                    3
Periodic boundary conditions               Yes
Number of atoms (N)                        864 to 5 million
Number of processors                       1 to 512
Temperature (T* = T k_B / ε)               0.72
Timestep (Δt* = Δt (ε / m σ²)^0.5)         0.00462
Starting density (ρ* = N σ³ / V)           0.8442
Potential cutoff (r_cut)                   2.5 σ
Buffer for cutoff length (r_skin)          0.3 σ
CS-Grid used for box discretization        16384 × 16384 × 16384
Link-cell update frequency                 Every 20 time steps

4. Benchmark results and discussion

In this section we report benchmark simulations that test the scalability of our code. We also attempt to quantify the communication cost, in order to be able to identify means to reduce it. Following Plimpton [6], we use a system of atoms interacting via the Lennard-Jones potential [4] to benchmark our parallel code. The Lennard-Jones potential between a pair of atoms has the analytic form

Φ(r_ij) = 4ε [ (σ/r_ij)^12 − (σ/r_ij)^6 ].

Here, σ and ε are the length and energy parameters and r_ij is the pair separation distance. The distance derivative of this function yields the force values. The optimization and benchmarking were carried out on the 512-node Intel Paragon at Caltech running the OSF operating system. The systems are initially face-centered cubic and their size varied from 864 to 5 000 000 atoms. They are simulated at the Lennard-Jones state point, a liquid state close to the triple point, defined by a reduced temperature T* of 0.72 and a reduced density ρ* of 0.8442. The best timing is reported for our benchmark simulations. The other parameters used in the simulations are summarized in Table 1. The need for such a benchmark to accurately assess the performance of a parallel code, and to reasonably compare different implementations, has been discussed by Plimpton [6]. We use the velocity-Verlet method for integration [4].
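As an illustration of the force evaluation implied by the pair list and Newton's third law, the sketch below accumulates the Lennard-Jones force for a single pair in reduced units (σ = ε = 1); the data layout and names are our assumptions, not the paper's code.

```c
/* Lennard-Jones pair force in reduced units (sigma = epsilon = 1).
 * The scalar f_over_r = 24 (2 r^-14 - r^-8) multiplies the separation
 * vector; Newton's third law updates both atoms. Names are assumptions. */
static void lj_pair_force(const double xi[3], const double xj[3],
                          double fi[3], double fj[3], double rcut2)
{
    double dx = xi[0] - xj[0];
    double dy = xi[1] - xj[1];
    double dz = xi[2] - xj[2];
    double r2 = dx * dx + dy * dy + dz * dz;

    if (r2 < rcut2) {
        double inv_r2 = 1.0 / r2;
        double inv_r6 = inv_r2 * inv_r2 * inv_r2;
        double f_over_r = 24.0 * inv_r6 * (2.0 * inv_r6 - 1.0) * inv_r2;

        fi[0] += f_over_r * dx;  fj[0] -= f_over_r * dx;
        fi[1] += f_over_r * dy;  fj[1] -= f_over_r * dy;
        fi[2] += f_over_r * dz;  fj[2] -= f_over_r * dz;
    }
}
```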

Fig. 7 shows the CPU time expended per MD time step (τ_s) against the number of atoms (N) in the system; the number of processors (P) is fixed at 64, 128, 256 and 512. The algorithms scale linearly for large N values. In Fig. 8 we plot τ_s versus P, with N varied from 10976 to 1 000 188 atoms. This plot illustrates the good scaling behavior of the algorithms. We see that, for a fixed system size, τ_s decreases almost linearly with the increase in P. As the system size increases, the slopes of these curves decrease towards a minimum value of minus one, the slope for ideal scaling. This is due to the enhanced ratio of computation to communication times in larger systems. In Figs. 7 and 8, there is a deviation from linearity for small N and large P, due to the high ratio of communication costs to computational costs. This deviation is more marked for our plots than for Plimpton's. The average number of atoms per node (N/P) versus the τ_s values for our code is plotted in Fig. 9, for different P values. The curves (nearly) superimpose for N/P values greater than approximately 300. In this region, the τ_s value varies directly with the number of atoms in a processor and is independent of P. Lastly, in Fig. 10 we see that the communication costs in our code exceed the computation costs when there are fewer than 100 atoms per node. The computation and communication timing results of our benchmarks are summarized in Tables 2-5.

Plimpton [6] has demonstrated that his spatial-decomposition algorithm possesses one of the best performances reported. Further, he has also established that his code competes favorably with the fastest vectorized MD code reported in the literature [28], run on a single Cray Y-MP or C-90 processor, for the entire range of problem sizes.


Fig. 7. Total time per MD step (τ_s) as a function of the number of atoms (N). Graphs of our double precision timings are shown by dashed lines and marked "OUR". Single precision timings from running Plimpton's code are shown by continuous lines and marked "SJP". Runs were made on 64, 128, 256 and 512 nodes.

Thus we chose to compare the timing results from our simulations with those of Plimpton, performed under similar conditions. The parallel timings for Plimpton's code on the Intel Paragon are for single precision (32-bit) implementations, while those of our code are for double precision (64-bit) implementations. Plimpton [6] estimates that, on Intel machines, the computation times would increase by 20-30% while the communication time would double if he were to switch from single to double precision.

Overall, depending on the relative computation-to-communication ratio, the expectation is that double precision simulations will be 20-50% slower than single precision ones. For most problems the slowdown is about 20% [29]. A comparison of our timings with Plimpton's must make allowance for this crucial difference, but adjustments for it have not been made in Figs. 7-9. Table 2 shows that, overall, our code is very competitive with Plimpton's. The timings for our code are, in many instances, faster than the 20-50% slowdown Plimpton estimates for double precision codes.


Fig. 8. Total time per MD step (τ_s) versus number of processors (P) for different system sizes (N). Graphs of our double precision timings are shown by dashed lines and marked "OUR". Single precision timings from running Plimpton's code are shown by continuous lines and marked "SJP".

We believe that this is due to eliminating the minimum image tests and computations in the force calculation (as shown in Table 5). Instead, we adjust the coordinates of the nonlocal atoms for the periodic boundary prior to communication. This cost scales with the number of atoms close to the box boundary, while the minimum image computation scales with the number of pairs. Plimpton's code, on the other hand, computes the minimum image distance for each pair.

There is a more rapid deterioration in the performance of our code than of Plimpton's for smaller numbers of atoms on larger numbers of nodes. This may be attributed to the following two factors. First, since our implementation is in double precision, our communication costs should essentially double. However, in many cases shown in Table 3, we find that our communication costs are close to three times Plimpton's values. This is because our code allows for dynamic load balancing; hence the subdomains do not align and a processor has more neighbors (13 in a three-dimensional system).


1000O

N u m b e r of A t o m s Per Node (N/P) Fig. 9. Variation of elapsed time per MD step (Ts) with the number of atoms per node (N/P). Graphs of our double precision timings are shown by dashed lines and marked "OUR". Single precisiontimings from running Plimpton's code are shown by continuouslines and are marked "SJP". subdomains do not align and a processor has more neighbors (13 in a three-dimensional system). Plimpton's code, on the other hand, has simple aligned subdomains and thus only needs six communications in a three-dimensional system. In a two-dimensional system, each node, in our code, sends data to four of the eight neighboring nodes. Table 5 shows the results of simulations with and without applying tests for the m i n i m u m image criterion. It is clear that elimination of the m i n i m u m image criterion test in the distance calculations during

the force computation results in substantial CPU savings. The time per MD step is cut by at least 6.64% (for 10976 atoms on 256 processors) and as much as 17.50% (for a l 000 188 atoms on 32 processors).

5. Conclusions and summary

That massively parallel computers have carved themselves a niche in molecular dynamics simulations is indeed a fait accompli. However, widespread


Table 2
Number of atoms vs. time per MD step (in seconds) for simulations using our code and Plimpton's code (labeled SJP). These simulations were performed on the Intel Paragon computer. The percentages are computed using SJP's times as base values.

#Atoms N   64 Nodes              128 Nodes             256 Nodes             512 Nodes
           OUR    SJP    %Δt     OUR    SJP    %Δt     OUR    SJP    %Δt     OUR    SJP    %Δt
2048       0.021  0.012  72.50   -      -      -       -      -      -       -      -      -
6912       0.040  0.029  37.59   -      -      -       -      -      -       -      -      -
10976      0.052  0.044  17.50   0.035  0.025  40.00   0.029  0.015  92.67   0.026  0.009  185.6
16384      0.072  0.061  17.21   0.047  0.035  33.43   0.036  0.020  82.00   0.031  0.013  140.8
32000      0.124  0.113  9.91    0.075  0.062  20.65   0.052  0.035  49.14   0.042  0.021  98.10
55296      0.191  0.181  5.69    0.111  0.098  13.37   0.072  0.054  33.15   0.055  0.031  77.42
108000     0.353  0.344  2.59    0.195  0.182  6.87    0.114  0.097  17.94   0.079  0.053  49.25
500000     1.488  1.457  1.12    0.772  0.751  2.84    0.413  0.391  5.63    0.248  0.202  22.62
1000188    2.873  2.778  3.42    1.481  1.436  3.11    0.777  0.737  5.41    0.448  0.380  17.82
5038848    -      -      -       -      -      -       3.541  -      -       1.920  1.824  5.25

Table 3
Number of atoms vs. communication time per MD step (in seconds) for simulations using our code and Plimpton's code (labeled SJP). The percentages are computed using SJP's times as base values.

#Atoms N   64 Nodes              128 Nodes             256 Nodes             512 Nodes
           OUR    SJP    %Δt     OUR    SJP    %Δt     OUR    SJP    %Δt     OUR    SJP    %Δt
2048       0.011  0.004  165.0   -      -      -       -      -      -       -      -      -
6912       0.015  0.006  153.3   -      -      -       -      -      -       -      -      -
10976      0.016  0.007  128.6   0.014  0.006  126.7   0.014  0.005  180.0   0.012  0.004  210.0
16384      0.019  0.008  137.5   0.017  0.006  175.0   0.016  0.006  166.7   0.015  0.005  206.0
32000      0.025  0.011  129.1   0.021  0.009  133.3   0.020  0.007  185.7   0.019  0.006  223.3
55296      0.029  0.014  109.3   0.024  0.011  121.8   0.023  0.009  155.6   0.024  0.008  203.8
108000     0.040  0.022  79.55   0.031  0.016  95.00   0.027  0.011  145.5   0.030  0.009  230.0
500000     0.084  0.049  71.43   0.062  0.033  88.18   0.049  0.025  96.00   0.058  0.018  220.0
1000188    0.114  0.070  63.14   0.086  0.054  59.26   0.069  0.035  97.14   0.083  0.026  220.8
5038848    -      -      -       -      -      -       0.164  -      -       0.208  0.063  230.8

Table 4
Number of atoms vs. communication (Comn) and compute (Comp) times per MD step (in seconds) for the Ādhāra MD code.

Lattice          #Atoms N   64 Nodes        128 Nodes       256 Nodes       512 Nodes
                            Comn   Comp     Comn   Comp     Comn   Comp     Comn   Comp
8 × 8 × 8        2048       0.011  0.007    -      -        -      -        -      -
12 × 12 × 12     6912       0.015  0.021    -      -        -      -        -      -
14 × 14 × 14     10976      0.016  0.032    0.014  0.017    0.014  0.009    0.012  0.004
16 × 16 × 16     16384      0.019  0.049    0.017  0.026    0.016  0.014    0.015  0.007
20 × 20 × 20     32000      0.025  0.094    0.021  0.049    0.020  0.026    0.019  0.013
24 × 24 × 24     55296      0.029  0.156    0.024  0.081    0.023  0.042    0.024  0.021
30 × 30 × 30     108000     0.040  0.306    0.031  0.156    0.027  0.080    0.030  0.040
50 × 50 × 50     500000     0.084  1.384    0.062  0.697    0.049  0.352    0.058  0.177
63 × 63 × 63     1000188    0.114  2.718    0.086  1.364    0.069  0.685    0.083  0.344
108 × 108 × 108  5038848    -      -        -      -        0.174  3.351    0.208  1.680



Fig. 10. Graph of computation and communication times per MD step versus number of atoms per node (N/P) for the Ādhāra MD code.

Table 5
Number of atoms vs. time per MD step (in seconds) for our code, comparing simulations that use the minimum image criterion test for all relevant pairs in the force computation with those that do not. The former is denoted by MI and the latter by NMI. The percentages are computed using the NMI times as base values.

#Atoms N   32 Nodes              64 Nodes              128 Nodes             256 Nodes
           MI     NMI    %Δt     MI     NMI    %Δt     MI     NMI    %Δt     MI     NMI    %Δt
10976      0.102  0.088  15.72   0.056  0.049  14.43   0.037  0.034  8.71    0.032  0.027  6.64
108000     0.828  0.716  15.64   0.429  0.373  15.01   0.236  0.207  14.01   0.132  0.118  11.86
500000     3.588  3.074  16.72   1.834  1.577  16.30   0.950  0.823  15.43   0.500  0.437  14.42
1000188    6.883  5.858  17.50   3.498  2.977  17.50   1.790  1.531  16.92   0.930  0.799  16.40


acceptance of these machines has been limited by difficulties in coding, optimizing and balancing the computations. Ādhāra, a specialized runtime system for dynamic space-based simulations, alleviates many of these problems. In this work we described a short-range parallel molecular dynamics program based on Ādhāra. We have used a link-list based domain decomposition method to exploit the spatial locality of the forces and to balance the computation across the nodes. Because the basic design is based on a hierarchy of grids, it is possible to dynamically resize the domains to maintain load balance. Portability is ensured by confining machine-specific calls to the message-passing library functions in a separate module. Any other machine-specific constructs and other optimizations are avoided. Benchmarking experiments were performed on an Intel Paragon parallel computer with up to 512 processors. The results revealed that our code exhibits optimal O(N/P) scaling for large problems and is very competitive with the well-known code of Plimpton. Further, our code is more general and incorporates an efficient dynamic load-balancing scheme to sustain good performance during the entire duration of the simulation. Lastly, the results also show that the avoidance of the minimum image test during the force computations results in savings that are substantial for large systems.

Acknowledgements

We would like to express our thanks to Steve Plimpton of Sandia for his kind help and constant encouragement, and for sending us a copy of his molecular dynamics code for comparative purposes. Further, we are indebted to him for reading our manuscript and for his meaningful suggestions. The code development was done on the 16-node Paragon computer donated by the Intel Corporation to the University of Washington; we thank Jan Sanislo for his help during this period. We also thank Sharon Brunnet of the Caltech Concurrent Supercomputing Consortium for her help in porting the code onto the 512-node Intel Paragon at Caltech. The optimization and benchmarking were made possible with computing time from Battelle Pacific Northwest National Laboratory (PNNL), Richland, WA. S.G.S. wishes to thank L. Rene Corrales, R.J. Harrison, Matt Rosing,

J. Nieplocha, R.A. Kendall, J.A. Nichols and T.P. Straatsma of PNNL for making his 1995 summer internship at PNNL very pleasant and educational, and for sharing their collective experiences on the idiosyncrasies of parallel computing and molecular dynamics methods. This work was funded in part by the Kyocera Corporation (G.K.), the Washington Technology Center (J.Z.) and NSF grants CHE-9217774 (H.J.), CCR-9123308 (J.Z.) and CCR-9200832 (J.Z.).

References

[1] G.C. Fox, M.A. Johnson, G.A. Lyzenga, S.W. Otto, J.K. Salmon and D.W. Walker, Solving Problems on Concurrent Processors, Vol. I (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[2] DAEDALUS, A new era in computation, J. Am. Acad. of Arts and Sciences 121 (1992).
[3] D.J. Wallace, Philos. Trans. R. Soc. Lond. A 326 (1988) 481.
[4] M.P. Allen and D.J. Tildesley, Computer Simulation of Liquids (Clarendon, Oxford, 1987).
[5] G. Ciccotti and W.G. Hoover, Eds., Molecular Dynamics Simulation of Statistical Mechanical Systems (North-Holland, Amsterdam, 1986).
[6] S. Plimpton, J. Comput. Phys. 117 (1995) 1.
[7] B.M. Boghosian, Comput. in Phys. 4 (1990) 14.
[8] A.I. Melcuk, R.C. Giles and H. Gould, Comput. in Phys. 4 (1991) 311.
[9] D.C. Rapaport, Comput. Phys. Rep. 9 (1988) 1.
[10] D. Fincham, Mol. Simul. 1 (1987) 1.
[11] D.C. Rapaport, in: Computer Modeling of Fluids, Polymers and Solids (Kluwer Academic Publishers, Dordrecht, 1990) p. 249.
[12] A.R.C. Raine, D. Fincham and W. Smith, Comput. Phys. Commun. 55 (1989) 13.
[13] W. Smith, Comput. Phys. Commun. 62 (1991) 229.
[14] S.Y. Liem, D. Brown and J.H.R. Clarke, Comput. Phys. Commun. 67 (1991) 261.
[15] D.C. Rapaport, Comput. Phys. Commun. 62 (1991) 217.
[16] M.R.S. Pinches, D.J. Tildesley and W. Smith, Mol. Simul. 6 (1991) 51.
[17] S. Gupta, Comput. Phys. Commun. 70 (1992) 243.
[18] D. Brown, J.H.R. Clarke, M. Okuda and T. Yamazaki, Comput. Phys. Commun. 74 (1993) 67.
[19] K. Esselink, B. Smit and P.A.J. Hilbers, J. Comput. Phys. 106 (1993) 101.
[20] D.M. Beazley and P.S. Lomdahl, Parallel Computing 20 (1994) 173.
[21] I. Ashok and J. Zahorjan, Proc. of the Scalable High-Performance Computing Conf.-94 (IEEE Comput. Soc. Press, Los Alamitos, CA, 1994) p. 168.
[22] I. Ashok, Runtime support for dynamic space-based applications on distributed memory multiprocessors, Ph.D. Thesis, University of Washington (1995).


[23] M.S. Daw and M.I. Baskes, Phys. Rev. B 29 (1984) 6443.
[24] R.W. Hockney and J.W. Eastwood, Computer Simulation Using Particles (Adam Hilger, New York, 1988).
[25] L. Verlet, Phys. Rev. 159 (1967) 98.
[26] M. Parrinello and A. Rahman, J. Appl. Phys. 52 (1981) 7182.


[27] S.G. Srinivasan, I. Ashok, H. Jónsson, G. Kalonji and J. Zahorjan, Comput. Phys. Commun. 102 (1997) 44, next paper.
[28] G.S. Grest, B. Dünweg and K. Kremer, Comput. Phys. Commun. 55 (1989) 269.
[29] S. Plimpton, personal communication (1996).
