Ab initio quantum chemistry on a ccNUMA architecture using openMP. III




Parallel Computing 26 (2000) 843-856

www.elsevier.com/locate/parco

Ab initio quantum chemistry on a ccNUMA architecture using openMP. III

C.P. Sosa a,*, G. Scalmani a, R. Gomperts b, M.J. Frisch c

a Silicon Graphics Computer Systems, 655 E. Lone Oak Dr., Eagan, MN 55121, USA
b Silicon Graphics Computer Systems, 1 Cabot Rd., Hudson, MA 01749, USA
c Lorentzian, 140 Washington Avenue, North Haven, CT 06473, USA

Received 1 August 1999; accepted 1 October 1999

* Corresponding author. E-mail address: [email protected] (C.P. Sosa).

0167-8191/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0167-8191(00)00015-6

Abstract

In this study we report the implementation of Gaussian 98 using OpenMP. OpenMP is a standard for parallel programming on shared memory computers. We compare the performance of the OpenMP implementation for methods such as Hartree-Fock (HF) and density functional theory (DFT), including first and second derivatives of the energy. In addition, we also look at CI-singles (CIS). Performance was investigated with up to 32 processors and compared against the standard version of Gaussian 98. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Gaussian; ccNUMA; Performance; OpenMP; Origin2000

1. Introduction

Developers of chemistry applications have long realized that parallel computing benefits not only from scalable hardware but also from scalable software. Scalable hardware comes in a variety of architectures [1]. However, the way the address space (memory) is organized has led to two basic approaches to parallelizing applications [2]: distributed memory, in which each processor has its own local memory, and memory that is shared and addressable by all processors (shared memory). Gaussian 98 has been parallelized taking both approaches into consideration. In previous studies we have reported the implementation and performance of Gaussian on a cluster of UNIX workstations [3], vector supercomputers [4], and massively parallel machines [5].


In particular, we have shown that self-consistent field (SCF) and density functional theory (DFT) calculations, with and without first and second derivatives, can run efficiently on massively parallel machines with up to 32 and 64 processors [5]. Of course, scalability depends on the size of the problem (three major parameters determine the problem size: the number of atoms, the theoretical method, and the basis set size). Similar results were observed for the CI-singles (CIS) approximation. In this work we look at the implementation and performance of Gaussian using OpenMP on a shared memory machine, the cache coherent non-uniform memory access (ccNUMA) Origin2000 [6]. This implementation is, if not the first, certainly one of the first major chemistry applications to use OpenMP.

This paper focuses on the implementation of OpenMP in Gaussian. However, several other groups have devoted considerable effort to the parallelization of quantum mechanical codes. Harrison and Shepard [7,8] have published an extensive review of ab initio molecular electronic structure on parallel computers. They not only look at parallel computers and parallel programming models but also examine different parallel algorithms commonly used in ab initio codes. Harrison [9] has also reviewed parallel programming models in chemistry and has identified NUMA as a key concept in parallel computing. In his paper he points out the natural mapping between NUMA and message-passing programming models [10]. In this work we concur on the importance of NUMA and, in addition, we show that for a ccNUMA architecture such as the Origin2000 [11], standardization of shared memory programming models is important as well. Applications parallelized using OpenMP are portable and very easy to use. Recently, Dagum and Menon [12] published a paper comparing OpenMP with other parallel programming models. They concluded that, in addition to being supported by a number of hardware and software vendors, OpenMP provides a standard environment and a powerful and easy way to achieve scalability [13].

2. Design features of a ccNUMA machine

Multiprocessor shared memory machines have had the advantage that fine grain parallelism can easily be achieved with compiler directives at the DO LOOP level. This is one of the most desirable features of, for example, vector supercomputers, where each processor has homogeneous access to all the memory in the machine. On this type of machine, parallelism is achieved by adding proprietary directives to distribute DO LOOP constructs among all the processors [14]. In newer architectures such as cache coherent non-uniform memory access (ccNUMA) machines, although the memory is physically distributed, all memory is presented in a unified, global address space [11]. In other words, as far as applications are concerned, it is a shared memory machine. Each node has its own local memory and access to all remote memory. This is an important point for the programmer because, although memory is shared, access is nonuniform (different latencies).


Fortunately, the operating system (OS) has been enhanced to take advantage of the NUMA architecture; that is, the OS attempts to exploit memory locality as much as possible. The management of memory locality is done through low-level system calls. However, application programmers can fine-tune their codes via compiler directives or high-level command tools. This means that parallelism can be accomplished simply by inserting directives.

A key feature of the Origin is its cache coherence (cc). Since this feature is implemented in hardware, it is sufficient to mention that on the Origin cache coherence is not achieved via a bus, as in the Power Challenge, but instead by using a cache directory. From a programmer's perspective, it is desirable to use the cache properly to obtain optimal performance.

In this study we used a Silicon Graphics Origin2000, which is an example of a ccNUMA architecture. The Origin2000 is equipped with MIPS R10000 microprocessors. In addition to its local memory, the R10000 uses a two-level cache hierarchy, one cache internal to the processor (L1) and one external (L2). The L1 cache is 32 KB and the L2 cache is 4 MB in size [6]. The Origin is a modular system; it can be expanded from 1 to 256 processors within a single system image. The modular unit is called the node. Each node contains two processors, memory, and a hub (a custom circuit that directs the flow of data between processors, memory, and I/O). The hub also determines whether memory requests are local or remote (based on the physical address).

All the calculations were carried out using a modified version of Gaussian 98 Rev. A.6 [15] on two different Origin2000 systems. The first system consisted of 64 processors running at 195 MHz with a total of 16384 MB of memory. The second Origin2000 had 128 processors running at 250 MHz and 36608 MB of memory.

3. Gaussian parallelization using openMP

Gaussian [15], a connected series of programs, can be used to perform a variety of semi-empirical, ab initio, and density functional theory calculations; more recently, molecules can be partitioned into layers, and each layer can be computed at a different level of theory (including molecular mechanics). This new functionality is commonly referred to as the ONIOM method [16]. The individual programs communicate with each other through disk files. The individual programs are referred to as links, and links are grouped into overlays [17]. In general, overlay zero is responsible for starting the program, including reading the input files. Once the route card is read, the proper set of overlays/options/links is selected for a particular run. Overlay 99 (L9999) terminates the run; in most cases L9999 produces a summary of the calculation (archive entry). The Gaussian architecture on shared-addressable or distributed memory machines is basically the same, that is, each link is responsible for continuing the sequence of links by invoking the exec() system call to run the next link.

Prior to looking at OpenMP within Gaussian, we need to describe OpenMP [12]. In the past, parallelizing an application on a shared memory machine involved adding directives to the code; in most cases these directives were not portable.


As previously mentioned, on vector machines the effort to parallelize DO LOOP constructs was initially introduced with Macrotasking, which in reality was an early attempt at coarse grain parallelism via library calls for synchronization. This, of course, meant some restructuring of the code. Microtasking evolved from Macrotasking and became a friendlier way to parallelize applications; it involved adding proprietary directives at the DO LOOP level. Finally, Autotasking was capable of recognizing certain DO LOOP constructs, and compiler directives were added automatically, still as proprietary directives. A similar story may be told about other vendors with proprietary compiler directives or newer incarnations of this type of directive. In this programming model the bulk of the work consists of identifying which variables are local (or private) and which variables are shared; this is commonly known as scoping variables. In contrast, message passing is a more difficult parallel programming model: it requires data to be partitioned, and messages have to be explicitly sent and received to make use of the data. A considerable amount of explicit synchronization is involved. On the other hand, OpenMP not only takes advantage of shared memory architectures but also represents a portable alternative [18] built upon the experience and maturity of proprietary compiler directives. OpenMP does not require explicit data distribution. Therefore, there is an incremental path to parallelizing an application.

OpenMP may be defined as a set of compiler directives and callable runtime libraries for shared memory computers. In comparison to the X3H5 parallel programming model [19], it provides coarse grain parallelism. The language extensions or directives can be classified into three categories: (i) control structure, (ii) data environment, and (iii) synchronization. As previously pointed out, OpenMP also provides callable libraries. An application parallelized based on this model starts execution as a single process, or master thread of execution. The master thread continues executing the program until it encounters a parallel region, a block of code that needs to be executed in parallel. To start a parallel region OpenMP uses the PARALLEL and END PARALLEL directives. The PARALLEL directive creates a team of threads in addition to the master thread. The block of code included within this directive is executed in parallel, including subroutine calls. At the end of the parallel construct all the threads synchronize, and the master thread is the only one that continues with the next section of sequential code.
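To make this fork/join structure concrete, the following fixed-form Fortran fragment is a minimal sketch (it is not taken from Gaussian; the program name and messages are purely illustrative):

      PROGRAM REGION
C     Minimal sketch of an OpenMP parallel region in fixed-form
C     Fortran; the program name and messages are illustrative only.
      IMPLICIT NONE
      INTEGER OMP_GET_THREAD_NUM
      EXTERNAL OMP_GET_THREAD_NUM
      INTEGER ID

C     Serial section: executed by the master thread only.
      WRITE(*,*) 'master thread starting'

C$OMP PARALLEL PRIVATE(ID)
C     Every member of the team (master included) executes this block,
C     including any subroutine calls made from within it.
      ID = OMP_GET_THREAD_NUM()
      WRITE(*,*) 'hello from thread', ID
C$OMP END PARALLEL

C     Implicit barrier at END PARALLEL: only the master continues.
      WRITE(*,*) 'master thread resuming serial execution'
      END

The C$OMP sentinel is the fixed-form spelling of the OpenMP directive prefix; in free-form Fortran it would be written as !$OMP.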


Now, within the Gaussian context, in an SCF scheme the two-electron integrals enter the Fock matrix,

F_{\mu\nu} = H_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma}\left[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma)\right],    (1)

where H_{\mu\nu} represents the core Hamiltonian, \mu, \nu, \lambda and \sigma are atomic orbital (AO) indices, and the quantities (\mu\nu|\lambda\sigma) are two-electron repulsion integrals. In Gaussian, these quantities are either computed once and stored, or recomputed as many times as needed, depending on the memory available and the algorithm chosen. In previous papers we have provided a detailed description of the parallelization of the two-electron repulsion integrals within the PRISM scheme [20]. In addition, we have also compared our approach with that of other authors [5]. In this study, we just briefly highlight the steps that are relevant to the use of OpenMP.

Parallelization of the Fock matrix involves distributing batches of integrals among all the available processors (this is carried out in a routine called PRSMsu). The parallel structure of the Fock matrix formation for Hartree-Fock (HF) and DFT, as previously presented, is shown in Fig. 1 and in [5]. At the end of the first loop over N_processors, all contributions to the Fock matrix have been computed; the last DO loop adds all the contributions together in a serial block of code. Fig. 2 shows how the outer loop can easily be parallelized with one OpenMP directive. In the case of HF, prior to calling PRISM, the PRSMsu routine distributes all the work that is passed to PRISM to compute all the two-electron integrals. Fig. 2 corresponds to a small section of PRSMsu. The execution of this routine begins serially: it initially computes the lengths of some of the required arrays and checks that the number of processors requested in the input is consistent with the amount of memory available for the calculation. When the PARALLEL DO is encountered, a team of threads is created along with the data environment for each team member; in this case the data consist of all the variables defined as private. This is carried out in DO LOOP 200. The PARALLEL DO directive provides a simplified way to specify a parallel region that contains a single DO LOOP; the iterations of the DO loop immediately following the directive are executed in parallel. Of the clauses following the OpenMP directive, Default(shared) declares that the variables in the lexical extent of the parallel region, including common block variables (THREADPRIVATE variables excluded), are shared. The clause Schedule(static,1) forces the iterations to be divided among threads one by one in a round-robin fashion. At the end of this DO LOOP, all the contributions are summed up, as previously pointed out. This simple directive to parallelize this DO LOOP has allowed us (for comparison only) to do a one to one mapping between proprietary SGI directives (DoAcross) and OpenMP.

Fig. 1. Parallelization of PRISM.


Fig. 2. OpenMP directive to distribute the computation of two-electron integrals.
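Since the figures themselves are not reproduced here, the following fixed-form Fortran sketch only illustrates the kind of construct Fig. 2 describes: a PARALLEL DO with Default(shared) and Schedule(static,1) over a loop indexed by processor, followed by the serial summation of the partial Fock matrices (cf. Fig. 1). It is not the actual PRSMsu code; NPROC, NBAS, FPART and the placeholder loop body are hypothetical.

      PROGRAM FOCKPAR
C     Sketch (not the actual Gaussian source) of the PRSMsu-style
C     outer loop of Fig. 2: iterations over processor batches are
C     distributed round-robin, and the per-thread partial Fock
C     matrices are then summed in a serial block (cf. Fig. 1).
C     NPROC, NBAS, FPART and the placeholder loop body are made up.
      IMPLICIT NONE
      INTEGER NPROC, NBAS
      PARAMETER (NPROC = 4, NBAS = 10)
      DOUBLE PRECISION F(NBAS*NBAS), FPART(NBAS*NBAS,NPROC)
      INTEGER IPROC, I

      DO 100 I = 1, NBAS*NBAS
         F(I) = 0.0D0
  100 CONTINUE

C$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(IPROC,I)
C$OMP&            SCHEDULE(STATIC,1)
      DO 200 IPROC = 1, NPROC
C        In Gaussian this iteration would hand one batch of
C        two-electron integrals to PRISM; a trivial fill-in
C        stands in for that work here.
         DO 210 I = 1, NBAS*NBAS
            FPART(I,IPROC) = DBLE(IPROC)*1.0D-6
  210    CONTINUE
  200 CONTINUE
C$OMP END PARALLEL DO

C     Serial block: add all partial contributions into F.
      DO 300 IPROC = 1, NPROC
         DO 310 I = 1, NBAS*NBAS
            F(I) = F(I) + FPART(I,IPROC)
  310    CONTINUE
  300 CONTINUE
      WRITE(*,*) 'F(1) =', F(1)
      END

With Schedule(static,1), iteration IPROC is assigned to thread MOD(IPROC-1, nthreads), i.e. the round-robin distribution described in the text.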

This has provided us with a direct comparison of the performance of proprietary directives, OpenMP directives, and the standard version of Gaussian based on fork/join parallelism.

4. Performance

We have measured the performance of the OpenMP version by comparing its speedup and efficiency against the standard version. Speedup (S) is defined as the ratio of the serial run time (elapsed time, t_s) to the time it takes to solve the same problem in parallel (elapsed time, t_p):

S = \frac{t_s}{t_p} .    (2)

Efficiency (e) is the fraction of time that a processor spends doing useful work; maintaining it as the number of processors increases requires a correspondingly larger percentage of parallel code. Efficiency tends to be a more realistic measure when comparing different runs with the same number of processors:

e = \frac{S}{N_{\mathrm{PEs}}} .    (3)
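As a worked illustration (using the OpenMP α-pinene L502 elapsed times for the 195 MHz system reported in Table 1 below), the 8-processor run gives

S = \frac{4701\ \mathrm{s}}{759\ \mathrm{s}} \approx 6.2 , \qquad e = \frac{6.2}{8} \approx 0.8 ,

in agreement with the tabulated values.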

In addition, rather than comparing our results to an ideal speedup, we have computed (using Amdahl's law) an extrapolated speedup [6]. This extrapolated speedup is a function of p, the percentage of parallel code:

S = \frac{1}{(p/N_{\mathrm{processors}}) + (1 - p)} .    (4)

The fraction p of the code was obtained as indicated in [6]: measuring two different speedups provides a formula for p,

p = \frac{S_{N_{\mathrm{processors}}} - S_{M_{\mathrm{processors}}}}{(1 - 1/N_{\mathrm{processors}})\, S_{N_{\mathrm{processors}}} - (1 - 1/M_{\mathrm{processors}})\, S_{M_{\mathrm{processors}}}} ,    (5)

where N_{processors} and M_{processors} are two different numbers of processors whose speedups have previously been computed.
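A minimal Fortran sketch of Eqs. (4) and (5) follows; it is not part of Gaussian, and the program and variable names (AMDAHL, SN, SM, SEXT) are ours. The sample inputs are the OpenMP α-pinene L502 speedups from Table 1 (S = 6.2 on 8 processors and S = 9.5 on 16 processors at 195 MHz); the resulting p ≈ 0.95 is close to the 96% parallel fraction used for the extrapolated curves in Figs. 3 and 4.

      PROGRAM AMDAHL
C     Sketch of Eqs. (4) and (5): parallel fraction p from two
C     measured speedups, and the extrapolated speedup it implies.
C     Sample inputs are the OpenMP alpha-pinene L502 speedups of
C     Table 1 (6.2 on 8 PEs, 9.5 on 16 PEs, 195 MHz system).
      IMPLICIT NONE
      DOUBLE PRECISION SN, SM, P, SEXT
      INTEGER N, M, NPE

      N  = 16
      M  = 8
      SN = 9.5D0
      SM = 6.2D0

C     Eq. (5): parallel fraction from the two measured speedups.
      P = (SN - SM) /
     &    ((1.0D0 - 1.0D0/DBLE(N))*SN - (1.0D0 - 1.0D0/DBLE(M))*SM)
      WRITE(*,*) 'parallel fraction p =', P

C     Eq. (4): extrapolated speedup on NPE processors.
      NPE  = 32
      SEXT = 1.0D0/(P/DBLE(NPE) + (1.0D0 - P))
      WRITE(*,*) 'extrapolated speedup on', NPE, 'PEs =', SEXT
      END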

We have chosen to use some of the systems presented in our previous set of benchmarks [5], that is, single point energy (SP), FORCE (the time required for a geometry optimization is a multiple of the time needed for a FORCE calculation), and frequency calculations. Without discussing each theoretical method in detail (see, for example, [21]), we have chosen HF [21], the three-parameter density functional method due to Becke [22,23] (B3-LYP), and CIS energies and gradients [24].

The following test cases were used throughout this study. The first case corresponds to α-pinene (C10H16); the computations for this case consist of single point calculations at the Hartree-Fock (HF) level with the 6-311G(df,p) basis set [23]. We have also performed single point energy calculations at the HF level with the 3-21G basis set [23] on taxol (C47H51NO14). We carried out a frequency calculation on α-pinene (C10H16) at the B3-LYP [22] level with the 6-31G(d) basis set. The last test case is a CIS excited-state FORCE calculation with the 6-31++G basis set [23] on acetyl-phenol. These molecules were chosen as small to intermediate size molecules to test the speedup and efficiency of our OpenMP implementation. In particular, the frequency and CIS calculations exercise most of the links that are parallelized.

Tables 1 and 2 summarize the performance of the single point energy calculations on α-pinene and taxol. As we pointed out in a previous study [5], in these cases 99.5% of the time is spent in the SCF iterative scheme (L502). Moreover, the percentage of parallel code computed according to Eq. (5) indicates that this section of the program is more than 90% parallelized. In other words, the fact that the diagonalization of the Fock matrix is carried out serially does not play a significant role for a moderate number of processors, at least not for the set of systems tested in this study. The same is true for the links that mainly do setup (L1, L101, L202, and L301).

Although scalability is an important issue, our main concern is the performance of the OpenMP implementation compared to the version based on SGI proprietary directives (from now on referred to as the DoAcross version) and the standard version based on an explicit fork/join parallel model (referred to as the standard version). Tables 1 and 2 illustrate the performance of these three versions of the program on an Origin2000 with 195 MHz microprocessors. For comparison, they also show timings with a 250 MHz microprocessor for the OpenMP version only.

In both examples we see that the performance of the OpenMP version is comparable to that of the DoAcross and standard versions. Up to 8 processors the efficiency is larger than 80% within these three parallel models.

Table 1
Hartree-Fock single-point energy calculation on α-pinene (C10H16)^a

Number of PE   L502^b               S                e                Total^c              S                e

195 MHz
1              4701 (4696) [4700]   1.0 (1.0) [1.0]  1.0 (1.0) [1.0]  4717 (4710) [4715]   1.0 (1.0) [1.0]  1.0 (1.0) [1.0]
2              2444 (2431) [2438]   1.9 (1.9) [1.9]  1.0 (1.0) [1.0]  2462 (2449) [2457]   1.9 (1.9) [1.9]  1.0 (1.0) [1.0]
4              1317 (1317) [1310]   3.6 (3.6) [3.6]  0.9 (0.9) [0.9]  1338 (1339) [1332]   3.5 (3.5) [3.5]  0.9 (0.9) [0.9]
8              759 (758) [753]      6.2 (6.2) [6.2]  0.8 (0.8) [0.8]  785 (787) [782]      6.0 (6.0) [6.0]  0.8 (0.8) [0.8]
16             495 (494) [492]      9.5 (9.5) [9.6]  0.6 (0.6) [0.6]  538 (540) [542]      8.8 (8.7) [8.7]  0.6 (0.5) [0.5]

250 MHz
1              3439                 1.0              1.0              3451                 1.0              1.0
2              1696                 2.0              1.0              1711                 2.0              1.0
4              922                  3.7              1.0              939                  3.7              1.0
8              530                  6.5              1.0              554                  6.2              0.9
16             346                  9.9              0.9              380                  13.5             0.8

^a Total of 346 basis functions; basis sets are 6-311G(df,p).
^b All timings are in seconds and correspond to elapsed time.
^c Total time to complete the run (elapsed time); numbers in parentheses correspond to the DoAcross version; numbers in square brackets correspond to the standard version.
Furthermore, Figs. 3 and 4 show a comparison between the ideal speedup, the extrapolated speedup from Amdahl's law, and the computed speedup based on elapsed timings for α-pinene and taxol, respectively. The extrapolated speedups were computed for p = 96%. We see that the computed speedups are consistent with the extrapolated speedups all the way up to 16 processors. In both cases the efficiency is larger than 60%. Taxol shows a pattern similar to α-pinene (see Table 2 and Fig. 4): up to 8 processors the speedup follows the 96% parallel curve from Amdahl's law.

Table 1 also illustrates the difference in performance between the 195 MHz and 250 MHz microprocessors. In the single-processor case the improvement in going from 195 to 250 MHz is a factor of approximately 1.40. Although the size of the secondary cache on both machines and the binaries used were the same, the clock speed of the secondary cache on the 250 MHz machine is higher than on the 195 MHz machine. Improvements in clock speed for the cache and the CPU are responsible for the speedup. It is interesting to note that, contrary to what one might expect, the 250 MHz machine shows slightly better scalability when using 2 or more processors.

Table 2
Hartree-Fock single-point energy calculation on taxol (C47H51NO14)^a

Number of PE   L502^b               S                e                Total^c              S                e
1              5849 [5779] (5814)   1.0 [1.0] (1.0)  1.0 [1.0] (1.0)  5935 [5866] (5901)   1.0 [1.0] (1.0)  1.0 [1.0] (1.0)
2              3043 [3047] (3174)   1.9 [1.9] (1.8)  1.0 [1.0] (0.9)  3124 [3125] (3289)   1.9 [1.9] (1.8)  1.0 [1.0] (0.9)
4              1672 [1672] (1692)   3.5 [3.5] (3.4)  0.9 [0.9] (0.9)  1762 [1757] (1788)   3.4 [3.3] (3.3)  0.8 [0.8] (0.8)
8              984 [994] (991)      5.9 [5.8] (5.9)  0.7 [0.7] (0.7)  1077 [1083] (1090)   5.5 [5.4] (5.4)  0.7 [0.7] (0.7)

^a Total of 660 basis functions.
^b All timings are in seconds and correspond to elapsed time; basis sets are 3-21G.
^c Total time to complete the run (elapsed time); numbers in parentheses correspond to the DoAcross version; numbers in square brackets correspond to the standard version.
Fig. 3. L502 comparison between ideal speedup (black bars), extrapolated speedup (white bars), and speedup computed using elapsed time (light gray) for α-pinene.

This difference may be attributed to the difference in memory size between the two machines, the total number of processors, and differences in the OS as well. In general, Amdahl's law assumes that all processors are equal. This is not exactly true on a ccNUMA machine: some processors are farther removed from where the data are located and thus may incur a greater read latency.


Fig. 4. L502 comparison between ideal speedup (black bars), extrapolated speedup (white bars), and speedup computed using elapsed time (light gray) for taxol.

Read latency is the time, in nanoseconds, needed to access the first word of a cache line read from memory. For local memory the read latency is 313 ns. For a hub-to-hub direct connection the maximum latency is 497 ns. For larger configurations, the maximum latency grows by approximately 100 ns for each router [6]. To put this in perspective, the total read latency to access far-away memory locations is still of the same order of magnitude as that of the current generation of symmetric multiprocessor systems (1 μs).

A more extensive test of the OpenMP implementation may be seen in Table 3. This table corresponds to a frequency calculation; frequency calculations exercise more links, since different terms of the first and second derivatives of the energy are computed in different links. The links that are parallelized in this type of calculation are L502, L703, L1002, and L1110. L502 has been discussed in the previous section and will not be considered here. L703 computes the first and second derivatives of the two-electron integrals. L1002 solves the CPHF equations to produce the derivatives of the molecular orbital coefficients. Finally, L1110 computes the two-electron contribution to the derivative of the Fock matrix with respect to the nuclear coordinates [25]. Since the parallelization is carried out in PRSMsu, different types of integrals are handed to PRISM, including first and second derivatives; thus, most of the time-consuming terms are parallelized via PRSMsu/PRISM.

If we look at the OpenMP version exclusively, we see that the scalability is very good up to 16 processors; even when we double the number of processors (to 32) the efficiency is still 50%. This is consistent with the other two versions, although, as the number of processors increases (>16), we observe that the scalability of the OpenMP version is slightly better than that of the DoAcross and standard versions. It comes as no surprise that the solution of the CPHF equations does not show the same type of scaling as L1110 or L703.

Table 3
B3-LYP frequency calculation on α-pinene^a

PEs^b   L1110                  L1002                  L703                   Total^c                S                   e
1       10445 (10431) [10448]  10313 (10230) [10227]  11553 (11531) [11529]  34648 (34520) [34534]  1.0 (1.0) [1.0]     1.0 (1.0) [1.0]
2       5073 (5085) [5056]     6045 (6038) [6290]     5970 (5958) [5966]     18332 (18325) [18584]  1.9 (1.9) [1.9]     1.0 (1.0) [1.0]
4       2497 (2498) [2559]     2906 (2912) [2970]     2996 (2988) [3045]     9094 (9091) [9281]     3.8 (3.8) [3.7]     1.0 (1.0) [0.9]
8       1316 (1317) [1321]     1953 (1914) [1978]     1612 (1608) [1606]     5324 (5286) [5367]     6.5 (6.5) [6.4]     0.8 (0.8) [0.8]
16      693 (698) [688]        1339 (1304) [1338]     951 (941) [937]        3331 (3307) [3357]     10.4 (10.4) [10.3]  0.7 (0.7) [0.6]
32      448 (426) [428]        1030 (1063) [1235]     578 (641) [581]        2409 (2513) [2742]     14.4 (13.7) [12.6]  0.5 (0.4) [0.4]

^a Total of 182 basis functions; the basis sets correspond to 6-31G(d).
^b All timings are in seconds and correspond to elapsed time; all timings are for the OpenMP version.
^c Total time to complete the run (elapsed time).
In L1002 both the integral evaluation and the contraction of the integral derivatives are parallelized, and the latter is the dominant (only formally N^5) step. Better load balance has been implemented in the next release. On the other hand, the OpenMP version shows slightly better scalability for L1110 and L703 than the other two versions. Clearly the OpenMP version is competitive with, if not slightly better than, the other two versions, especially as the number of processors increases.

Finally, we carried out a single point energy calculation with the FORCE keyword at the CIS level of theory. CIS represents an inexpensive way of obtaining a first approximation to excited states. In Gaussian, L914 computes excited states using CI with single excitations (Table 4). It diagonalizes a matrix with elements of the type \langle \psi_i^a | H | \psi_{\mathrm{HF}} \rangle, where \psi_{\mathrm{HF}} corresponds to the HF wavefunction and \psi_i^a is a singly excited wavefunction [24]. The total CIS energy is obtained as an eigenvalue problem. The two-electron integrals contributing to the CIS energy can be computed by means of the PRISM algorithm. As in the previous cases, the first derivatives of the CIS energy show a good speedup up to 16 processors. We should also point out that this particular calculation is also dominated by the contraction of integrals with density matrices, similar to CPHF (although with fewer matrices). Both CPHF and CIS are not very sensitive to integral evaluation speed or parallelism, but rather to the efficiency of the "digestion" of the integrals and to load balance.
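As a general illustration of the load-balance point (and not necessarily what was done in Gaussian), OpenMP also allows the iteration-to-thread assignment to be made at run time. In the following sketch, batches of deliberately uneven synthetic cost are distributed with Schedule(dynamic), so that an idle thread simply picks up the next remaining batch; all names and the artificial work loop are ours.

      PROGRAM LOADBAL
C     General illustration of schedule choice and load balance; the
C     uneven per-iteration "work" is synthetic and unrelated to the
C     actual integral batches handled by Gaussian.
      IMPLICIT NONE
      INTEGER NB
      PARAMETER (NB = 64)
      DOUBLE PRECISION COST(NB), TOTAL
      INTEGER I, J, NITER

      TOTAL = 0.0D0
      DO 100 I = 1, NB
         COST(I) = 0.0D0
  100 CONTINUE

C$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,J,NITER)
C$OMP&            SCHEDULE(DYNAMIC) REDUCTION(+:TOTAL)
      DO 200 I = 1, NB
C        Early batches are made much more expensive than late ones;
C        dynamic scheduling hands the next remaining batch to
C        whichever thread becomes idle first.
         NITER = 1000*(NB - I + 1)
         DO 210 J = 1, NITER
            COST(I) = COST(I) + 1.0D0
  210    CONTINUE
         TOTAL = TOTAL + COST(I)
  200 CONTINUE
C$OMP END PARALLEL DO

      WRITE(*,*) 'total work units =', TOTAL
      END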

Table 4
Elapsed timings and total speedups for CI-Singles^a energy and gradients calculation on acetyl-phenol (C8H8O2)

PEs   L502^b   L914   L1002   L703   Total^c   S^d   e
1     1249     2191   1103    269    4827      1.0   1.0
2     682      1144   578     141    2565      1.9   1.0
4     384      632    321     78     1449      3.3   0.8
8     209      370    192     45     851       5.7   0.7
16    150      242    112     31     593       8.1   0.5

^a Total of 154 basis functions; the basis sets are 6-311++G.
^b All timings are in seconds and correspond to elapsed time.
^c Total time to complete the run (elapsed time); all timings correspond to the OpenMP version.
^d Total speedup.
5. Summary

SCF and DFT calculations with and without first and second derivatives have been shown to run efficiently on systems with up to 16 processors (32 processors for the α-pinene frequency calculation). However, scaling is clearly dependent on the size of the problem. This is illustrated with α-pinene and taxol, two structurally different molecules. In this study we have looked at α-pinene as an example of a molecule that shows good scaling as a function of the basis set. Similarly, taxol, which is a larger molecule, shows an efficiency larger than 70% with 8 processors. Although this is a very good efficiency and consistent with Amdahl's law, it is slightly smaller than that of α-pinene for the same number of processors. Since the cutoffs are different for these two systems, the number of two-electron integrals and the load balance may differ between the two molecules. Further studies to provide more details on these differences are currently being carried out. The best performance was observed for the α-pinene frequency calculation; in particular, L703 and L1110 showed good scalability up to 32 processors.

The CIS approximation provides a direct method of computing excited states with minimal disk storage. Due to its simplicity, CIS provides a good approximation for large systems. Our calculations show good speedups for the links that handle the derivatives and a good overall speedup throughout the entire calculation (up to 16 processors).

OpenMP provides a simple way to parallelize even complicated loops such as the one presented in Fig. 2, and it is also well suited to replace proprietary directives. In this work we found a one to one mapping between either Cray directives on vector machines or SGI Origin directives and OpenMP directives. In terms of performance, we not only found no degradation in moving to OpenMP, but we were pleasantly surprised to observe that scalability was slightly better for large numbers of processors.


Acknowledgements

The authors thank the Corporate Computer Network at Silicon Graphics, Inc. in our Eagan facilities for the time and resources provided to carry out this study. We also would like to thank Dr. Ramesh Menon and Dr. Joseph Ochterski for valuable discussions.

References

[1] M.J. Flynn, Proc. of the IEEE 54 (1966) 1901.
[2] M.J. Flynn, IEEE Trans. on Computers C-21 (1972) 948.
[3] D.P. Turner, G.W. Trucks, M.J. Frisch, Ab initio quantum chemistry on a workstation cluster, in: T.G. Mattson (Ed.), Parallel Computing in Computational Chemistry, ACS Series 592, American Chemical Society, Washington, DC, p. 62.
[4] J. Ochterski, C.P. Sosa, J. Carpenter, Performance of parallel Gaussian 94 on Cray Research vector supercomputers, in: B. Winget, K. Winget (Eds.), 1996 Cray User Group Proceedings, Cray User Group, Shepherdstown, WV, p. 108.
[5] C.P. Sosa, J. Ochterski, J. Carpenter, M.J. Frisch, J. Comp. Chem. 19 (1998) 1053.
[6] D. Cortesi, ORIGIN2000 Performance Tuning and Optimization, Silicon Graphics, Mountain View, CA, 1998.
[7] R.J. Harrison, R. Shepard, Annu. Rev. Phys. Chem. 45 (1994) 623 (an extensive review of previous work on parallelizing chemistry codes).
[8] R.J. Harrison, R. Shepard, Special issue on parallel computing in chemical physics, Theoretica Chimica Acta 84 (1993) 225-474.
[9] R.J. Harrison, Chemical Design Automation News 8 (1993) 27.
[10] R.J. Harrison, Int. J. Quantum Chem. 40 (1991) 847.
[11] D.E. Lenoski, W.D. Weber, Scalable Shared-Memory Multiprocessor, Morgan Kaufmann, San Francisco, 1995.
[12] L. Dagum, R. Menon, IEEE Computational Sci. Engrg. 5 (1998) 46.
[13] More information is available at the official web site: http://www.openmp.org.
[14] Cray Research, CF77 Commands and Directives SR-3771, Mendota Heights, MN, 1993.
[15] M.J. Frisch, G.W. Trucks, H.B. Schlegel, G.E. Scuseria, M.A. Robb, J.R. Cheeseman, V.G. Zakrzewski, J.A. Montgomery Jr., R.E. Stratmann, J.C. Burant, S. Dapprich, J.M. Millam, A.D. Daniels, K.N. Kudin, M.C. Strain, O. Farkas, J. Tomasi, V. Barone, M. Cossi, R. Cammi, B. Mennucci, C. Pomelli, C. Adamo, S. Clifford, J. Ochterski, G.A. Petersson, P.Y. Ayala, Q. Cui, K. Morokuma, D.K. Malick, A.D. Rabuck, K. Raghavachari, J.B. Foresman, J. Cioslowski, J.V. Ortiz, B.B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. Gomperts, R.L. Martin, D.J. Fox, T. Keith, M.A. Al-Laham, C.Y. Peng, A. Nanayakkara, M. Challacombe, P.M.W. Gill, B. Johnson, W. Chen, M.W. Wong, J.L. Andres, C. Gonzalez, M. Head-Gordon, E.S. Replogle, J.A. Pople, Gaussian 98 Rev. A.6, Gaussian, Pittsburgh, PA, 1998.
[16] M. Svensson, S. Humbel, R.D.J. Froese, T. Matsubara, S. Sieber, K. Morokuma, J. Phys. Chem. 100 (1996) 19357.
[17] M.J. Frisch, A.B. Nielsen, Æ. Frisch, Gaussian 98 Programmer's Reference, Gaussian, Pittsburgh, PA, 1998, p. 5.
[18] MIPSpro FORTRAN 77 Programmer's Guide, Silicon Graphics, 1996.
[19] B. Leasure (Ed.), Parallel Processing Model for High-Level Programming Languages, American National Standard for Information Processing Systems, 1994.


[20] P.M.W. Gill, M. Head-Gordon, J.A. Pople, J. Phys. Chem. 94 (1990) 5564.
[21] W.J. Hehre, L. Radom, P.V.R. Schleyer, J.A. Pople, Ab initio Molecular Orbital Theory, Wiley, New York, 1985.
[22] A.D. Becke, J. Chem. Phys. 98 (1993) 5648.
[23] M.J. Frisch, Æ. Frisch, J.B. Foresman, Gaussian 98 User's Reference, Gaussian, Pittsburgh, PA, 1998.
[24] J.B. Foresman, M. Head-Gordon, J.A. Pople, M.J. Frisch, J. Phys. Chem. 96 (1992) 135.
[25] M.J. Frisch, Æ. Frisch, J.B. Foresman, Gaussian 94 Programmer's Reference, Gaussian, Pittsburgh, PA, 1995, p. 5.
