Extensible PGAS Semantics for C++


Nick Edmonds, Douglas Gregor, and Andrew Lumsdaine

Open Systems Laboratory, Indiana University, Bloomington, IN 47405

[email protected]

Abstract

The Partitioned Global Address Space model combines the expression of data locality in SPMD applications, which is crucial to achieving good parallel performance, with the relative simplicity of the Distributed Shared Memory model. C++ currently lacks language support for PGAS semantics; however, C++ is an excellent host language for implementing Domain-Specific Embedded Languages (DSELs). Leveraging these capabilities of C++, we have implemented the Partitioned Global Property Map, a DSEL library supporting PGAS semantics, polymorphic partitioned global data structures, and a number of useful extensions. The Partitioned Global Property Map library utilizes template meta-programming to allow direct mapping at compile time of high-level semantics to efficient underlying implementations. It combines flexible, extensible semantics, high performance, and portability across different low-level communication interfaces to allow PGAS programs to be expressed in C++.

1. Introduction

A wide variety of HPC applications, ranging from computational fluid dynamics to N-body simulation, have been parallelized using the Single-Program Multiple-Data (SPMD) technique. Within the SPMD domain there are several potential programming models that vary based on how data is viewed by the concurrent programs. Shared memory models assume that data exists in a shared address space available to all programs. Message passing models assume that each program has a private address space and that data is moved between address spaces using explicit messages. The Distributed Shared Memory (DSM) model allows programmers to program the conceptually simpler shared memory model, with remote memory accesses implemented via a runtime which utilizes message passing or RMA on behalf of the user. The major shortcoming of the DSM model is that it does not allow programmers to differentiate between local and remote data in order to express the data locality available in the application. This can lead to excessive remote memory accesses and reduced performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PGAS '10, New York, New York, USA. Copyright © 2010 ACM 978-1-4503-0461-0/10/10...$10.00.

The Partitioned Global Address Space (PGAS) model addresses this shortcoming of the DSM model by providing methods to explicitly identify local and remote data and to express the spatial locality available in applications in order to achieve high performance. The PGAS model has spawned a variety of languages and language extensions [5, 6, 24, 29]. Unified Parallel C (UPC) [28] combines PGAS semantics with the performance of C and is a widely utilized language within the PGAS community.

Although sometimes viewed as "C with classes", C++ is a multi-paradigm language that supports a variety of modern software engineering techniques. While C++ currently lacks language support for PGAS semantics, techniques such as generic programming allow domain-specific embedded languages (DSELs) to be implemented as libraries without loss of performance. PGAS semantics are, in a sense, a DSEL with respect to C++. We have developed a library-based implementation of this PGAS DSEL by utilizing a performance-preserving composition of existing libraries that support various styles of parallelism with a new interface layer which maps them to the PGAS DSEL. We call this implementation Partitioned Global Property Maps.

Partitioned Global Property Maps are based on the Boost Property Map library [4]. A property map is an abstraction of a map (in a general sense) from keys to values. Property maps abstract data access from the underlying storage implementation, as well as encapsulating data access policies (read-only vs. read/write, etc.).
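The property map idea can be sketched in a few lines: free get()/put() functions decouple data access from the container that stores the values. The type and function names below are illustrative stand-ins, not the actual Boost.PropertyMap components.

```cpp
#include <cassert>
#include <map>
#include <string>

// A minimal associative property map: values live in a std::map,
// but callers only see the get()/put() access functions.
template <typename Key, typename Value>
struct assoc_property_map {
    std::map<Key, Value> store;
};

// Read access: returns a default-constructed Value for unknown keys.
template <typename Key, typename Value>
Value get(const assoc_property_map<Key, Value>& pm, const Key& k) {
    auto it = pm.store.find(k);
    return it == pm.store.end() ? Value() : it->second;
}

// Write access.
template <typename Key, typename Value>
void put(assoc_property_map<Key, Value>& pm, const Key& k, const Value& v) {
    pm.store[k] = v;
}
```

Because algorithms are written against get()/put() rather than a concrete container, the storage (an array, a hash map, a remote process) can be swapped without touching the algorithm.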
The core element of Partitioned Global Property Maps is a polymorphic global property map partitioned across multiple processes. Partitioned Global Property Maps consists of three primary layers. The interface layer makes PGAS-style data access semantics available to the user. The communication layer implements data movement, coalescing, and serialization. Between these two layers exist an addressing scheme for mapping data to processes and a mechanism for caching and maintaining the consistency of remote data. These three layers are combined using the generic programming paradigm to map abstract interfaces to efficient, concrete implementations at compile time. This approach allows a new communication layer or caching strategy to be easily incorporated into Partitioned Global Property Maps. Additionally, the generic design enables the performance-preserving composition of Partitioned Global Property Maps with third-party parallel libraries and runtime systems such as OpenMP [1], TBB [12], PFunc [15], and others.

Our library-based solution has the additional benefit of being flexible and extensible, allowing new features that may be useful in the context of the PGAS model to be explored. In particular, extending the PGAS semantics provided by Partitioned Global Property Maps is straightforward because it does not require modifying a compiler, as with language-based approaches. Because the data access semantics are abstracted from the communication layer, porting the library to new platforms only requires re-implementing or modifying the underlying communication layer.

Section 2 briefly explains the generic programming approach employed in our design, while Section 3 explains the design in detail. We evaluate the performance of Partitioned Global Property Maps in Section 6. We review a variety of related work in Section 7 and present some conclusions and future directions in Section 8.

2. Generic Programming

Generic programming is a methodology for designing software libraries that are as general as possible, with the goal of allowing maximal code re-use. The term "generic programming" is perhaps somewhat misleading because it is about much more than simply programming per se. Fundamentally, generic programming is a systematic approach to classifying entities within a problem domain according to their underlying semantics and behaviors. This attention to semantic analysis leads naturally to the essential (i.e., minimal) properties of the components and their interactions. Basing component interface definitions on these minimal requirements provides maximal opportunities for re-use. Several examples of generic programming in C++ exist; the best known is the Standard Template Library [2, 27], part of the C++ Standard Library [13].

Generic programming works by defining algorithms that are parametrized on the types of the input arguments they accept. Each algorithm is lifted: the minimal constraints on its inputs (the most generic implementation) are found, and the algorithm is then implemented to accept any argument types that satisfy those constraints. The PGAS model is itself the result of a lifting process on the sequential execution model, in which the requirement that data exist in a single shared address space is removed. Specialization can be used to take advantage of type-specific information to enable optimal performance. Due to specialization and the sophistication of C++ compilers, generic algorithms can typically be instantiated with performance comparable to non-generic versions.

The basic idea in generic programming is that of a concept. A concept defines a set of requirements on a type or set of types; concepts allow common requirements to be reused across many algorithms. An algorithm imposes the requirement that its argument types satisfy the requirements of a given set of concepts. Types then model a concept by providing the components that concept needs; for example, they can define particular functions or types. In C++, the requirements in most concepts can be satisfied without any changes to an existing type: external functions and type definitions can be provided by users of a third-party library without modifying its source code. Unlike some competing approaches, the generic programming approach does not require any modifications to existing data structures, and does not require cumbersome wrapper classes. In the STL, for example, normal C/C++ pointers are models of the iterator concepts, even though they are primitive types.

Previous work has illustrated the effectiveness of generic programming in allowing the compositional construction of parallel libraries [9]. This same approach allows us to implement Partitioned Global Property Maps in a manner that is agnostic with respect to the underlying manner in which parallelism is expressed, as well as to the fashion in which parallel execution contexts communicate. Divorcing the method of achieving parallelism from the interface layer allows us to perform a variety of optimizations in a portable fashion.

3. Design

Partitioned Global Property Maps are designed as a distributed shared memory (DSM) framework with the ability to differentiate between local and remote data where necessary. The keys in a Partitioned Global Property Map are models of the Global Descriptor concept. A global descriptor represents an entity that is owned by some process and may reside in an address space not accessible to the currently-executing process. The global descriptor consists of two parts: the owner of the entity, which is the identifier of the process in which the entity resides, and a local descriptor, which uniquely identifies the entity within the address space of the owner. Partitioned Global Property Maps use a loose consistency model: reads return a default value until a memory barrier loads up-to-date information, and a write may not take effect until a barrier. A user-defined operation combines conflicting writes to the same location.

Figure 1 shows the architecture of an application using Partitioned Global Property Maps. Partitioned Global Property Maps utilize the Process Group concept, which abstracts the notion of a set of coordinating processes. The Process Group concept is in turn modeled by a particular implementation which utilizes a low-level communication

library such as MPI [20], GASNet [3], or a vendor-supplied library such as IBM’s Deep Computing Messaging Framework (DCMF) [17], IBM’s Low-level Application Programming Interface (LAPI) [26], Myrinet Express (MX) [8], or others. This decomposition allows semantic extensions to be agnostic with respect to the underlying data transport layer.
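The Global Descriptor concept described above can be sketched as a simple (owner, local descriptor) pair; the names here are illustrative, not the library's actual types.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a Global Descriptor: a key identifies an entity by the
// process that owns it plus a descriptor that is unique within the
// owner's address space.
struct global_descriptor {
    int owner;          // rank of the process owning the entity
    std::size_t local;  // unique id within the owner's address space

    // True if the entity resides in the calling process's address space.
    bool is_local(int my_rank) const { return owner == my_rank; }
};
```

An algorithm holding a global_descriptor can thus decide locally, without communication, whether an access will be local or remote.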

Fig. 1: Partitioned Global Property Map architecture. [Figure: applications sit atop the Partitioned Global Property Map, which sits atop a Process Group backed by MPI, GASNet, vendor communication libraries, etc.]

Fig. 2: Logical organization of data for a Partitioned Global Property Map. [Figure: each of processes 0 through N-1 holds a shared local portion of the distributed data plus private ghost cells.]

Partitioned Global Property Maps lift away from the original property map library the implicit assumption that all mapped values exist in a single logical address space. The library provides the same interface functions as the sequential property map library: get() and put() methods for reading and writing, respectively. In the case of remote keys, get() and put() are non-blocking calls; the implications of this are explored in Section 3.5. A synchronize() method acts as a memory barrier across all processes; it groups accesses to a Partitioned Global Property Map into epochs and enforces completion of the operations performed during an epoch at the end of that epoch. Two additional functions, request() and cache(), are also provided, for clarity and performance reasons explained in Section 5.

Partitioned Global Property Maps is implemented as an adaptor on top of the sequential property map library and consists of the following elements:

• An addressing scheme (i.e., a method of mapping keys to logical address spaces)
• Storage for local values
• Cached copies of remote values
• A method of copying values between logical address spaces
• Methods of resolving conflicting writes

The execution model for Partitioned Global Property Maps is that of concurrent processes with separate logical address spaces. These processes may be mapped onto a variety of system-level parallel constructs, including threads sharing a single physical address space; semantically, however, the Partitioned Global Property Maps interface treats these processes as if they had separate logical address spaces.
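As a rough illustration of these epoch semantics, the following single-process mock (all names are hypothetical, not the library's API) buffers put() operations and applies them only at synchronize(), combining conflicting writes with a reducer and returning a default value for keys not yet loaded:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Mock of the epoch-based consistency model: writes are buffered
// during an epoch and resolved at the synchronize() barrier.
template <typename Key, typename Value>
struct epoch_map {
    std::map<Key, Value> values;                     // committed values
    std::vector<std::pair<Key, Value>> pending;      // writes this epoch
    std::function<Value(Value, Value)> reduce;       // conflict resolution
    Value def{};                                     // default for unseen keys

    // Reads return the default until a barrier commits up-to-date data.
    Value get(const Key& k) const {
        auto it = values.find(k);
        return it == values.end() ? def : it->second;
    }

    // Non-blocking write: buffered until the end of the epoch.
    void put(const Key& k, const Value& v) { pending.push_back({k, v}); }

    // Epoch barrier: apply buffered writes, combining conflicts.
    void synchronize() {
        for (auto& [k, v] : pending) {
            auto it = values.find(k);
            if (it == values.end()) values[k] = v;
            else it->second = reduce(it->second, v);
        }
        pending.clear();
    }
};
```

In the real library the barrier additionally moves data between processes; this mock only shows the visibility rules a user observes.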

3.1 Addressing

In order to allow the underlying storage for a Partitioned Global Property Map shared by all processes to be composed of disjoint pieces of memory from multiple address spaces, a method of mapping keys to address spaces must exist; we call this the Distribution of the Partitioned Global Property Map. For performance reasons the distribution is assumed to be static and simple to compute, though extensions that allow dynamic addressing are possible. In the simplest case, the distribution is nothing more than a hash function. The distribution is assumed to be consistent across all processes, and every process must be capable of resolving any key to an address space.
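The simplest distribution mentioned above, a hash of the key, might look as follows; every process evaluates the same function and therefore agrees on the owner of any key without communication (a sketch, not the library's distribution objects):

```cpp
#include <cassert>
#include <cstddef>

// Hash-style distribution: map a key to an owning process by
// reducing it modulo the number of processes.
inline int owner_of(std::size_t key, int num_procs) {
    return static_cast<int>(key % static_cast<std::size_t>(num_procs));
}
```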

3.2 Local Storage

Partitioned Global Property Maps re-uses the sequential property map library to store local values, for two reasons. First, this design allows any improvements or extensions to the sequential library to be included in Partitioned Global Property Maps automatically. Second, it allows existing algorithms that use the sequential property map interface to operate directly on the local portion of a Partitioned Global Property Map with no modifications.

3.3 Caching Remote Values

For values associated with remote keys, a Partitioned Global Property Map maintains a set of ghost cells that are created automatically for any put() or get() request and store the most recent value known to the local process. How these ghost cells are stored and managed can be controlled by the application. The simplest scheme is unbounded growth of the ghost cell set, stored in an associative container such as a std::map. More complex schemes combine caching strategies of varying associativity with an upper limit on the size of the ghost cell set and an eviction policy such as least-recently-used.

Figure 2 shows the logical organization of data for a Partitioned Global Property Map. Data for keys local to a given process is stored in a memory region that is shared amongst all processes. Additionally, processes can maintain cached copies of remote data in their private memory regions. While contiguous storage is shown, Partitioned Global Property Maps places no restrictions on how local data or ghost cells are stored; the types of the underlying storage containers are template parameters. The global and local memory regions are logical constructs and do not imply that direct remote memory access is possible; the actual method by which remote data is manipulated is not specified and depends on the process group implementation.
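A bounded ghost-cell set with least-recently-used eviction, one of the caching strategies mentioned above, could be sketched as follows; the interface is illustrative, not the library's GhostCellStorage parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Ghost-cell set with a fixed capacity and LRU eviction: the list
// keeps cells in recency order (front = most recent), the map gives
// O(1) lookup into the list.
template <typename Key, typename Value>
class lru_ghost_cells {
    using cell_list = std::list<std::pair<Key, Value>>;
    std::size_t capacity_;
    cell_list cells_;
    std::unordered_map<Key, typename cell_list::iterator> index_;

public:
    explicit lru_ghost_cells(std::size_t cap) : capacity_(cap) {}

    // Store (or refresh) the cached value for a remote key.
    void cache(const Key& k, const Value& v) {
        auto it = index_.find(k);
        if (it != index_.end()) cells_.erase(it->second);
        cells_.push_front({k, v});
        index_[k] = cells_.begin();
        if (cells_.size() > capacity_) {  // evict least recently used
            index_.erase(cells_.back().first);
            cells_.pop_back();
        }
    }

    // Return the cached value, if any, and mark the cell most recent.
    std::optional<Value> lookup(const Key& k) {
        auto it = index_.find(k);
        if (it == index_.end()) return std::nullopt;
        cells_.splice(cells_.begin(), cells_, it->second);
        return it->second->second;
    }
};
```

A write-back policy (the Flush option of Section 5.1) would additionally put() the evicted value to its owner before discarding it.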

3.4 Remote Data Access

The Process Group is a concept that specifies the behavior of the communication layer but not its precise type or implementation. Process group implementations must provide send() and receive() functions to move key-value pairs between processes, as well as a synchronize() operation that functions as a distributed memory fence and enforces the delivery of, and if necessary the reply to, all remote memory operations. Process groups may optionally provide additional features such as message coalescing, early delivery (before synchronize() is called) of remote memory accesses via callbacks registered by a Partitioned Global Property Map, and collective operations. Figure 3 illustrates the process group concept taxonomy, containing both shared-memory and distributed-memory process group concepts. Each process group in the taxonomy represents a particular method of modeling the Process Group concept. Multiple Partitioned Global Property Map instances can share a single process group.

3.5 Conflicting Write Resolution

Two types of potential data consistency issues exist in Partitioned Global Property Maps. First, when a process accesses a remote key via get(), a message is generated to request the value associated with that key from the owning process. Because get() is a non-blocking call, it must return some value prior to the receipt of the current value as determined by the owning process. If a locally-cached copy of the requested value exists in the ghost cell set, it is returned; however, even if no locally-cached copy exists, some value must be returned. While it is possible for a process to use the distribution to determine whether a key is local or remote, in some cases a sensible default value can be specified, avoiding the need to perform this determination. One example is a shortest-paths computation on a graph, where the distance to remote vertices can be specified to be infinity when no lower distance is known. Alternatively, a sentinel value can be returned by default to automate the process of determining whether a key is local or remote, if such a value is available.

The second possible consistency issue occurs when multiple processes write to the same key during an epoch. Because there is no ordering of operations within epochs, some method of resolving multiple writes to the same key must be specified. This may consist of allowing an arbitrary write to succeed, or of combining the writes using a user-specified reduction operation such as std::plus, std::min, etc. The desired behavior depends on the semantics of the data being stored and can thus be specified separately for each Partitioned Global Property Map instance.
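For the shortest-paths example above, a reducer in the style described in Section 5 might supply "infinity" as the default for unknown remote distances and resolve conflicting writes by taking the minimum. The shape of this function object is illustrative of the text, not the library's exact reducer signature.

```cpp
#include <cassert>
#include <limits>

// Reducer for shortest-path distances: default is +infinity,
// conflicting writes are combined by taking the smaller distance.
struct min_distance_reducer {
    // Default value returned for keys with no known distance yet.
    double operator()() const {
        return std::numeric_limits<double>::infinity();
    }
    // Resolve two conflicting writes to the same key.
    double operator()(double x, double y) const {
        return x < y ? x : y;
    }
};
```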

4. Implementation

The features described are sufficient to implement basic PGAS semantics, i.e., the ability to read and write primitive data types in a Partitioned Global Property Map distributed across multiple address spaces. The process group used for communication (ProcessGroup), addressing scheme (GlobalMap), local storage container (StorageMap), and method of caching remote values (GhostCellStorage) are provided as type parameters to the Partitioned Global Property Map using the following class template:

    template<typename ProcessGroup, typename GlobalMap,
             typename StorageMap, typename GhostCellStorage>
    class distributed_property_map;

In this class template the GlobalMap is a property map, typically read-only, that maps keys in a Partitioned Global Property Map to the owning address space. This does not imply that each process must explicitly store the location of every element in the Partitioned Global Property Map. The common case is that this property map is a wrapper around a statically computable distribution; however, the interface allows any underlying implementation, provided that each process can compute the location of every element. A variety of static distribution objects are supplied with Partitioned Global Property Maps:

• Uniform block distributes contiguous blocks of keys uniformly across processes.
• Uneven block distributes contiguous blocks of keys across processes with non-uniform block sizes.
• Block cyclic distributes uniform blocks of keys across processes in a cyclic fashion.
• Random block distributes uniform blocks of keys across processes in a random fashion.
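As a concrete example of a statically computable distribution, the uniform block scheme above can be evaluated by any process in constant time; this sketch (names are illustrative, not the supplied distribution objects) splits N keys into contiguous blocks of ceil(N/P) keys:

```cpp
#include <cassert>
#include <cstddef>

// Uniform block distribution: key k belongs to the process holding
// the contiguous block of size ceil(n / p) that contains k.
inline int uniform_block_owner(std::size_t key, std::size_t n,
                               std::size_t p) {
    std::size_t block = (n + p - 1) / p;  // ceil(n / p)
    return static_cast<int>(key / block);
}
```

Because the mapping is pure arithmetic on the key, it satisfies the requirement that every process can resolve any key to an address space without communication.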

Fig. 3: Process Group taxonomy. [Figure: the Process Group concept is refined into shared-memory process groups (locking and lock-free) and distributed-memory messaging process groups (BSP, batch, and immediate), with the MPI process group as a concrete model.]

Figure 4 shows an example of a program fragment utilizing a Partitioned Global Property Map.

5. Extensions

In addition to reading and writing primitive data types, Partitioned Global Property Maps has a number of extensions that allow more complex behavior. The simplest of these is an additional pair of interface functions. The first, request(), is merely a get() with no return value; it makes the common operation of requesting an up-to-date copy of a remote value more explicit. The second, cache(), stores a value for a non-local key in the local ghost cell set but does not send this value to the owning process. This can be useful for reducing network communication by storing intermediate values locally and performing reductions on the sender where possible.

A more complex set of extensions to the basic PGAS semantics is contained in the conflicting-write-resolution functionality of Partitioned Global Property Maps. This functionality is encapsulated in a runtime-assignable reducer function object. The reducer takes two values, performs a user-defined reduction, and returns the result. When invoked with no arguments, it can also (optionally) return the default value for non-local keys. This relatively simple functionality, combined with the fact that the values stored by a Partitioned Global Property Map need not be primitive data types, enables more complex behavior. Figure 5 shows an example of a reducer function object that appends values to a Partitioned Global Property Map storing, for each key, a container supporting insert() (such as std::vector). One-sided accumulate operations such as those defined in the MPI-2 standard [21] are also easily implemented with custom reducers on singular data types.
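The sender-side reduction enabled by cache() can be illustrated with a small mock: instead of sending every update to the owner, updates are combined into the local ghost cell and only the combined value is sent at the end of the epoch. The flush step and all names here are hypothetical, not the library's API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Mock of sender-side reduction via cache(): combine updates for a
// remote key locally, then emit one message per key at the barrier.
template <typename Key, typename Value, typename Reduce>
struct sender_side_cache {
    std::map<Key, Value> ghost;  // locally combined values, not yet sent
    Reduce reduce;

    // Combine an update into the local ghost cell; no message is sent.
    void cache(const Key& k, const Value& v) {
        auto it = ghost.find(k);
        if (it == ghost.end()) ghost[k] = v;
        else it->second = reduce(it->second, v);
    }

    // At synchronize(): one message per key instead of one per update.
    std::vector<std::pair<Key, Value>> flush() {
        std::vector<std::pair<Key, Value>> out(ghost.begin(), ghost.end());
        ghost.clear();
        return out;
    }
};
```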

Depending on the semantics appropriate for a given Partitioned Global Property Map, remote put() operations can either be buffered until the end of an epoch and explicitly received, or applied as they are received. In the latter case the Partitioned Global Property Map uses a trigger interface in the process group to register callbacks that handle remote updates as they arrive. This mechanism can also be used to implement ad-hoc active messages and other asynchronous behavior. The process group invokes the registered callback once per remote get()/put() received and then frees the message buffer. This reduces memory consumption relative to buffering all messages until the end of an epoch.

5.1 Consistency Models

The final set of extensions to basic PGAS semantics controls the way non-local data is managed at the conclusion of an epoch or when ghost cells are evicted. Partitioned Global Property Maps allows an upper bound to be placed on the number of remote values cached in ghost cells in order to provide a predictable memory footprint. The maximum number of ghost cells is specified by set_max_ghost_cells(), and the eviction policy used when this limit is exceeded is contained in the GhostCellStorage class supplied to the Partitioned Global Property Map. Six individual consistency model options exist, and they may be composed into arbitrary combinations:

Forward consistency is the traditional model in which updates (via put()) are sent to the owner of the key being written.

Backward consistency specifies that once the owner applies the reduction operator and determines the final result of all writes to a given key during an epoch, this final value is propagated back to all processes that have local copies of the key cached in ghost cells. This functionality acts as a broadcast and is useful for communication patterns where the owner of the data knows which other processes need it and can thus push it in a one-sided fashion.

Bidirectional consistency is the combination of the forward and backward models. This in effect transforms the reduce-on-write model into an all-reduce, where all processes observe the result of the reduction on data they have written, as well as receiving updated copies of any other ghost cells.

Flush specifies that values are put() to the owning process when they are evicted from a ghost cell set, before being discarded. This ensures that no writes are lost when the size of the ghost cell set is bounded and cached values are modified locally, allowing write-back behavior rather than the write-through behavior of forward consistency.

Reset specifies that all ghost cells are returned to the default value at the conclusion of an epoch. This eliminates the need to re-allocate the same ghost cells every epoch in cases where the working set is static but the values from a previous epoch are no longer valid.

Clear specifies that ghost cell sets are emptied and all ghost cells deleted at the conclusion of an epoch. In unstructured applications where data from previous epochs is unlikely to be re-used, this reduces memory utilization.

None: if none of the above consistency model options is specified, the Partitioned Global Property Map performs no remote communication and instead acts as a local cache for remote keys. This option is useful where updates to remote data are handled through another mechanism, as in the case where atomicity is required for groups of operations rather than for single memory accesses.

    // A process group communicating over MPI
    mpi_process_group pg;
    mpi_process_group::process_id_type pid = process_id(pg);    // MPI_Comm_rank
    mpi_process_group::process_size_type P = num_processes(pg); // MPI_Comm_size

    // Property map storage for local keys
    boost::shared_array_property_map storage_map(10);
    // Use default ghost cell storage;
    // global_map maps keys to processes
    distributed_property_map dpmap(pg, global_map, storage_map);
    // Make sure that dpmap is initialized on all
    // processes before remote operations occur
    synchronize(dpmap);
    // Shift the values in dpmap around a ring
    for (size_t i = pid*10; i < (pid + 1)*10 % P*10; i = i + 1 % P*10)
        put(dpmap, i + 10 % P*10, get(dpmap, i));
    // Wait for the above operations to complete
    synchronize(dpmap);

Fig. 4: Example Partitioned Global Property Map usage.

    template<typename ValueType>
    struct append_reducer {
        // Return the default value of a remote key
        template<typename Key>
        ValueType operator()(const Key&) { return ValueType(); }

        // Combine two values x and y and return the result
        template<typename Key>
        ValueType operator()(const Key&, const ValueType& x,
                             const ValueType& y) {
            ValueType z(x.begin(), x.end());
            z.insert(z.end(), y.begin(), y.end());
            return z;
        }
    };

Fig. 5: Example of a reducer which appends to a container.

5.2 Extensible Semantics

Partitioned Global Property Maps implements basic PGAS semantics at its core but, due to its layered, flexible, and extensible design, allows the implementation of more complex behavior as well. Some of the extensions described are merely syntactic sugar on top of basic PGAS semantics, while others add completely new functionality. The ability to map keys to arbitrarily complex data types with user-defined reduction operations allows the implementation of functionality beyond simple data movement. Because Partitioned Global Property Maps is implemented as a generic library rather than in a compiler, syntactic extensions are straightforward and implementing new functionality is less complicated than modifying a compiler. This makes Partitioned Global Property Maps an ideal framework in which to explore semantic additions to the PGAS model. Because Partitioned Global Property Maps is a portable, user-level library, functionality implemented within it can be leveraged on a variety of platforms without requiring a new compiler for each one.

Fig. 6: Performance of Partitioned Global Property Maps and UPC on the random access benchmark (32 MiB of data and 1M random accesses per process). [Figure: log-log plot of time in seconds versus number of nodes, 1 to 128.]

6. Evaluation

For this paper we evaluate the performance and scalability of Partitioned Global Property Maps compared to perhaps the most closely related PGAS language, UPC. We present performance results on Odin, a 128-node InfiniBand cluster (Single Data Rate). Each node is equipped with two 2 GHz dual-core Opteron 270 CPUs and 4 GiB RAM. We ran our experiments with Open MPI 1.4.2, OFED 1.3.1, and UPC 2.10.2. UPC uses GASNet as its underlying communication library, while Partitioned Global Property Maps uses the mpi_process_group, which communicates over MPI using point-to-point operations.

We use a simple random access benchmark which is very similar, but not identical, to the HPCC RandomAccess benchmark [19]. We also use a benchmark intended to replicate a common task in HPC applications: loading unsorted data in parallel on individual nodes, then moving data elements to the process with which they have affinity according to a permutation vector distributed across all processes. We call this the Parallel IO benchmark.

We have observed that running multiple MPI processes per node, each with a smaller portion of the total data, is less efficient than a single MPI process with more data when the benchmark in question stresses the local memory subsystem, as both our benchmarks do. This effect is likely due to the increased communication caused by partitioning the data into more pieces and the overhead of communicating through MPI to perform work on data that is present in a node's local memory. Thus, all of our tests used a single MPI process per node. In order to present a fair comparison we also run a single UPC thread per node.

Fig. 7: Performance of Partitioned Global Property Maps and UPC on the parallel IO benchmark (8 MiB per process). [Figure: log-log plot of time in seconds versus number of nodes, 1 to 128.]

Figure 6 shows a comparison of Partitioned Global Property Maps and UPC on the simple random access benchmark. The benchmark creates a shared array of size N evenly distributed across all processes and performs M updates per process to the shared array, in blocks of 1024 updates followed by a memory barrier. The UPC version of the benchmark uses upc_memput() and GASNet for communication. The Partitioned Global Property Maps version serializes the destination address and the value to write, coalesces the data into 64 KiB chunks (this size can be specified by the user), then sends the coalesced data to the owning process using MPI_Isend(), where it is deserialized and the write performed. This coalescing does not allow the Partitioned Global Property Maps implementation to buffer more than the permitted 1024 updates per process.

Figure 7 shows a comparison of Partitioned Global Property Maps and UPC on the Parallel IO benchmark. This benchmark creates three shared arrays of size N. The data and permutation arrays initially contain the indices in [0, N) in sorted order. Both arrays are then permuted using a permutation algorithm such as that due to Knuth [16]. The data array represents unsorted data such as that loaded from a parallel file system or other stable storage. The permutation array represents a mapping of data to the processes with which it has affinity; the actual values stored are indices into the output array. Affinity may exist for a variety of reasons, but one of the more common is to achieve good locality during subsequent computations. The permuted data array is computed as:

∀i ∈ [0, N): permuted_data[permutation[data[i]]] = data[i]

The UPC version of the benchmark uses upc_memget() (which, like upc_memput(), is blocking) to retrieve the value of permutation[i] for each index, then writes permuted_data[permutation[data[i]]] using upc_memput().
The Partitioned Global Property Maps version of the benchmark first requests the values of permutation[data[i]] for all elements of data local to a process, performs a memory barrier, then writes permuted_data[permutation[data[i]]], followed by another memory barrier. Message coalescing leads to better performance by allowing all the reads of permutation to be performed in parallel, as well as the writes to permuted_data. The same approach could be used in the UPC version by utilizing asynchronous one-sided operations and explicit caching, but Partitioned Global Property Maps makes the caching required by these asynchronous operations transparent to the user.

The random access benchmark shown in Figure 6 displays significant overhead in Partitioned Global Property Maps as compared to UPC. A large portion of this overhead is due to the serialization performed by the process group that Partitioned Global Property Maps uses for communication. Figure 8 shows a comparison of a simple MPI application which uses asynchronous send/receive operations (MPI_Isend()/MPI_Irecv()) between every pair of processes to communicate fixed-size blocks of contiguous data in a loop. One version uses the Boost.Serialization library [4] to serialize data incrementally, while the other uses MPI datatypes.

Fig. 8: Comparison of simple application using Boost.Serialization vs. MPI data types (8MiB per process, 100 iterations).

The mpi_process_group implements a custom Boost.Serialization archiver that uses MPI_Pack() to append to a data buffer, leading to extra function calls and data structure traversals. This approach can send more general objects than MPI datatypes and makes serializing arbitrarily complex types simple; however, it is significantly slower than utilizing MPI datatypes. Using generic programming it is possible to specialize the process group to use MPI datatypes rather than Boost.Serialization when only primitive data types are being communicated, or even to skip serialization altogether when running on a homogeneous system; however, these optimizations are currently not implemented. Coalescing buffers could, however, be packed once before being sent rather than incrementally as they are filled.

Another performance difference between the two implementations is in memory management. The mpi_process_group pre-allocates data buffers, but also uses MPI's memory management functions (MPI_Alloc_mem() and MPI_Free_mem()) directly; these functions are slow on some interconnects that require data to be pinned in memory. In a profile of the Δ-Stepping single-source shortest path (SSSP) algorithm in the Parallel BGL [10], another library which uses the mpi_process_group, 64 nodes solving the SSSP problem on a 2^27-vertex, 2^29-edge graph spend 29% of their runtime in MPI_Pack() and related functions. A further 31% of the runtime in the same profile is spent in MPI memory management functions.

7. Related Work

Several languages have been extended to support PGAS semantics. Unified Parallel C (UPC) [28] and Co-Array Fortran [25] provide extensions which support PGAS semantics in their respective host languages. Like the Partitioned Global Property Map, UPC provides functionality beyond basic PGAS semantics for reading and writing shared data; collective and asynchronous operations are also provided. These language extensions are both implemented via customized compilers. Java has also been extended to support the PGAS model as part of the Titanium [29] project.

A family of new languages designed in whole or in part to support the PGAS model also exists, including X10 [6] and Chapel [5], amongst others. X10 and Chapel are also explicitly designed with support for asynchrony, using features such as remote spawning of tasks, conditional atomic blocks, and futures. The trigger interface in the process group can be used to implement asynchronous behavior; however, the progress model in the mpi_process_group is coarse-grained and may be sub-optimal for applications with fine-grained dependency structures. Process groups could be implemented which have a finer-grained progress model.

Libraries which implement PGAS semantics also exist. Global Arrays [23] is a library which provides portable support for global shared arrays on NUMA architectures. Global Arrays uses ARMCI [22] as its remote memory access library. ARMCI provides some aggregation of remote memory operations in a similar but more restrictive manner than the message coalescing performed by Partitioned Global Property Maps' process group. Partitioned Global Property Maps provides similar functionality with regard to data access as Global Arrays; however, Partitioned Global Property Maps also includes built-in functionality for caching nonlocal values as well as user-specified consistency models for shared data. It would be possible to implement a process group that utilized ARMCI or another remote memory access (RMA) implementation, such as the one-sided operations specified in MPI-2 [21], for data movement; however, depending on the semantics of the RMA implementation utilized, implementing the asynchronous triggers provided by the mpi_process_group may be difficult.

Adsmith [18] is an object-based DSM system with an addressing scheme similar to Partitioned Global Property Maps. However, Adsmith uses release consistency and only implements load-store semantics, whereas Partitioned Global Property Maps implements richer semantics for manipulating shared objects and enforces consistency without explicitly acquiring and releasing objects. Object-based DSM extensions to Aleph [11] also exist [14]; however, object-based DSM systems which do not differentiate between local and remote data make expressing locality difficult. Partitioned Global Property Maps makes expressing locality explicit so that programmers can structure their programs in a fashion that exploits application locality to improve performance. The global associative tuplespace in Linda [7] is another example of an object-based DSM which makes expressing data locality difficult.
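The compile-time specialization of the process group mentioned in the performance discussion (raw MPI datatypes for primitive types, serialization otherwise) can be sketched with a C++17 type trait. This is a hypothetical illustration, not the library's API: serialize_to_buffer is an assumed stand-in for a Boost.Serialization archiver, and memcpy stands in for the actual MPI transfer so the sketch is self-contained:

```cpp
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Stub serializer standing in for Boost.Serialization (assumption:
// length-prefixed bytes; the real archiver's format differs).
std::vector<char> serialize_to_buffer(const std::string& s) {
  std::vector<char> buf(sizeof(std::size_t) + s.size());
  std::size_t n = s.size();
  std::memcpy(buf.data(), &n, sizeof(n));
  std::memcpy(buf.data() + sizeof(n), s.data(), n);
  return buf;
}

// Hypothetical sketch: pick a wire format at compile time.
template <typename T>
std::vector<char> pack(const T& value) {
  if constexpr (std::is_trivially_copyable_v<T>) {
    // Fast path: byte-for-byte copy, analogous to describing T with
    // an MPI datatype and sending it without serialization.
    std::vector<char> buf(sizeof(T));
    std::memcpy(buf.data(), &value, sizeof(T));
    return buf;
  } else {
    // Slow path: incremental serialization, analogous to the
    // MPI_Pack()-based Boost.Serialization archiver.
    return serialize_to_buffer(value);
  }
}
```

In a real process group the trait test would also have to account for heterogeneous systems, where even trivially copyable types may need byte-order conversion; the sketch ignores that concern.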

8. Conclusion

Partitioned Global Property Maps is a library for implementing programs expressed using the PGAS model in C++. Generic programming is used to decouple data access from the underlying communication infrastructure and to allow arbitrarily complex data types to be stored and manipulated. The fact that Partitioned Global Property Maps is a library rather than a compiler allows the semantics to be easily extended. In addition to basic PGAS semantics, flexible caching of remote data and user-defined consistency models are provided. A trigger interface in the underlying process group, which controls communication, allows arbitrary asynchronous operations to be easily implemented. Because it is a generic library, Partitioned Global Property Maps can be composed with other parallel libraries and run-time systems to provide PGAS-style data manipulation alongside other kinds of parallelism.

There are several avenues for extensions to Partitioned Global Property Maps. An implementation of the mpi_process_group that uses MPI datatypes rather than Boost.Serialization would trade some flexibility in the data types that could be sent for increased performance. Additional process groups could be implemented which utilize communication libraries capable of more directly leveraging hardware support, or whose interfaces map more directly to the PGAS model. Process groups optimized for short messages, as opposed to larger coalesced messages, may improve performance for applications that perform frequent synchronization operations.

Acknowledgements

This work was supported by a grant from the Lilly Endowment. The authors would also like to thank Jeremiah Willcock for useful discussions.

References

[1] OpenMP. http://www.openmp.org/.

[2] M. H. Austern. Generic Programming and the STL: Using and Extending the C++ Standard Template Library. Professional Computing Series. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1998.

[3] D. Bonachea. GASNet specification, v1.1. Technical report, University of California at Berkeley, Berkeley, CA, USA, 2002. http://gasnet.cs.berkeley.edu/CSD-02-1207.pdf.

[4] Boost. Boost C++ Libraries. http://www.boost.org/.

[5] D. Callahan, B. L. Chamberlain, and H. P. Zima. The Cascade high productivity language. In Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 52–60, April 2004.

[6] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Object-Oriented Programming, Systems, Languages, and Applications, pages 519–538, New York, NY, USA, 2005.

[7] D. Gelernter. Generative communication in Linda. ACM Transactions on Programming Languages and Systems, 7(1):80–112, 1985.

[8] P. Geoffray. Myrinet eXpress (MX): Is your interconnect smart? In High Performance Computing and Grid in Asia Pacific Region, pages 452–452, Washington, DC, USA, 2004. IEEE Computer Society.

[9] D. Gregor and A. Lumsdaine. Lifting sequential graph algorithms for distributed-memory parallel computation. In Object-Oriented Programming, Systems, Languages, and Applications, pages 423–437, October 2005.

[10] D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. In Workshop on Parallel Object-Oriented Scientific Computing, July 2005.

[11] M. Herlihy. The Aleph toolkit: Support for scalable distributed shared objects. In Workshop on Communication, Architecture, and Applications for Network-based Parallel Computing, January 1999.

[12] Intel. Threading Building Blocks. http://www.intel.com/cd/software/products/asmo-na/eng/294797.htm, August 2006.

[13] International Organization for Standardization. ISO/IEC 14882:1998: Programming languages — C++. Geneva, Switzerland, September 1998.

[14] R. Kaiabachev and B. Richards. Java-based DSM with object-level coherence protocol selection. In International Conference on Parallel and Distributed Computing and Systems, pages 648–653, Marina del Rey, CA, USA, November 2003.

[15] P. Kambadur, A. Gupta, A. Ghoting, H. Avron, and A. Lumsdaine. PFunc: Modern task parallelism for modern high performance computing. In Supercomputing, Portland, Oregon, November 2009.

[16] D. E. Knuth. The Art of Computer Programming II: Seminumerical Algorithms. Addison-Wesley, Reading, Massachusetts, second edition, 1981.

[17] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M. E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, and C. J. Archer. The Deep Computing Messaging Framework: Generalized scalable message passing on the Blue Gene/P supercomputer. In International Conference on Supercomputing, pages 94–103, New York, NY, USA, 2008. ACM.

[18] W.-Y. Liang, C.-T. King, and F. Lai. Adsmith: An efficient object-based distributed shared memory system on PVM. In International Symposium on Parallel Architectures, Algorithms, and Networks, pages 173–179, Los Alamitos, CA, USA, 1996. IEEE Computer Society.

[19] P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi. Introduction to the HPC Challenge benchmark suite. Technical Report ICL-UT-05-01, ICL, 2005.

[20] Message Passing Interface Forum. MPI, June 1995. http://www.mpi-forum.org/.

[21] Message Passing Interface Forum. MPI-2, July 1997. http://www.mpi-forum.org/.

[22] J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. In Parallel and Distributed Processing, Lecture Notes in Computer Science, pages 533–546. Springer-Verlag, 1999.

[23] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: A portable "shared-memory" programming model for distributed memory computers. In Supercomputing, pages 340–349. ACM Press, 1994.

[24] R. W. Numrich and J. Reid. Co-array Fortran for parallel programming. SIGPLAN Fortran Forum, 17(2):1–31, 1998.

[25] R. W. Numrich and J. Reid. Co-arrays in the next Fortran standard. SIGPLAN Fortran Forum, 24(2):4–17, 2005.

[26] G. Shah and C. Bender. Performance and experience with LAPI, a new high-performance communication library for the IBM RS/6000 SP. In International Parallel Processing Symposium, page 260, Washington, DC, USA, 1998. IEEE Computer Society.

[27] Silicon Graphics, Inc. SGI Implementation of the Standard Template Library, 2004. http://www.sgi.com/tech/stl/.

[28] UPC Consortium. UPC Language Specification, v1.2, May 2005. http://upc.lbl.gov/docs/user/upc_spec_1.2.pdf.

[29] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. In Workshop on Java for High-Performance Network Computing, New York, NY, USA, 1998. ACM Press.
