Task-Parallel versus Data-Parallel Library-Based Programming in Multicore Systems


Diego Andrade, Basilio B. Fraguela
University of A Coruña, Spain
{dcanosa,basilio}@udc.es

James Brodman, David Padua
University of Illinois at Urbana-Champaign, USA
{brodman2,padua}@uiuc.edu

Abstract—Multicore machines are becoming common. There are many languages, language extensions and libraries devoted to improving the programmability and performance of these machines. In this paper we compare two libraries that address the problem of programming multicores from two different perspectives: task parallelism and data parallelism. The Intel Threading Building Blocks (TBB) library separates logical task patterns, which are easy to understand, from physical threads, and delegates the scheduling of the tasks to the system. Hierarchically Tiled Arrays (HTAs), on the other hand, are data structures that facilitate locality and parallelism of array-intensive computations with a block-recursive nature, following a data-parallel paradigm. Our comparison considers both the ease of programming and the performance obtained with each approach. In our experience, HTA programs tend to be shorter than or as long as TBB programs, while the performance of both approaches is very similar.

I. INTRODUCTION

Processor manufacturers are building systems with an increasing number of cores. These cores usually share the higher levels of the memory hierarchy. Many language extensions and libraries have been developed to ease the programming of this kind of system. Some approach the problem from the point of view of task parallelism. The key notion is that the programmer divides the work into several tasks, which are mapped automatically onto physical threads that are scheduled by the system. Task parallelism can be implemented using libraries such as POSIX Threads [1], which provide minimal functionality; for this reason some consider this approach the assembly language of parallelism. Recently, the Intel Threading Building Blocks (TBB) library [2] has appeared, which allows parallelism to be expressed at a higher level of abstraction. A third task-parallel API is OpenMP [3], which is mainly based on simple compiler directives that guide the parallelization of regular loops, although a recently proposed extension [4] with a task-queuing mechanism broadens its scope of application. The task of writing parallel programs can also be approached from the point of view of data parallelism, where tasks perform the same operation on different pieces of data. Data parallelism can thus be seen as a subset of control or task parallelism, since in task-based parallelism tasks can perform the same or different operations on the same or different pieces of data. Parallel array computation has been widely used since the appearance of the first array and vector computers [5], playing a key role in the expression of parallelism in many languages such as High Performance Fortran [6] and its variants [7], ZPL [8], X10 [9], and others.


The historical experience of attempts to establish new languages focused on parallelism, coupled with the large base of existing legacy codes, has led many researchers to think that macros and, more generally, libraries are a better vehicle to bring parallelism to mainstream computing. Libraries such as POOMA [10] or POET [11] have explored this possibility. More recently, the Hierarchically Tiled Array (HTA) library [12], [13] has facilitated the writing of data-parallel programs, putting special emphasis on the concept of tiling [14] to express both locality and parallelism. With the arrival of multicores in all kinds of computing systems, from embedded devices to supercomputers, the relevance of parallel programming for these systems is enormous. As a result, there are large efforts from both industry and academia to improve productivity in the development of parallel codes as well as the resulting performance. In this context, a comparison between task and data parallelism is of interest. Our aim is to explore, in the scope of multicores, the impact of following either approach on productivity and performance, and to what extent the additional restrictions of data parallelism may discourage its use to parallelize certain problems. We have chosen as representatives of both approaches the Intel TBB library [2], which enables writing programs based on task parallelism, and the Hierarchically Tiled Array (HTA) library [12], [13], which facilitates the implementation of data-parallel programs. As far as we know, the HTA library is the most up-to-date library specifically designed around a data-parallel paradigm with good support for multicore systems. Among the large variety of approaches that allow task-parallel programming for these systems, the TBB library is the one that most closely resembles the HTA in terms of the level of abstraction and the interface provided to the programmer, thus enabling a fair comparison of both approaches. Another reason for choosing TBB is that, since the HTA library for shared memory is implemented on top of the TBB library, performance differences between both libraries can be more clearly attributed to the specific way in which they lead programmers to write their applications. This paper is organized as follows. Sections II and III summarize the main features of the HTA and TBB libraries, respectively. Section IV presents a high-level description of the algorithms used for the comparison and briefly discusses their implementation with both TBB and HTAs.

Fig. 1. Creation of an HTA

Section V describes the main differences between the TBB and HTA libraries, illustrating how data and task parallelism address the same problems using different approaches. Section VI compares quantitatively the codes written using both libraries, and Section VII presents the conclusions.

II. THE HTA LIBRARY

The Hierarchically Tiled Array (HTA) [12], [13] is an array data type that can be partitioned into tiles. Each tile can be either a conventional array or a lower-level HTA; thus, the HTA is a recursive data type. HTAs adopt tiling [14] as a first-class construct for array-based computations. Tiles have been used to express data parallelism [8], [6], [15], [16] and to improve the locality of accesses [17]. HTAs thus empower programmers to control data distribution and the granularity of computation explicitly through the specification of tiling. This feature makes the library more versatile and gives maximal control to the programmer, allowing the tiling to be tuned to achieve the best performance, as we will see in Section VI. Of course, the tiling used for the HTAs could actually be obtained by an automatic method that tries to choose the best one; this is outside the scope of our work.

Figure 1 shows the operations needed to create an HTA with 3 tiles of 4 elements each. The variable tiling, defined in line 1, specifies the number of elements or tiles for each dimension and level of the HTA, from the bottom to the top of its hierarchy of tiles. The alloc operation in the second line creates the HTA. The number of levels of tiling is passed as the first parameter to alloc. The tiling structure is specified by the second parameter. The third parameter selects the data layout, which can be row major (ROW), column major (COLUMN) or TILE, in which the elements within a tile are stored by rows in consecutive memory locations. The layout across tiles is always row major. The data type and the number of dimensions are template parameters of the HTA class.

While the tiling structure of an HTA is specified at creation time, it can be modified dynamically by adding or removing partition lines, the abstract lines that separate the tiles in an HTA. This generates new tiles or merges existing ones, respectively, in a process known as dynamic partitioning.

HTA indices are zero-based. Tiles or scalars of HTAs can be selected using n-Tuples. In this explanation the n-Tuple notation is replaced by a list of integers (x, y, . . .). HTAs can also be indexed using Ranges of the form low : step : high. The list of integers and ranges can be enclosed by the () operator, which selects tiles, or by the [] operator, which selects scalar elements of the HTA. For example, h(1)[2] yields element [2] within tile (1). If no tile is selected, the [] operator accesses the HTA as a conventional array.
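Figure 1 itself is not reproduced here, but the operations it describes can be sketched as follows; this is a hypothetical reconstruction based on the description above and on the interface used later in Figure 3 (the float element type and the variable names are assumptions, not taken from the figure):

    // Hypothetical sketch of the creation of an HTA with 3 tiles of 4 elements
    // each, following the description of Figure 1 and the interface of Figure 3.
    typedef HTA<float, 1> HTA_1;

    // Line 1 of the figure: 4 elements per tile (bottom level), 3 tiles (top level)
    Seq< Tuple<1> > tiling( Tuple<1>(4), Tuple<1>(3) );

    // Line 2 of the figure: one level of tiling; the TILE layout stores the
    // elements of each tile by rows in consecutive memory locations
    HTA_1 h = HTA_1::alloc(1, tiling, NULL, TILE);

    // Indexing: h(1) selects the second tile, h(1)[2] the third scalar inside it,
    // and h[6] reaches the same element treating h as a conventional array.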

HTAs facilitate parallel programming by providing numerous methods that operate in parallel across tiles. The main constructs are:

• hmap: applies a function to each element of the HTA or to corresponding elements of two or more conformable HTAs.

• reduce: performs a reduction, that is, it applies operations on an HTA to produce an HTA of lesser rank.

• scan: computes a prefix operation across all the elements of an array. Let op be an associative operation with left identity element id. The parallel prefix of op on a sequence x0, x1, . . ., xn-1 is another sequence y0, y1, . . ., yn-1 where y0 = id op x0 and yi = yi-1 op xi for i > 0. For example, the + scan of 3, 1, 4 with id = 0 is 3, 4, 8.

• mapReduce: performs a reduction on the results of a given operation applied to each element of an HTA. It is a mixture of the hmap and reduce constructs, following the principles of map-reduce [13].

The four constructs receive at least one argument, a function object (functor) whose operator() method encapsulates the operation to perform on a single tile of the HTA on which the construct is invoked. In the case of hmap, the function may accept additional HTAs as parameters, which must be conformable, that is, have the same tiling structure as the HTA instance on which the hmap is invoked. In this case, each invocation of the functor handles one tile from each of the HTAs involved. In any case, the functor handles a single tile per HTA in its operator(), so the indexing of the elements inside the operation is relative to the first position of the tile. Also, the operation on each tile is performed in parallel, at least conceptually. Data parallelism is enforced by the fact that the functors can only write to the tiles received in each single invocation or to the predefined outputs of the operation, such as in reductions. Thus, global data outside the specific tiles received in the invocation must be strictly read-only. Unfortunately, C++ does not provide mechanisms that would allow the library to preclude programmers from writing to global data outside the tiles in the functors that are applied in parallel. If they do so, they are certainly not following the data-parallel structured approach promoted by HTAs.

An interesting feature of HTAs, called overlapped tiling, is the ability to create and automatically manage shadow or ghost regions around each tile that contain a copy of the elements of neighboring tiles. This is particularly useful for stencil codes, which compute new values based on neighboring ones, as in the case of a(i) = a(i - 1) + a(i + 1).

Finally, it is worth noticing that, in contrast to the Intel TBB library, which is restricted to shared memory, HTAs support shared, distributed and hybrid memory systems.
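As a minimal illustration of these semantics, the following sketch (ours, not from the paper, and assuming the same interface used in Figure 3) adds the corresponding tiles of two conformable HTAs with hmap:

    typedef HTA<float, 1> HTA_1;

    // Invoked once per tile; each parameter is one tile of the corresponding HTA,
    // and indexing is relative to the beginning of that tile.
    struct AddTiles {
        void operator()(HTA_1 out, HTA_1 a, HTA_1 b) const {
            for (int i = 0; i != out.shape().size()[0]; ++i)
                out[i] = a[i] + b[i];   // writes only to the tile it received
        }
    };

    // Usage: out, a and b must be conformable (same tiling structure)
    // out.hmap(AddTiles(), a, b);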

III. THE INTEL TBB LIBRARY

The Intel Threading Building Blocks (TBB) library [2] was developed by Intel for the programming of multithreaded applications. In this case it is not necessary to use a special type of data structure, although the TBB library provides a set of special containers that can be manipulated concurrently. Unlike in the HTA case, however, these containers are not the native way to specify parallelism. As mentioned above, the TBB library enables the implementation of multithreaded task-parallel programs. It does not base the specification of parallelism on data operations that are inherently parallel, which is the HTA approach. Rather, parallelism is achieved by defining tasks that can be performed concurrently. The task scheduler then maps these tasks onto the available hardware threads. When there are more threads available than tasks, it can split an existing task into several smaller ones.

A. TBB operations

The element-by-element operation, reduction, and scan constructs are implemented in the TBB library using the parallel_for, parallel_reduce and parallel_scan algorithm templates, respectively. The TBB library also includes the algorithm templates parallel_do, which supports unstructured workloads where the loop limits are not known at the beginning of the loop, and pipeline, which is used when there is a sequence of stages that can operate in parallel on a data stream. The parallel_for, parallel_reduce and parallel_scan algorithm templates accept two basic parameters: a range defining the loop limits, and a function object representing the body of the parallel loop. This object overloads the operator() method and defines the operation to be performed on the assigned range. The range is split recursively into subranges by the task scheduler and mapped onto physical threads. The TBB library provides standard ranges, such as blocked_range, which expresses a linear range of values in terms of a lower bound, an upper bound and, optionally, a grain size. The grain size is a guide for the workload size per task, and its value affects the performance and load balance of the parallel operation. The TBB library also allows ad-hoc ranges to be created; that is, the user can define new range classes implementing specific policies that decide when and how to split, how to represent the range, etc. An example of the usage of an ad-hoc range will be shown in Section IV-C. Some additional features of the TBB library are a scalable and efficient memory allocator for multithreaded programs, mutual exclusion structures for explicit thread synchronization, support for atomic operations on primitive data types, and thread-aware timing utilities.

IV. IMPLEMENTATION OF SOME ALGORITHMS

The codes used in this comparison were taken from chapter 11 of [2], which contains examples of parallel implementations of algorithms using TBB¹. We chose these codes because at the time the experiments were done there were very few codes available written using TBB, and they cover a diversity of parallel computation patterns.

¹ These codes are in the public domain and can be downloaded from http://softwarecommunity.intel.com/articles/eng/1359.htm
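None of the listings below uses parallel_reduce, so the following minimal sketch may help illustrate the body protocol it requires; the class SumBody and the summation example are ours, not taken from [2]:

    #include <cstddef>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    // Hypothetical sketch: sums an array of floats with tbb::parallel_reduce.
    struct SumBody {
        const float* a;
        float sum;
        SumBody( const float* a_ ) : a(a_), sum(0) {}
        // Splitting constructor: creates a body for a subrange stolen by another thread.
        SumBody( SumBody& other, tbb::split ) : a(other.a), sum(0) {}
        // Accumulates one subrange into this body's partial sum.
        void operator()( const tbb::blocked_range<size_t>& r ) {
            float s = sum;
            for( size_t i = r.begin(); i != r.end(); ++i ) s += a[i];
            sum = s;
        }
        // Combines the partial sum of another body once its subrange is finished.
        void join( const SumBody& rhs ) { sum += rhs.sum; }
    };

    float parallel_sum( const float* a, size_t n ) {
        SumBody body(a);
        tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n, 1000), body );
        return body.sum;
    }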

 1  class Average {
 2  public:
 3      float *input, *output;
 4      void operator()( const blocked_range<int>& range ) const {
 5          for( int i = range.begin(); i != range.end(); ++i )
 6              output[i] = (input[i-1] + input[i] + input[i+1]) * (1/3.0f);
 7      }
 8      ...
 9  };
10
11  const int N = 100000;
12  static int nThreads = 4;
13
14  int main( int argc, char* argv[] ) {
15      float raw_input[N+2], output[N];
16      raw_input[0] = 0;
17      raw_input[N+1] = 0;
18      float* padded_input = raw_input + 1;
19      ... /* Initialization not shown */
20      task_scheduler_init init(nThreads);
21      Average avg(padded_input, output);
22
23      parallel_for( blocked_range<int>( 0, N, 1000 ), avg );
24
25      return 0;
26  }

Fig. 2. TBB implementation of the Average algorithm

This section describes them and highlights the key differences between the TBB and HTA implementations using some snippets of code.

A. Average

This algorithm calculates, for each element of a vector, the average of the previous element, the next element and the element itself, and stores the result in an output vector. It can be parallelized with the TBB library using the parallel_for construct. The TBB code that implements this algorithm is shown in Figure 2. In this code, the first and the last elements of the array are special cases, since they have no previous and next element, respectively. This is solved by adding elements filled with zeros at the beginning and the end of the array, as shown in lines 15-18 of the code. In line 20, the task scheduler object is created and initialized with 4 threads. The task scheduler is the engine in charge of mapping tasks to physical threads and of the thread scheduling. It must be initialized before executing any TBB parallel construct. The first argument of the parallel_for in line 23 is a range that covers the whole vector. A grain size of 1000 is advised in this case. The second argument is an object of the class Average, which encapsulates the operation to be executed by the parallel_for. This class is defined in lines 1 through 9. The operator() method of this class defines the operation that will be applied to each subrange. The low and high values of the indices for each subrange are extracted directly from the range parameter using the begin() and end() methods (see line 5).

The HTA implementation of this algorithm is shown in Figure 3. The data structures are created in lines 17-20. The padding values are automatically generated and filled in the input HTA thanks to overlapped tiling. Line 18 defines an object that describes the overlapping of the tiles of input. The first two arguments of the constructor specify that the shadow regions have size one in both the positive and the negative direction. This constructor allows a third optional argument that specifies whether the boundary region built around the array is filled with zeros, which is the default behavior when nothing is specified, or it

 1  typedef HTA<float, 1> HTA_1;
 2  #define T1(i) Tuple<1>(i)
 3
 4  struct Average {
 5      void operator()( HTA_1 input, HTA_1 output ) const {
 6          for( int i = 0; i != input.shape().size()[0]; ++i )
 7              output[i] = (input[i-1] + input[i] + input[i+1]) * (1/3.0f);
 8      }
 9  };
10
11  const int N = 100000;
12  static int nTiles = 4;
13
14  int main( int argc, char* argv[] ) {
15      Traits::Default::init( argc, argv );
16
17      Seq< Tuple<1> > tiling( T1(N / nTiles), T1(nTiles) );
18      Overlap<1> ol( T1(1), T1(1) );
19      HTA_1 input  = HTA_1::alloc(1, tiling, ol, NULL, ROW);
20      HTA_1 output = HTA_1::alloc(1, tiling, NULL, ROW);
21      ... /* Initialization not shown */
22
23      input.hmap(Average(), output);
24
25      return 0;
26  }

Fig. 3. HTA implementation of the Average algorithm

 1  ...
 2  if ( input1_size > GRAINSIZE ) {
 3      size1 = input1.shape().size()[0];
 4      size2 = input2.shape().size()[0];
 5
 6      if ( size1 < size2 ) {
 7          h2 = input1; h1 = input2;
 8          std::swap( size1, size2 );
 9      } else {
10          h1 = input1; h2 = input2;
11      }
12
13      int pos = h2.lower_bound( h1[(size1 - 1) / 2] );
14
15      h1.part( (size1 - 1) / 2 );
16      h2.part( pos );
17      output.part( pos + (size1 - 1) / 2 );
18
19      output.hmap(Merging(), h1, h2, 0);
20      ...
21  } else {
22      ...

(b) HTA version

Fig. 5. Parallel Merge

 1  for ( int i = 1; i ...
    ...
11                  if ( k > max_size ) {
12                      max_size = k; max_pos = j;
13  }}}
14
15      max_array[i] = max_size;
16      pos_array[i] = max_pos;
17  }}
18  ...
19  };
20  ...
21  parallel_for( blocked_range<size_t>( 0, to_scan.size(), 100 ),
22                SubStringFinder( to_scan, max, pos ) );
23  ...

(a) TBB version

 1  struct SubStringFinderOp {
 2      void operator()( HTA<size_t,1> max_, HTA<size_t,1> pos_ ) {
 3          ...
 4          init_i = lower_bound_0;
 5          end_i = init_i + max_.shape().size()[0];
 6
 7          int pos = 0;
 8          for ( size_t i = init_i; i != end_i; ++i ) {
 9              int max_size = 0, max_pos = 0;
10              for ( size_t j = 0; j < str.size(); j++ ) {
11                  if ( j != i ) {
12                      int limit = str.size() - ( i > j ? i : j );
13                      for ( int k = 0; k < limit; ++k ) {
14                          if ( str[i + k] != str[j + k] ) break;
15                          if ( k > max_size ) {
16                              max_size = k; max_pos = j;
17              }}}}
18              max_[pos] = max_size;
19              pos_[pos] = max_pos;
20              pos++;
21  }}};
22  ...
23  max.hmap(SubStringFinderOp(), pos);
24  ...

(b) HTA version

Fig. 6. Substring Finder

variable split. This argument is used by the TBB library to flag a Range constructor that splits an input Range in two. The constructor builds a new range that stores one of the halves of the original Range and modifies the original Range, received as its first parameter, to hold the other half. This constructor performs steps 2-5 of the algorithm. The other constructor is a conventional constructor. The basic operation simply performs the merge sequentially by means of std::merge.

The HTA version is based on hmap. In the function applied by hmap, if the sequences are bigger than a given threshold, steps 2-5 are implemented. This part of the algorithm, shown in Figure 5(b), is implemented using the dynamic partitioning feature. Lines 15-17 add new partitions to the two input HTAs and the output HTA at the points selected as described in step 3 of the algorithm. This is done using the part method, which accepts the position at which a new partition line is to be added, giving rise to two separate tiles. The position where key would fall in the second sequence, mentioned in step 4 of the algorithm, is calculated in line 13 using the HTA function lower_bound, which returns the index of the first element of the HTA that is equal to or larger than its argument. Line 19 calls hmap recursively with the repartitioned structures. In this call, hmap applies its functor argument to each chunk in parallel. Afterwards, these partitions are removed using a method called rmPart. The recursion finishes when the sequences to merge are smaller than a given threshold, at which point step 1 is performed.
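The kind of ad-hoc Range used by the TBB version can be sketched as follows. This is a hypothetical illustration written from the description above, not the exact listing from [2]; the names ParallelMergeRange, ParallelMergeBody and grainsize are ours:

    #include <algorithm>
    #include "tbb/parallel_for.h"

    // Hypothetical sketch of a splittable Range for the parallel merge.
    template <typename Iterator>
    struct ParallelMergeRange {
        static const long grainsize = 1000;   // threshold below which we merge serially
        Iterator begin1, end1;                // first sorted sequence
        Iterator begin2, end2;                // second sorted sequence
        Iterator out;                         // start of the output for this range

        bool empty() const { return (end1 - begin1) + (end2 - begin2) == 0; }
        bool is_divisible() const {
            return std::min(end1 - begin1, end2 - begin2) > grainsize;
        }

        // Conventional constructor: covers the whole merge problem.
        ParallelMergeRange(Iterator b1, Iterator e1, Iterator b2, Iterator e2, Iterator o)
            : begin1(b1), end1(e1), begin2(b2), end2(e2), out(o) {}

        // Splitting constructor, flagged by the dummy argument of type tbb::split.
        // It keeps one half of the work in r and takes the other half (steps 2-5).
        ParallelMergeRange(ParallelMergeRange& r, tbb::split) {
            if (r.end1 - r.begin1 < r.end2 - r.begin2) {   // make sequence 1 the longer one
                std::swap(r.begin1, r.begin2);
                std::swap(r.end1, r.end2);
            }
            Iterator m1 = r.begin1 + (r.end1 - r.begin1) / 2;       // middle key of sequence 1
            Iterator m2 = std::lower_bound(r.begin2, r.end2, *m1);  // where it falls in sequence 2
            begin1 = m1;  end1 = r.end1;
            begin2 = m2;  end2 = r.end2;
            out = r.out + (m1 - r.begin1) + (m2 - r.begin2);
            r.end1 = m1;
            r.end2 = m2;
        }
    };

    // Body: merges the two subsequences of a leaf range serially (step 1).
    struct ParallelMergeBody {
        template <typename Iterator>
        void operator()(ParallelMergeRange<Iterator>& r) const {
            std::merge(r.begin1, r.end1, r.begin2, r.end2, r.out);
        }
    };

    // Usage: tbb::parallel_for(ParallelMergeRange<int*>(a, a+na, b, b+nb, c), ParallelMergeBody());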


D. Substring Finder

In this code, for each position in a given string, the program finds the length and location of the largest matching substring elsewhere in the string. For instance, take the string flowersflows. Starting the scan at the first character, at position 0, the largest match is flow at position 7, with a length of 4 characters. The position and length of those matches are stored for each position of the string. The parallelization strategy consists of searching for the largest matching string for each position of the scanned string in parallel. The TBB version uses a parallel_for, while the HTA version uses an hmap. The codes, shown in Figures 6(a) and 6(b), are very similar. The operation performed in parallel is the same in both cases; the only difference is the indexing of the data structures, as in the previous codes. In the HTA version, the max and pos arrays, where the result will be stored, are divided into tiles, and the hmap operation is applied separately to each tile, so the indexing is relative to the start position of the current tile.

...
class tbb_parallel_task {
    ...
    static void set_values( Matrix* source, char* dest ) {
        ...
        m_source = source;
        m_dest = dest;
        ...
    }
    ...
    void operator()( const blocked_range<int>& r ) const {
        ...
        begin = (int) r.begin();
        end = (int) r.end();
        Cell cell;
        for ( int i = begin; i < end; ++i ) {
            ... ( m_source->data, m_source->width, m_source->height, i );
        }
    }
    ...
};
...
for ( int counter = 1; counter ...