A New Parallelism Management Scheme for Multiprocessor Systems


Xavier Verians¹, Jean-Didier Legat¹, Jean-Jacques Quisquater¹, and Benoit Macq²
¹ Microelectronics Laboratory, Université Catholique de Louvain
² Telecommunications Laboratory, Université Catholique de Louvain
Place du Levant 3, 1348 Louvain-la-Neuve, Belgium

Abstract. Current multimedia and signal processing applications are becoming more complex. This paper proposes a new parallelism management scheme that can explicitly deal with complex and general parallelism patterns. The parallelism description is based on a task flow graph representation interlaced with control commands. A graph management algorithm is proposed to extract eligible tasks and implement synchronization operations. We show that this management outperforms classical parallelism extraction for complex applications. Moreover, the parallelism description can be inserted directly into sequential programs without deep code modifications.

Keywords: parallelism, task graph, task scheduling, multiprocessors

1 Introduction

Many applications have to be parallelized to be implemented on multiprocessor systems. The parallelization task can be split into two steps: the first step consists in the algorithm parallelization, and the second step translates the extracted parallelism into the system's parallel syntax. This second part is not trivial and often reduces the available parallelism in order to respect syntax constraints. More particularly, new multimedia and image processing applications (MPEG-4 codecs, videophony, ...) are based on highly complex structures. Data are divided into small blocks and, taking some block statistics into account, a kernel is selected and executed. Consequently, these applications are composed of kernels linked together by a complex control structure. Therefore, the available parallelism moves from the kernel level, where smaller data sets are used, to the control level, which allows several kernels and statistical computations to be executed simultaneously. Due to the close interaction between parallelism and program control flow, this parallelism can only be exploited with a flexible description.

Finally, loops are often neither completely parallel nor fully sequential. Some loop body parts need to be executed in the loop order while other parts are completely independent. The corresponding task graph is close to an oriented meshed structure (Fig. 1). Usually, this partial parallelism is ignored because classical fork/join commands cannot easily describe such parallelism patterns. In the best cases, loops are unrolled to partially exploit the available parallelism, or an explicit pipelined implementation is programmed. Both solutions require the programmer to significantly modify the program code to match the parallelism syntax.

Fig. 1. Partially parallel loop (example taken from the LU algorithm). Left: pseudo-C loop; right: corresponding task graph. Ai is the i-th instance of A.

These observations led us to study a more general parallelism representation, able to describe various parallelism patterns yet simple enough to be handled at a fair cost by a multiprocessor system. The proposed parallelism management scheme is based on the following characteristics:

- A separation of the parallelism description and the main control structures from the computation code. This relieves the processing units (PUs) of the parallelism management task.
- An original parallelism description, with a complexity adapted to the extracted parallelism levels, able to describe general parallelism patterns. Complex parallelism patterns allow tasks to be scheduled sooner in the process. Moreover, it is possible to extract parallelism from structured sequential programs without major code modifications to comply with a strict syntax.
- An original task management algorithm to quickly determine tasks ready to be executed and to perform efficient synchronization operations.

The parallelism management has been designed to be implemented on small-scale systems able to execute up to 64 simultaneous tasks. The system architecture is not imposed, provided that a centralized parallelism management and a shared memory are supported. The system can be a (simultaneous) multithreaded processor, where tasks are used to hide remote or memory access latency, or a multiprocessor system with simple RISC or complex superscalar processors.

This paper is structured as follows. The next section presents the multiple aspects of our parallelism description. It begins with some classical theoretical considerations before describing the parallelism representation. A study of the parallelism levels is presented in subsection 2.3. The proposed task management algorithm is developed in section 3. For clarity, we suppose in section 2 and subsection 3.1 that the program does not contain branches or loops between tasks. We show in subsection 3.2 that the generalization to high-level branches or loops does not modify the task management algorithm and simply requires an additional command interpreter. Benchmark simulations are presented in section 4. Results show that for applications with massive parallelism, our task management yields performance similar to that of current multiprocessor techniques. However, for applications with more complex parallelism structures, the performance increase reaches up to 12%. The last section briefly discusses the original characteristics of the presented system compared with other parallelism management techniques.
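
Fig. 1 itself is not reproduced in this text-only version. As a rough, hypothetical sketch of the kind of partially parallel loop it refers to (loosely modelled on a blocked LU-style iteration, not the authors' code), part A of iteration i must follow A of iteration i-1, while the B parts depend only on their own A and can therefore overlap across iterations:

```c
/* Hypothetical sketch of a partially parallel loop (cf. Fig. 1).
 * A(i) must run after A(i-1); B(i, j) depends only on A(i), so the
 * B parts of several iterations may execute concurrently, giving
 * the "meshed" task graph described above. */
#include <stdio.h>

#define N 4

static void A(int i)        { printf("A%d\n", i); }        /* sequential chain  */
static void B(int i, int j) { printf("B%d,%d\n", i, j); }  /* independent parts */

int main(void)
{
    for (int i = 0; i < N; i++) {
        A(i);                          /* depends on A(i-1)    */
        for (int j = i + 1; j < N; j++)
            B(i, j);                   /* depends only on A(i) */
    }
    return 0;
}
```

Written sequentially like this, classical fork/join constructs cannot easily express that B(0, j) may run alongside A(1); the task-graph description of section 2 can.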

2 Parallelism Description

2.1 Theoretical Aspects

Program parallelism can be represented with a task graph. The program is divided into several tasks connected by dependence links. A task is an instruction set that will be executed on a single processing unit. Tasks can communicate and interact through the shared memory. Their size and number of dependences are not limited. They can contain jumps, function calls and conditional branches, provided all jumps reference an address included in the same task.

A task graph [1] is a directed graph consisting of a set of vertices (tasks) V and a set of arcs (dependences) A on the vertices. If the arc (t_i, t_j) ∈ A, then task T_i must complete execution before task T_j can start execution. We say that task T_i is a preceding task of T_j. Arcs (t_i, t_j) are also called incoming arcs of T_j. The notation used is the following: capital letters designate tasks and lower-case letters designate arcs. The distance between two tasks d(T_i, T_j) is defined as the minimum number of arcs to follow to go from a task T_k ∈ V to both T_i and T_j.

The task graph defines a task ordering. This ordering is not strict, as parallel tasks do not need to be executed one before another. A task T_j is eligible for execution only if all tasks T_i such that (t_i, t_j) ∈ A have been completed. At run time, the task graph is unrolled to produce a directed acyclic graph (DAG), called the dynamic task graph [1]. Due to its limited resources, the processor only sees a slice of this graph, called the task window. The task window holds all tasks already decoded by the system but waiting for their execution to be launched or completed.
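
As a minimal illustration of the eligibility rule only (the queue-bank mechanism actually used by the task manager is described in section 3), a task can be viewed as carrying a count of unsatisfied incoming arcs; the structure and names below are assumptions made for this sketch:

```c
/* Hypothetical sketch: a task becomes eligible once all of its
 * preceding tasks (incoming arcs) have completed. */
#include <stdbool.h>

#define MAX_SUCCS 8

struct task {
    int          pending_preds;        /* incoming arcs not yet satisfied */
    int          n_succs;
    struct task *succs[MAX_SUCCS];     /* outgoing arcs (t_i, t_j)        */
};

bool task_is_eligible(const struct task *t)
{
    return t->pending_preds == 0;
}

/* Called when task t completes: each successor loses one incoming arc. */
void task_complete(struct task *t)
{
    for (int k = 0; k < t->n_succs; k++)
        t->succs[k]->pending_preds--;
}
```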

2.2 Task Program

Most current systems insert parallelism commands in the program code [2][3]. The processing units execute tasks and, when a parallel command is encountered, they execute it and update the eligible task queue. The time lost by the processing units in handling parallelism commands leads to an overhead that becomes significant for complex parallelism structures. Moreover, the task creation order depends on the task execution, for example when several connected tasks are created in different parallel threads. This can lead to complex protocols to identify synchronization edges between tasks [4].

Fig. 2. Left: task graph example; right: corresponding task program.

We propose another approach, consisting in separating the parallelism information from the program code. This separate parallelism description, called the task program, is sent directly to the task manager, relieving the processing units of the parallelism decoding. The communications between the task manager and the processing units are reduced. Besides, the parallelism decoding can be performed in parallel with task execution. Finally, as the parallelism description syntax is independent of the task code, the syntax complexity can easily be adapted to the exploited parallelism level.

The multiple tasks defined in the program code create a static dependence graph containing loops and conditional parts. As the task program is also static, it needs to include some commands to implement loops and branches among tasks. At run time, these commands are decoded to build the correct dynamic task graph. For clarity, we treat this point later, in section 3.2. We first suppose that the static task graph is already a directed acyclic graph; the static graph and the dynamic task graph are in this case identical.

The dynamic task graph decoding is performed by a sequential reading of the task program. A first constraint on the task program is that the task order has to be respected, i.e. for each task T_i, all preceding tasks T_j are encoded before T_i in the task program. Consequently, the task program is simply an ordered topological description of the task graph (Fig. 2). It can represent all the parallelism patterns that can be drawn as a directed acyclic graph. The syntax is simple: each line describes a task T_i and its incoming arcs (t_j, t_i). As t_i is the same for all the incoming arcs, it is not written.
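
Since Fig. 2 is not reproduced here, the following hypothetical sketch shows one possible in-memory form of such an ordered topological description: each entry names a task and its incoming arcs, and every predecessor appears before its successors. The identifiers and layout are illustrative assumptions, not the authors' encoding.

```c
/* Hypothetical encoding of a task program: an ordered list of task
 * descriptors, each naming its incoming arcs (preceding tasks).
 * The order is topological: predecessors always appear first. */
#define MAX_PREDS 4

struct task_descr {
    int id;                     /* task identifier T_i            */
    int n_preds;                /* number of incoming arcs        */
    int preds[MAX_PREDS];       /* identifiers of preceding tasks */
};

/* Example graph: T1 -> T2, T1 -> T3, {T2, T3} -> T4 */
static const struct task_descr task_program[] = {
    { 1, 0, {0} },              /* T1: no incoming arcs           */
    { 2, 1, {1} },              /* T2: depends on T1              */
    { 3, 1, {1} },              /* T3: depends on T1              */
    { 4, 2, {2, 3} },           /* T4: depends on T2 and T3       */
};
```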

2.3 Extended Parallelism Description

The limited size of the task window can prevent the processor from extracting all the available parallelism. The simple topological graph description therefore has to be extended to allow an efficient parallelism extraction. An analysis of multimedia and signal processing applications shows three parallelism levels of increasing complexity:

1) Low-level parallelism: Signal processing applications are often based on small, easily parallelizable kernels (DCT, quantization, filtering, ...). These kernels can often be described by a single task executed many times. A tag identifies this task in the task program, allowing the task manager to handle it as a simple task. When it becomes eligible, it is attached to a dedicated resource, the task server, which stores some loop state bits and the number of processors executing the loop, and detects the loop completion (see the sketch after this list).

2) Medium-level parallelism: This is the parallelism existing between close tasks in the dynamic task graph, "close" being defined with the above distance definition. Close tasks will be present at the same time in the task graph, allowing the processor to extract all the available parallelism. The topological graph description is sufficient for this level.

3) High-level parallelism: It often happens that multiple processing stages can be performed in parallel. If each stage is described by a large number of tasks, a sequential decoding cannot provide a comprehensive view of the global parallelism structure. A simultaneous decoding of different task graph parts is necessary. High-level commands, such as classical fork/join commands [5], are added to the task program. The key difference with other systems is that these commands are only used at a high level, where the command management overhead is negligible compared with the execution time of the multiple task sets.
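
As a hedged sketch of the bookkeeping such a task server might perform for a low-level loop (the fields and functions are assumptions, not the authors' design), the server hands out iterations, counts the processors executing the loop, and signals completion when both counts reach zero; synchronization is omitted for brevity:

```c
/* Hypothetical task-server state for a low-level parallel loop.
 * A real implementation would update these counters atomically. */
#include <stdbool.h>

struct task_server {
    int iterations_left;    /* iterations not yet handed out       */
    int active_procs;       /* processors currently running bodies */
};

/* A processor asks for one iteration; returns false if none is left. */
bool server_grab_iteration(struct task_server *s)
{
    if (s->iterations_left == 0)
        return false;
    s->iterations_left--;
    s->active_procs++;
    return true;
}

/* A processor reports a finished iteration; returns true when the
 * whole loop has completed. */
bool server_finish_iteration(struct task_server *s)
{
    s->active_procs--;
    return s->iterations_left == 0 && s->active_procs == 0;
}
```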

3 Task Management

3.1 Parallelism Extraction Algorithm

The task manager's main job is to quickly determine the eligible tasks, which is equivalent to identifying the entry tasks of the task window graph. Moreover, when a task is completed, the manager has to remove it from the task window and compute the new eligible tasks. We propose an algorithm based on a queue bank. The queue filling is performed in the task decoding order: if T_j needs to be executed after the completion of T_i, it will be added to a queue after T_i. The filling policy is simple: a task T_k is added to the queue q_i only if T_j = tail(q_i) is such that (t_j, t_k) ∈ A. These two rules guarantee that the queues form a covering of the graph, with each queue containing a chain of consecutive tasks.

The task graph structure is explicitly conserved in the queue organization: different parallel branches are stored in different queues. This allows intelligent task scheduling policies, based on a dynamic analysis of the task graph, to be implemented. Moreover, thanks to the task ordering, eligible tasks are only found at the queue heads. The three basic operations on the task window graph are the following:

- add_task(T): Search for a queue q_k such that tail(q_k) = T_j, with (t_j, t) ∈ A. If no queue satisfies the condition, q_k will point to an empty queue. Then all the incoming arcs (t_j, t) are added to the queue, followed by the task identifier. If no empty queue is available, the decoding waits for the execution to proceed further, until a queue is released.

- get_eligible_task(): If a task in a queue depends on other tasks of the task window graph, it is preceded by its incoming arcs. Only eligible tasks are preceded by neither arcs nor tasks and are thus stored at the queue head. Finding an eligible task therefore only requires determining whether the head of a queue contains an arc or a task identifier. When selected, the task is removed from the queue.

- remove_task(T): When a task T completes, all the arcs (t, t_j) are removed from the queues. This is the most expensive operation. However, the critical part is the removal of an arc preventing a new task from becoming eligible, and such arcs are found at the queue heads. Thus the time between a task completion and a dependent task launch is determined by the delay needed to remove the arcs associated with T from the queue heads.

Tasks selected from the task graph are sent to an eligible task queue or, in the case of a low-level loop, to a task server. Currently, no task scheduling policy is implemented. As the eligible queue is similar to the eligible queue of classical multiprocessor systems, except that the synchronization is already performed, classical task scheduling techniques can be implemented.
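
The sketch below gathers the queue bank and the three operations in one place, under stated assumptions: a fixed number of fixed-size circular queues, positive integer task identifiers, arcs stored as negated predecessor identifiers, and arcs only added for predecessors still in the task window. It mirrors the description above but is not the authors' implementation, and manager/processing-unit synchronization is omitted.

```c
/* Hypothetical queue-bank sketch.  Queue entries are ints:
 *   e > 0 : a decoded task identifier
 *   e < 0 : an arc, i.e. an unsatisfied dependence on task -e   */
#include <stdbool.h>

#define NQUEUES 16
#define QCAP    64
#define NO_TASK  0

struct queue {
    int entries[QCAP];
    int head, count;
    int tail_task;                 /* last task id pushed, NO_TASK if none */
};

static struct queue bank[NQUEUES];

static void q_push(struct queue *q, int e)
{
    q->entries[(q->head + q->count) % QCAP] = e;
    q->count++;
}

/* add_task: find a queue whose tail is a predecessor of t, otherwise an
 * empty queue; push all incoming arcs, then the task identifier itself.
 * Returns false when decoding must stall until a queue is released. */
bool add_task(int t, const int *preds, int npreds)
{
    struct queue *q = NULL;
    for (int i = 0; i < NQUEUES && !q; i++)
        for (int p = 0; p < npreds; p++)
            if (bank[i].tail_task == preds[p]) { q = &bank[i]; break; }
    if (!q)
        for (int i = 0; i < NQUEUES && !q; i++)
            if (bank[i].count == 0) q = &bank[i];
    if (!q || q->count + npreds + 1 > QCAP)
        return false;
    for (int p = 0; p < npreds; p++)
        q_push(q, -preds[p]);          /* arcs first ...               */
    q_push(q, t);                      /* ... then the task identifier */
    q->tail_task = t;
    return true;
}

/* get_eligible_task: an eligible task sits at a queue head with no arc
 * in front of it.  Remove and return it, or NO_TASK if none is ready. */
int get_eligible_task(void)
{
    for (int i = 0; i < NQUEUES; i++) {
        struct queue *q = &bank[i];
        if (q->count > 0 && q->entries[q->head] > 0) {
            int t = q->entries[q->head];
            q->head = (q->head + 1) % QCAP;
            q->count--;
            return t;
        }
    }
    return NO_TASK;
}

/* remove_task: when task t completes, delete every arc on t from the
 * queues.  Arcs sitting at a queue head are the ones whose removal may
 * make a new task eligible. */
void remove_task(int t)
{
    for (int i = 0; i < NQUEUES; i++) {
        struct queue *q = &bank[i];
        int kept = 0;
        for (int k = 0; k < q->count; k++) {
            int e = q->entries[(q->head + k) % QCAP];
            if (e != -t)
                q->entries[(q->head + kept++) % QCAP] = e;
        }
        q->count = kept;
    }
}
```

The full scan in remove_task matches the remark above that arc removal is the most expensive operation; only the arcs sitting at queue heads are on the critical path.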

3.2 Generalization to High-Level Commands

Up to now, we have supposed that the task program was free of loops, branches and high-level parallel commands. We now introduce these control instructions and show that they only require a small command interpreter to be handled by the task manager; the previous task management algorithm remains valid.

High-level commands can be considered as particular tasks executed by the task manager in place of the processing units. If these commands are executed when they are decoded, they are simply inserted in the task program. If they depend on some tasks, they are added to the task graph, and the task program is modified accordingly. The task graph management is not modified. The additional complexity is limited to the command interpreter, which remains simple as the number of commands is limited. An example of a static task graph and the derived task program is presented in figure 3.

High-level branches dynamically modify the task graph. They are considered as small tasks executed by the manager to modify the task decoding order. Branches are predicted so that the most probable side is decoded speculatively. A branch tag is attached to the queues to allow a possible invalidation. When a branch becomes eligible, it is resolved. In case of a wrong prediction, the tagged queues are invalidated and the decoding restarts on the other branch side. The queue invalidation can be performed in parallel with the task decoding, so its cost is hidden from the processing units.

Several additional commands are defined to improve performance or to reduce some parallelism management operations. The use of backward branches allows several tasks to be executed multiple times. In order to distinguish between the different task instances, tasks and arcs are renamed at the decoding stage to guarantee that each instance has a unique name in the task window. As the tasks in the task program are ordered, the arcs (t_i, t_j) where T_i has not been renamed are ignored. This guarantees the absence of dependence loops in the execution graph.
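
A possible sketch of the instance renaming mentioned above (the per-task counter and the packed naming are assumptions, not the authors' scheme): each static task identifier keeps an instance counter, and every decoded instance receives a unique dynamic name in the task window.

```c
/* Hypothetical renaming at decode time: each static task id keeps an
 * instance counter so every dynamic instance gets a unique name
 * (e.g. A1, A2, ... for successive loop iterations). */
#define MAX_STATIC_TASKS 256

static unsigned instance_count[MAX_STATIC_TASKS];

/* Returns a unique dynamic name for the next instance of static task s. */
unsigned rename_instance(unsigned s)
{
    unsigned inst = ++instance_count[s];
    return (s << 16) | (inst & 0xFFFF);   /* pack (static id, instance) */
}
```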

Fig. 3. Left: static task graph; right: task program. The fork/join commands are not necessary here, as the two task graph parts are small; they are shown only for illustration purposes.

4 Simulations

The task manager has been simulated in a simple multiprocessor architecture. The processing units are simple RISC processors. As the goal is to measure the efficiency of the parallelism extraction, the system is assumed to have ideal buses and memories. The relative speedup is defined as the ratio of the single-processor execution time of the parallel program to the measured execution time. The normalized speedup is defined as the ratio of the execution time of an equivalent sequential program on a single processor to the measured execution time. The compiler used is based on the SUIF compiler system [6], extended with Harvard's MACHINE library [7]. Several passes have been added to handle the task and parallelism extraction.

Figure 4a displays the system performance on SPLASH and SPLASH-2 benchmarks. The overall performance is close to the ideal performance given in [2][8], except for the LU benchmark, where the speedup is higher. These SPLASH applications mainly have simple parallelism structures, and the data sets are large enough to hide the parallelism management overheads. The performance increase for the LU benchmark is achieved by exploiting the complex parallelism of the outermost loop. The gain is more visible when executing the program on smaller data sets (Fig. 4b). The parallelism extraction has been carried further than in the SPLASH benchmark, without modifying the underlying algorithm. Indeed, LU iterations are partially dependent, and the SPLASH version does not extract this parallelism. The iterations have been sliced to extract the partial meshed-like parallelism (Fig. 1), resulting in a "pipelining" of the multiple iterations across the system. Results show an improved performance in all cases. With little parallelism available (>256 blocks), results are improved by about 12%. The instruction overhead is limited to less than 2%.

Fig. 4. a) SPLASH benchmark performance: WATER (64 molecules), LU (256x256 matrix), MP3D (3000 molecules), FFT (2^14 points); b) LU performance with a 64x64 matrix. SL denotes classical single-level parallelism exploitation, ML the proposed multi-level parallelism exploitation.

The LU benchmark has also been simulated on a system with 2 cache levels (latencies: 1/3/10 cycles) and an atomic bus (latency: 1 cycle) (Fig. 5a). Despite higher miss rates and no data locality optimizations, our parallelism management still produces better performance. The main overhead is due to the memory latency; the implementation of a classic data locality policy would sharply reduce it. With more than 32 PUs, the locking overhead rises, as the critical sections become significant. However, the computation overhead remains negligible despite the deeper parallelism exploitation (idle time divided by 2.2).

Finally, simulations have been performed on actual image processing programs (Fig. 5b). We used C programs written for sequential systems and parallelized them with a few parallel command insertions and minor code modifications. The first application is an MPEG-2 decoder. Results are good up to 24 processors; beyond that, the relative flatness of the curve shows that the parallelism is fully exploited, as the bitstream decoding limits the maximum speedup. The second application is an image processing chain for a visual prosthesis [9]. It contains a highly sequential segmentation algorithm based on a zone growth, and a histogram equalization with a significant critical section. The theoretical parallelism available for a zone growth is about 2: two zones are allowed to grow in parallel if they do not overlap; otherwise, one zone is squashed. We also achieve good performance, with a higher algorithmic overhead (about 10%) due to the zone squashing. On a classical multiprocessor system, these programs would need major rewriting to exploit some of the available parallelism.

Fig. 5. a) Left histogram and speedup A: classical parallelism management; right histogram and speedup B: proposed parallelism management; TM denotes the Task Manager. b) Image processing benchmarks.

5 Related Work

A large number of static and dynamic scheduling techniques have been proposed to exploit parallelism on multiprocessor systems. Static scheduling techniques [10][11] are essentially based on a thorough analysis of the program dependence graph. The compiler uses its global view of the program structure to modify the graph and to determine when tasks can be executed and to which processor they will be allocated. Their advantage is the ability to use complex algorithms to compute near-optimal schedules. However, they are not suited to data-dependent applications. Dynamic scheduling techniques [4][12] mainly use the eligible task queue to compute an efficient schedule. They compute or read several statistics for each task and select a task matching some criteria. The drawback of these methods is that they do not use the information contained in the future tasks of the dependence graph. Moreover, as only eligible tasks are taken into account, synchronization and parallelism management operations cannot be performed or prepared in advance, requiring complex, time-consuming operations as in [4].

Hamidzadeh [13] and Johnson [1] use a partial representation of the task graph to compute a partial schedule. They use heuristics or cost functions to determine an efficient schedule. However, Hamidzadeh does not study the data structures needed to implement the schedule computation efficiently. Johnson uses a hash table to store the task graph, which speeds up search operations; however, the hash table also scatters the graph structure, preventing efficient graph management operations. In contrast, our queue bank stores the graph in a structured fashion. The parallel task streams are clearly identified, and graph-based analyses inspired by static scheduling techniques can be adapted to extract a partial schedule. The eligible tasks can be sent to a pool implementing classical load balancing techniques [12].

6 Conclusion

We have presented a new parallelism management scheme that allows complex and general parallelism patterns to be exploited in multiprocessor systems. The generality is achieved by representing the parallelism with a topological description. The separate parallelism description allows a high pattern complexity to be handled with a limited dynamic overhead: the parallelism is no longer spread throughout the program code and is decoded in parallel with task execution. The advantages of the parallelism description are dynamically exploited by a co-designed task manager, which stores a structured representation of the task dependence graph and thereby implements graph management operations and task synchronization efficiently.

Simulations validate the parallelism management on scientific and image processing applications. For highly parallel applications, performance is similar to that of current multiprocessor parallelism exploitation. However, when the application presents more complex parallelism patterns, performance can be increased by up to 12%, as for the LU SPLASH benchmark, compared with classical parallelism management. Furthermore, the generality of the syntax means that no major program rewriting is required to describe the parallelism.

Future work will address hardware/software implementation issues. Load balancing and data locality techniques will be added to enhance practical performance. A second research direction is the automation of the parallelization step.

References

[1] Johnson, T., Davis, T., Hadfield, S.: A Concurrent Dynamic Task Graph. Parallel Computing, Vol. 22, No. 2, February 1996, 327–333
[2] Singh, J.P., Weber, W.-D., Gupta, A.: SPLASH: Stanford Parallel Applications for Shared Memory. Computer Architecture News, 20(1), July 1994, 5–44
[3] Chandra, R., Gupta, A., Hennessy, J.L.: Integrating Concurrency and Data Abstraction in the COOL Parallel Programming Language. IEEE Computer, February 1994
[4] Rost, J., Markus, F.-J., Li, Y.-H.: "Agency Scheduling": A Model for Dynamic Task Scheduling. Proc. of the Int. Conf. on Parallel Processing, Euro-Par 95, Sweden
[5] Nikhil, R.S., Papadopoulos, G.M., Arvind: *T: A Multithreaded Massively Parallel Architecture. Int. Symp. on Computer Architecture, 1992, 156–167
[6] Stanford Compiler Group: The SUIF Library. Stanford University, 1994
[7] Smith, M.D.: The SUIF Machine Library. Harvard University, 1997
[8] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The SPLASH-2 Programs: Characterization and Methodological Considerations. Proc. of the 22nd Ann. Symp. on Computer Architecture, June 1995, 24–36
[9] Gilmont, T., Verians, X., Legat, J.-D., Veraart, C.: Resolution Reduction by Growth of Zones for Visual Prosthesis. Int. Conf. on Image Processing, 1996, 299–302
[10] Shirazi, B., Wang, M.: Analysis and Evaluation of Heuristic Methods for Static Task Scheduling. J. of Parallel and Distributed Computing, Vol. 10, 1990, 222–232
[11] Park, G.-L., Shirazi, B., Marquis, J., Choo, H.: Decisive Path Scheduling: A New List Scheduling Method. Proc. of the Int. Conf. on Parallel Processing, 1997, 472–480
[12] Dandamudi, S.P., Cheng, P.S.: A Hierarchical Task Queue Organization for Shared-Memory Multiprocessor Systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 1, January 1995, 1–16
[13] Hamidzadeh, B., Lilja, D.J.: Dynamic Scheduling Strategies for Shared-Memory Multiprocessors. Proc. of the 16th Int. Conf. on Distributed Computing Systems, Hong Kong, 1996, 208–215
