
Pipelining Router Design Improves Parallel System Performance

C. Carrión, J. A. Gregorio, R. Beivide

Dpto. Informática, Universidad de Castilla-La Mancha, 02071 Albacete, Spain
phone: +34 967 599200, fax: +34 967 599224, [email protected]

Dpto. Electrónica y Computadores, Universidad de Cantabria, 39005 Santander, Spain
phone: +34 942 201399, fax: +34 942 201873, {...}@atc.unican.es

Abstract - Efficient communication when fetching remote data is a critical parameter in distributed shared-memory multiprocessors (DSM) in order to achieve high performance. Message-passing techniques are used in many modern communication systems, and routers are essential building blocks of these communication systems. Hence, in this paper emphasis is placed on the design of routers for k-ary n-cube networks. Based on a simple deadlock-free algorithm, we analyze the influence of the router structure. To be more precise, the parameters considered are the clock frequency and the number of pipeline stages of the router. The performance evaluation for DSM applications shows that there are significant gains in using segmented router designs. In our evaluations, results show an improvement of up to 12% in the execution time of some applications. This improvement occurs even though the base latency of the router has increased by 40%.

Index Terms - DSM systems, k-ary n-cube networks, bubble algorithm, pipelined router design

I. INTRODUCTION

Emphasis is placed on the design of the routers for parallel machines. Our purpose is to give some clues to the architect on how to implement the modules of the router in order to achieve an efficient communication network. Looking at the mature area of processor design, we can see that instruction-level parallelism is the key to obtaining the highest-performance processors. Pipelining is basically the implementation technique used to overlap instructions, and it yields a reduction in the average execution time per instruction. Although the execution time of each instruction increases due to the overhead of controlling the pipeline, program execution time is reduced because more instructions are completed per unit of time. Thus, by analogy with the architectural design of processors, we might assert that network performance could be improved by using pipelining techniques. Nevertheless, there is no clear tendency in router design. For instance, the TRC [8] used asynchronous logic, the Chaos router [2] uses synchronous logic with a period of 15 ns and a base latency through the router of 4 cycles, while the SP2 router has 16 cycles [3].

Thus, the basic goal of this work is to analyze what conditions have to be fulfilled by the router design parameters so that, while maintaining the same functionality, the final system performance can be improved. Specifically, the parameters analyzed are the clock frequency and the number of pipeline stages of the router. The CAD tool Synopsys [17], together with the standard MIETEC 0.35 µm library, was employed to obtain the main characteristics of the routers. We have focused on the design of a router for k-ary 2-cube networks, and the routing mechanism employed is the bubble algorithm [5] [6]. This algorithm is based both on dimension-order routing (DOR) [8] and on a restricted injection policy extended to the dimension changes.

Note that the success of segmentation in processor design is due to the correlation of program instructions and the fact that programs generate enough data flow to keep the pipeline stages full. However, the traffic generated by applications on parallel systems is not well known, which makes it more difficult to draw conclusions about router designs. In this work, this problem has been addressed by using a register-level time-driven simulator that models the processor, the memory and the network behaviour [14] [19]. Analyzing the results of running DSM applications, we have verified that the execution time is reduced when pipelined router designs are used. The higher the router bandwidth, the better the performance of the DSM system, and this occurs even though the base latency of the router increases.

The structure of this paper is as follows. First, we fix the properties of our work environment. The basic router architecture is described in Section II. Next, in Section III, some router architecture modifications are proposed. Then, in Section IV, the performance of DSM systems is presented and discussed. Finally, in Section V, we assess all the results obtained in order to draw the most important conclusions.

II. GETTING A ROUTER DESIGN

In keeping with the basic level of our study, we have concentrated on the Deterministic Bubble Algorithm for the two-dimensional torus because of its inherent attractiveness. First of all, low-dimension topologies obtain good performance, as demonstrated by the fact that most current parallel machines are built with two- or three-dimensional networks [1] [3] [16].

(This work has been done with the support of the Spanish CICYT under grants TIC97-0897-C04 and TIC98-1162-C02-01.)



Secondly, although almost all current routers implement virtual channels, and several of them also incorporate adaptive routing, our router architecture does not implement these features. This reduces the hardware cost and does not modify the conclusions of our evaluation. The main characteristics of the Deterministic Bubble Algorithm can be summarized as follows:
1. Messages are divided into packets of fixed length. Each packet is composed of several flits (flow-control digits), and the header flit of the packet carries the routing information needed to relay the packet from its source to its destination node.
2. Dimension-order routing (DOR) [8] is used together with virtual cut-through flow control [13]. A header flit is split out each time the packet changes dimension.
3. No virtual channels are required to prevent deadlock [9] [10] [15].
4. A restriction based on checking free buffer space for at least two packets is applied whenever a traveling packet requires a change of dimension [5].
In all the evaluations presented later we have assumed a bidimensional 8x8 torus, with an inter-node datapath of 32 bits and 6 control signals for the bidirectional channels. Communication between routers is asynchronous, implementing channel pipelining. Thus, we assume that the link frequency is independent of the internal clock frequency and is only limited by wire and driver characteristics. One of the control signals is used to inform the neighbouring router about the transmission of the end of the packet. There follows a description of the basic hardware modules and some details of their implementation.
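As an illustration only, the following behavioural sketch expresses the routing decision just described: dimension-order routing on a k-ary 2-cube plus the bubble restriction on injection and dimension changes. The function and variable names (dor_output, can_advance, free_slots, PACKET_LEN) are our own assumptions for the sketch, not taken from the paper or its hardware design.

# Behavioural sketch (assumed names, not the authors' hardware) of DOR plus
# bubble flow control on a k x k torus.

K = 8            # radix of the k-ary 2-cube (an 8x8 torus in the paper)
PACKET_LEN = 10  # packet length in flits, as in the synthetic-load evaluations

def dor_output(current, dest):
    """Return the output port ('X+', 'X-', 'Y+', 'Y-' or 'LOCAL') chosen by
    dimension-order routing: correct the X offset first, then the Y offset."""
    (cx, cy), (dx, dy) = current, dest
    for c, d, ports in ((cx, dx, ("X+", "X-")), (cy, dy, ("Y+", "Y-"))):
        offset = (d - c) % K
        if offset:
            # take the shortest way around the ring of this dimension
            return ports[0] if offset <= K // 2 else ports[1]
    return "LOCAL"

def can_advance(input_port, output_port, free_slots):
    """Bubble restriction: a packet that is injected or changes dimension needs
    free space for at least two packets in the requested FIFO; a packet that
    continues along the same dimension only needs space for one packet."""
    same_dimension = input_port[0] == output_port[0]   # 'X*'->'X*' or 'Y*'->'Y*'
    needed_packets = 1 if same_dimension else 2
    return free_slots >= needed_packets * PACKET_LEN

# Example: node (0, 0) injects a packet destined for node (3, 5) on the 8x8 torus.
port = dor_output((0, 0), (3, 5))                        # -> 'X+'
print(port, can_advance("LOCAL", port, free_slots=40))   # -> X+ True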

Fig. 1. Router structure for a k-ary 2-cube network


A. Router architecture


TABLE I
TIME DELAY OF EACH ROUTER BLOCK (NANOSECONDS)

MODULE            TIME      MODULE             TIME
Synchronizer      1.54      FIFO (80 flits)    2.90
Routing           1.88      FIFO (40 flits)    2.38
Crossbar (5x1)    2.14      FIFO (20 flits)    2.09

The organization of the bubble router is shown in Figure 1. To give a global idea of the hardware structure of the router, let us briefly describe the functional modules of the circuit.
1. The input port synchronizes the arriving flits with the internal clock. The protocol between routers has been chosen to avoid the need for an acknowledgement signal for each flit sent. Thus, the input port needs storage capacity for some flits in order to guarantee that none of the possible arriving flits, not even those in flight on the channel's wires, are missed.
2. The input edge FIFO buffer, whose critical path delay depends on the flit width and the FIFO depth.
3. The routing block. It is basically a control block that operates on the header flit, selecting the appropriate output port. The relative address is also updated by this block, either by decrementing the header or by splitting out the first header flit.
4. The crossbar block, which both arbitrates the accesses to the output ports and establishes the corresponding connection paths. The arbitration mechanism works in a round-robin fashion in order to guarantee fair resource assignment. The delay of this component is determined by the arbitration process, which depends on the number of input ports, 2n+1 in our case, n being the dimension of the network.
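As a minimal sketch of the round-robin arbitration policy just mentioned, the arbiter for the 2n+1 = 5 input ports of a 2D torus router can be modelled as follows. The class name and interface are illustrative assumptions for the sketch, not the circuit actually synthesized by the authors.

# Round-robin arbiter sketch for the 2n+1 = 5 crossbar input ports of a 2D
# torus router (X+, X-, Y+, Y-, local injection). Names are illustrative only.

class RoundRobinArbiter:
    def __init__(self, num_ports=5):
        self.num_ports = num_ports
        self.last_granted = num_ports - 1   # so port 0 is inspected first

    def grant(self, requests):
        """requests: one boolean per input port. Returns the granted port index
        (or None). The search starts just after the last granted port, so every
        requester is served within at most num_ports arbitration rounds."""
        for i in range(1, self.num_ports + 1):
            port = (self.last_granted + i) % self.num_ports
            if requests[port]:
                self.last_granted = port
                return port
        return None

# Example: ports 1 and 3 keep requesting and are granted alternately.
arbiter = RoundRobinArbiter()
for _ in range(3):
    print(arbiter.grant([False, True, False, True, False]))   # -> 1, 3, 1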



III. CLOCK FREQUENCY AND PIPELINE STAGES

In this section we present different router designs. Although the basic router architecture is the one shown in Figure 1, we can modify some quantitative parameters, such as the capacity of the input buffers or the number of pipeline stages of the router. From previous studies [5] [6] we know that a good cost-performance choice for the size of the input buffers is 8L, where L is the length of the packets. Thus, we have fixed the capacity of the input buffers to this value for all simulations. Our first router design, Design A, presents a clock frequency of 340 MHz (see Figure 2). This design has a base latency of 4 and 3 cycles for the header and the data flits, respectively.

TABLE II
MAIN CHARACTERISTICS OF THE DIFFERENT ROUTER DESIGNS

Design      Clk period (ns)   Pipe. stages / base latency (cycles)   Base latency, head - data (ns)
Design A    2.90              4                                      11.6 - 8.70
Design B    2.38              5                                      11.9 - 9.52
Design C    2.14              7                                      14.98 - 12.84

Fig. 2. Cycles required to travel through the router. Design A.

In general, the clock speed can be increased by: a) balancing the stages, and b) increasing the number of pipeline stages. So, for this particular design, Design A, the input buffer must be pipelined. By dividing the buffer into two FIFOs of half the capacity we get a new router design. The structure of this router, Design B, is shown in Figure 3. Note that in this case the router clock frequency has been increased to 420 MHz. The base latency of the router for the header flits and the data flits is 5 cycles (11.9 ns) and 4 cycles (9.52 ns), respectively.



Following the router segmentation, and since the FIFO is again the module that determines the clock period, the FIFO has been divided again into two modules with less capacity. Thus, Design C uses 4 FIFO modules with capacity for 20 flits each. In this case, the clock frequency of the router is fixed by the crossbar module at 465 MHz. The number of pipeline stages for the header flits is 7, which translates into a delay of 14.98 nanoseconds.

Note that pipelining the router has pros and cons. In fact, Design B presents an increment in the base latency of the router of 9% with respect to the latency of Design A, but the clock period of Design B is reduced by 18% with respect to Design A. These values are obtained for data flits. Comparing Design C with Design A, we can appreciate an increment of 47% in the base latency and a clock period 26% shorter. Table II shows the main characteristics of the different router designs. Observe that the header flits employ one extra cycle with respect to the data flits in going through the router, due to the routing module.

Fig. 3. Cycles required to travel through the router. Design B.
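The figures in Table II and the percentages quoted above follow directly from the cycle counts and the clock periods (base latency = number of cycles x clock period). The short script below, with our own variable names, simply reproduces that arithmetic as a sanity check.

# Reproduce the Table II figures and the percentage comparisons in the text.
# Tuples: (clock period in ns, data-flit cycles, header-flit cycles).
designs = {
    "A": (2.90, 3, 4),
    "B": (2.38, 4, 5),
    "C": (2.14, 6, 7),
}

for name, (period, d_cyc, h_cyc) in designs.items():
    print(f"Design {name}: {1e3 / period:.0f} MHz, "
          f"head {h_cyc * period:.2f} ns, data {d_cyc * period:.2f} ns")
# Design A: 345 MHz, head 11.60 ns, data 8.70 ns   (the paper rounds to 340 MHz)
# Design B: 420 MHz, head 11.90 ns, data 9.52 ns
# Design C: 467 MHz, head 14.98 ns, data 12.84 ns  (the paper rounds to 465 MHz)

period_a, data_a = designs["A"][0], designs["A"][0] * designs["A"][1]
for name in ("B", "C"):
    period, data = designs[name][0], designs[name][0] * designs[name][1]
    print(f"{name} vs A: data base latency +{(data / data_a - 1) * 100:.1f}%, "
          f"clock period -{(1 - period / period_a) * 100:.1f}%")
# B vs A: data base latency +9.4%, clock period -17.9%
# C vs A: data base latency +47.6%, clock period -26.2%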

IV. PERFORMANCE EVALUATION

Improving interconnection subsystems is crucial for the overall performance of a distributed shared-memory multiprocessor (DSM) system. For that purpose, we have used a simulation tool that has allowed us to obtain the network performance under real applications and under synthetic load patterns. This tool is composed of two parts. On the one hand, RSIM [19] simulates the architectural behaviour of the system; that is, RSIM allows us to configure the main parameters of the processor as well as the memory system of the ccNUMA machine. On the other hand, Sycosys [14] reproduces the network behaviour considering its hardware characteristics. Having explained the tool used to evaluate the network, we now present the results obtained for both synthetic pattern loads and real applications.

A. Synthetic Pattern Loads

The average latency of the packets for an 8x8 torus network under several synthetic traffic patterns and different loads is shown in Figure 4. The length of the packets has been fixed to 10 flits and the input buffers have capacity for 8 packets. Results for random, bit-reversal, perfect-shuffle and transpose traffic show that the network using router Design C is capable of supporting more load without reaching the saturation point. Thus, the performance of DSM systems could be improved by increasing the clock frequency of the routers if the applications impose a high load on the network. This would also hold if the traffic flowed in short bursts: during a burst the injection rate is high, and a short clock period can avoid reaching a non-stable network state.

Regarding the base latency of the network, we can see that the three designs analyzed present similar values. At first glance this might be surprising, because the base latency of the routers is quite different; packets carrying no data flits could have a larger network latency, so we would expect the latency to grow with the number of pipeline stages. Nevertheless, the results show that at low loads the clock frequency also has a strong impact on the final network performance. This can be explained as follows. A router design with more pipeline stages provides a higher clock frequency; that is, the data flits of a packet can be removed from the network at a higher rate, reducing the final packet latency. Recall that in these evaluations the packets are composed of 10 flits.
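For reference, the destinations behind the synthetic patterns named above can be computed as follows for a 64-node network. These are the standard definitions of the classic patterns, sketched with our own function names; they are not code taken from the simulator used in the paper.

import random

# Classic synthetic traffic patterns for a 64-node (8x8) network. Node ids are
# 6-bit numbers: the upper 3 bits are the row, the lower 3 bits the column.
BITS = 6
HALF = BITS // 2

def bit_reversal(src):
    """Destination = source id with its 6 bits reversed."""
    return int(format(src, f"0{BITS}b")[::-1], 2)

def perfect_shuffle(src):
    """Destination = source id rotated left by one bit."""
    return ((src << 1) | (src >> (BITS - 1))) & ((1 << BITS) - 1)

def transpose(src):
    """Destination = source with its row and column coordinates swapped."""
    row, col = src >> HALF, src & ((1 << HALF) - 1)
    return (col << HALF) | row

def uniform_random(src):
    """Destination chosen uniformly at random among all nodes."""
    return random.randrange(1 << BITS)

# Example: node 0b011001 (row 3, column 1) under each deterministic pattern.
src = 0b011001
print(bit_reversal(src), perfect_shuffle(src), transpose(src))   # -> 38 50 11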


Fig. 4. Average packet latency vs. throughput (flits/ns) under random, bit-reversal, transpose and perfect-shuffle traffic.

All the previous results have only considered synthetic loads and, as far as we are concerned, the traffic in a DSM system can have different characteristics; therefore, the assertions about the effect of the router structure might be misleading. Hence, in the next section we evaluate the network performance under real applications.

B. DSM Multiprocessors

The applications used for evaluating the cc-NUMA architecture with superscalar processors were the SPLASH-2 benchmark suite [20]. Tables III and IV show the values of the main processor and memory parameters that we have considered in our evaluation.

TABLE III
MAIN CHARACTERISTICS OF THE PROCESSORS

Frequency: 600 MHz
Instructions per cycle: 4
Functional units: 2 ALU, 2 FPU, 2 ADU
Window size: 64 entries
Queue size to memory access: 32
Branch prediction: two-bit history, BHT 512 entries, 4 levels

TABLE IV
MAIN PARAMETERS OF THE MEMORY HIERARCHY

L1:          16 KB, write-back, direct-mapped, 2 ports, line size 32 bytes, 8 MSHRs, hit time 1 cycle
L2:          64 KB, write-back, 4-way associative, 1 port, line size 64 bytes, 8 MSHRs, hit time 8 cycles
Bus:         200 MHz, 32 bytes, multiplexed
Memory:      4-way interleaved, hit time 18 cycles
Directory:   minimum access time 3 cycles, full mapped
Coherency:   4-state invalidation, MESI
Consistency: relaxed consistency (RC)

As in the previous cases, the topology of the network is a 2D torus with 8x8 processors, and the deterministic Bubble algorithm has been used. Packet lengths differ depending on whether they are a request (8 phits) or a reply message (20 phits).



Figure 5 shows the average network latency and the load applied to the reply network. Design C achieves the lowest latency; that is, the best network latency is achieved by the router with the highest clock frequency. At this point it is interesting to remark that the load supported by the reply network varies from 8 flits per cycle for the Radix application to 0.15 flits per cycle for the LU application. It is also interesting to see the temporal evolution of the load applied to the network. Figures 6 and 7 show this evolution for the load and the latency of the packets. We can appreciate that there are burst periods, that is, the load applied by the applications presents strong variations during the execution. Thus, for short periods the network can be close to the saturation point. A pipelined router design would then allow the network to support a higher load without reaching saturation, and therefore the network would obtain better performance.


Fig. 5. Results for the reply network: average latency and applied load for Designs A, B and C.

Fig. 6. Evolution of the load and the latency of the packets. FFT application. Reply network.

Fig. 7. Evolution of the load and the latency of the packets over the execution time. Radix application. Reply network.

Looking at the results obtained for the request network (see Figure 8), we can appreciate that the applied load is lower and, as a clear consequence, the minimum latency is achieved by Design B. Thus, reducing the clock period of the router improves the latency of the reply network, which transports the cache lines, whereas the latency of the request network is increased. Segmentation therefore has opposite effects on the request and reply networks. So, in order to get an answer, we compare the execution time of the applications using the different router designs. Figure 9 shows the execution times required by the DSM system when the FFT, Radix and LU applications are run. As can be seen, a reduction of about 12% is achieved using Design C instead of Design A for FFT and Radix. The improvements are less significant with the LU application because of the low load this application applies to the interconnection network. But, in general, a non-pipelined router design obtains worse performance than a pipelined one.

V. CONCLUSIONS

Improving interconnection subsystems is crucial for the overall performance of a distributed shared-memory multiprocessor (DSM) system. Thus, emphasis has been placed on the influence of some quantitative parameters of the router: the clock frequency versus the number of pipeline stages. The goal of this paper is to provide some guidelines to the router designer so that the performance of DSM systems can be improved. The study has been focused on k-ary n-cube networks. Pipelining the router design increases the router's clock frequency as well as its base latency. The higher the clock frequency, the more throughput the network can support, which means that the saturation point is reached at higher loads. On the other hand, increasing the base latency of the router increases the network latency at low loads. So, there are two opposite effects. The performance evaluation of applications running on DSM systems shows that reducing the router clock period improves the latency of the reply network, which transports the cache lines.

