Low-cost fully reconfigurable data-path for FPGA-based multimedia processor

Marco Lanuzza, Stefania Perri
Department of Electronics, Computer Science and Systems, University of Calabria, Arcavacata di Rende, 87036 Rende (CS), Italy
{lanuzza,perri}@deis.unical.it

Martin Margala Electrical and Computer Engineering Department, 526 Computer Studies Bldg. University of Rochester, Rochester, NY 14627 [email protected]

Pasquale Corsonello
Department of Electronics, Computer Science and Systems, University of Calabria, Arcavacata di Rende, 87036 Rende (CS), Italy
[email protected]

ABSTRACT

This paper describes a novel data-path architecture for FPGA-based multimedia processors. The proposed circuit can adapt itself at run-time to different operations and data wordlengths, avoiding time- and power-consuming reconfiguration. The new data-path can operate in SIMD fashion and guarantees high parallelism levels when operations on lower precisions are executed. It also supports IEEE-754 compliant single-precision floating-point addition and multiplication. The proposed circuit has been characterized using XILINX Virtex-II devices, but it can be efficiently used in other FPGA families as well.

1. INTRODUCTION

Modern FPGA architectures are optimized for high-density and high-performance logic designs. Very important features, like fast routing resources, embedded multiplier blocks, densities of up to millions of system gates, and block RAM memory modules, make FPGAs an excellent platform also for computationally intensive applications, like multimedia [1-6]. These applications impose stringent time, area, power and flexibility constraints. High speed is needed to ensure real-time processing, whereas low area and low power are required to satisfy the limited area and power budgets of portable apparatus. Finally, high flexibility has to be guaranteed to quickly match the rapid evolution of multimedia applications and to efficiently support their different computational requirements.

Multimedia applications involve many different floating-point and integer operations, like additions, multiplications, comparisons, etc., on 8-, 16- and 32-bit data. As a consequence, realizing efficient and flexible architectures able to support all these operations can require a lot of hardware resources. SRAM-based FPGAs provide extremely flexible hardware platforms: they can be reconfigured to adapt their circuit function to different computational demands. The reconfiguration process is the downloading of a new bit-stream onto the FPGA chip to entirely or partially change the supported functionality. The designer could realize separate circuits, each supporting a certain range of operations, and then use reconfigurability to change the supported functionality. Unfortunately, due to their computationally intensive nature, multimedia applications can frequently require run-time adaptation of the computational units to different operations and/or different data types. Therefore, partially or entirely reconfiguring the device each time can imply considerable performance and power penalties [7]. For these reasons, the design of multimedia FPGA-based circuits able to self-adapt to different operations and data types, avoiding the reconfiguration process, is highly desirable.

In this paper, a new 32-bit run-time reconfigurable data-path for multimedia applications is presented. The proposed architecture operates in Single Instruction Multiple Data (SIMD) fashion on 8-, 16- and 32-bit integer data and supports IEEE-754 [8] compliant single-precision floating-point arithmetic computations. The proposed data-path is able to adapt itself to a specific operation and to the different supported data types without reconfiguring the device. The new architecture also exhibits sufficiently high computational capability with low power dissipation and limited resource occupancy. Moreover, because of its excellent modularity, the presented multimedia data-path is appropriate for realizing low-cost and low-power massively parallel architectures organized as arrays of processing logic that operate in either Single Instruction Multiple Data or Multiple Instruction Multiple Data (MIMD) mode.

The remainder of the paper is organized as follows. In Section 2, a brief background on the computational requirements of multimedia applications is given and related works are discussed. In Section 3, a detailed description of the proposed data-path is provided. Implementation results are presented and discussed in Section 4. Finally, conclusions are given in Section 5.

0-7803-9362-7/05/$20.00 ©2005 IEEE


2. BACKGROUND AND RELATED WORKS

Combining on the same programmable device a general-purpose (core) processor with a high-bandwidth memory interface and a reconfigurable coprocessor (essentially an accelerating data-path with very little control logic) has been demonstrated to be an efficient solution for accelerating data-intensive applications in the multimedia and communication domains [1, 3]. In this architectural model, the master processor controls the execution of programs as well as the reconfiguration of the data-path coprocessor, so program execution can benefit from offloading regular computations to the coprocessor. As an example, during image processing, the master processor can reconfigure the coprocessor and switch the execution of kernel loops to it, ensuring a more time-efficient execution.

In multimedia applications, both floating-point (FP) and integer operations are usually executed [9, 10]. While floating-point operations typically require the 32-bit IEEE-754 compliant format, integer operations require 8-, 16- or 32-bit data. The SIMD paradigm is known as an efficient approach for speeding up these integer operations: SIMD architectures support parallel operations on multiple 8-, 16- and 32-bit data, ensuring higher parallelism levels for lower precisions. As is well known [10], this leads to a great advantage in speed, since integer operations on lower-precision data are the most frequently required in multimedia applications.

Recently, several FPGA-based circuits have been proposed for accelerating both floating-point and SIMD operations. Among them, those described in [11, 12] are particularly efficient. In [11], a modular generator library is proposed for optimizing FPGA-based floating-point units. Some experimental results given in that paper for floating-point adders and multipliers are reported in Table 1. The SIMD multipliers demonstrated in [12] self-adapt to different data types, avoiding the time- and power-consuming reconfiguration process. The fastest and the cheapest SIMD multipliers presented in that paper are also reported in Table 1. The table shows how the referenced circuits are very efficient examples of hardware modules customized for executing specific operations. Many other examples of floating-point and SIMD circuits can be found in the literature; however, all of them are specialized for implementing one specific operation in hardware.

Table 1. Some related works.

                        # Slices   18x18 mult.   MHz    SIMD   fp op
FP_Multiplier [11]        1156         -         166     NO     YES
FP_Multiplier [11]         248         4         170     NO     YES
FP_Adder [11]              773         -         212     NO     YES
FP_Adder [11]              571         7         203     NO     YES
SIMD Multiplier [12]      1300         -         111     YES    NO
SIMD Multiplier [12]       729         -         69.7    YES    NO

Typical computations required in multimedia applications are reported in Table 2. They could be supported by integrating in the same circuit one FP multiplier, one FP adder, one SIMD adder, one SIMD multiplier and some auxiliary logic for also computing the absolute value of additions and subtractions. In order to save hardware resources, a good alternative could consist in realizing separate circuits and using reconfigurability to change the supported functionality. Unfortunately, as explained above, this leads to considerable time and power penalties. To the best of our knowledge, low-cost FPGA-based architectures optimized for supporting both floating-point and SIMD operations without requiring a reconfiguration action do not exist in the literature. Our work is therefore groundbreaking in this respect.

Table 2. Operations supported by the proposed data-path.

Instruction              Description                                        Latency  Parallelism  MOPS
Add8 / Sub8              integer 8-b addition / subtraction                    4          4        332
Mul8                     integer 8-b multiplication                            5          4        166
Abs_add8 / Abs_sub8      integer 8-b abs. value of addition / subtraction      5          4        166
Add16 / Sub16            integer 16-b addition / subtraction                   4          2        166
Mul16                    integer 16-b multiplication                           8          2        41.5
Abs_add16 / Abs_sub16    integer 16-b abs. value of addition / subtraction     5          2        83
Add32 / Sub32            integer 32-b addition / subtraction                   5          1        41.5
Mul32                    integer 32-b multiplication                          12          1        10.4
Abs_add32 / Abs_sub32    integer 32-b abs. value of addition / subtraction     7          1        20.7
FP_add / FP_sub          floating-point 32-b addition / subtraction            9          1        16.6
FP_mul                   floating-point 32-b multiplication                   11          1        11.9
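As a sanity check on the throughput figures, the MOPS column is consistent with peak throughput computed as clock frequency times parallelism divided by the number of cycles between successive results. The 83 MHz post-layout clock is reported in Section 4; the per-operation issue rates used below are inferred from the table, not stated explicitly in the paper:

```python
def mops(freq_mhz: float, parallelism: int, cycles_per_result: int) -> float:
    """Peak millions of operations per second for a group of SIMD lanes."""
    return freq_mhz * parallelism / cycles_per_result

# 8-bit add/sub: four lanes, one result set per cycle
print(mops(83, 4, 1))   # 332.0, matching the Add8 / Sub8 row
# 8-bit multiply: the two-step scheme halves the issue rate
print(mops(83, 4, 2))   # 166.0, matching the Mul8 row
```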

3. THE PROPOSED DATA-PATH The top-level architecture of the proposed multimedia data-path is depicted in Fig.1 and the supported operations are summarized in Table 2.


The Input Stage acquires two 32-bit operands, A[31:0] and B[31:0], and, on the basis of the required operation specified by appropriate control signals, prepares the operands for the subsequent elaboration. The Input Stage is organized as illustrated in Fig.2. The Dispatchers form a multiplexing network that, on the basis of the required operation, selects the appropriate 8-bit sub-words of the operands A and B as inputs of the PEs. As clarified later, this operation is needed to efficiently compute multiplications on wider data. The FP data formatters and the Right Shifter prepare data for floating-point elaborations: the former arrange each operand into a 24-bit significand and an 8-bit exponent, whereas the latter executes the operand-alignment pre-shift. In order to save hardware resources, the pre-shift is provided only for operand B. If A needs to be aligned, the two operands are swapped in advance.
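The formatting and pre-shift steps can be modelled in software. The following is a minimal sketch assuming normalized IEEE-754 inputs; the helper names fp_format and align are illustrative, not taken from the paper:

```python
import struct

def fp_format(x: float):
    """Split a 32-bit IEEE-754 value into sign, biased exponent and the
    24-bit significand with the hidden leading 1 restored, as the FP data
    formatters are described to do."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    significand = (bits & 0x7FFFFF) | (1 << 23)   # 24-bit significand
    return sign, exp, significand

def align(a: float, b: float):
    """Alignment pre-shift: only operand B is right-shifted; if A has the
    smaller exponent, the two operands are swapped in advance."""
    sa, ea, ma = fp_format(a)
    sb, eb, mb = fp_format(b)
    if ea < eb:                  # swap so that the pre-shift applies to B
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    mb >>= (ea - eb)             # right-shift B's significand into alignment
    return (sa, ea, ma), (sb, ea, mb)
```

After align both significands share the larger exponent, which is what the subsequent carry-linked addition on the PEs requires.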

Through the control signals indicating the required operation and the data type to operate on, two different operation modes are provided: the Independent Mode (IM) and the Carry-Linked Mode (CLM). In the IM, all the PEs run in parallel and each one operates independently of the others. This mode is used for executing 8-bit integer operations in SIMD fashion (i.e. generating four parallel results). In the CLM, the operation of each PE depends on the other PEs. This mode is used to perform 16-bit and 32-bit integer or floating-point addition-based operations. In particular, during the execution of 16-bit addition-based operations, PE1 is carry-linked to PE2 (i.e. the carry-out generated by PE1 is propagated to PE2) and PE3 is carry-linked to PE4. In this way, two parallel 16-bit operations can be executed. On the contrary, the execution of floating-point or 32-bit integer addition-based operations requires that the PEs are all carry-linked together. For 16-bit, 32-bit integer and floating-point multiplication operations, the Secondary Processing Logic stage also comes into action. As depicted in Fig.4, this module receives the partial results generated by the PEs as inputs and combines them through a Connection Network, a SIMD adder and a Normalizer. The Connection Network prepares the operands for the SIMD adder, which can compute 32- or 64-bit additions. The Normalizer uses the Zero Leading Counter (ZLC) to evaluate how many left shifts are needed for normalizing the results of floating-point operations to the IEEE-754 format.
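How the Secondary Processing Logic combines the PEs' 8-bit partial products into a wider product follows the decomposition detailed in Section 3.1, equations (1) and (2). A functional sketch in plain integer arithmetic, standing in for the hardware:

```python
def mul16(A: int, B: int) -> int:
    """Compute a 16-bit product with the scheme of (1)-(2): 8-bit partial
    products, 'link' as concatenation of 16-bit words (high << 16 | low)
    and sh8l as a left shift by 8 bits."""
    AL, AH = A & 0xFF, (A >> 8) & 0xFF
    BL, BH = B & 0xFF, (B >> 8) & 0xFF
    PP1 = AL * BL        # A[7:0]  x B[7:0]
    PP2 = AH * BL        # A[15:8] x B[7:0]
    PP3 = AL * BH        # A[7:0]  x B[15:8]
    PP4 = AH * BH        # A[15:8] x B[15:8]
    link = (PP4 << 16) | PP1          # (PP4) link (PP1)
    return (link + ((PP2 + PP3) << 8)) & 0xFFFFFFFF
```

Two such products run in parallel on the four PEs, and the 32-bit case of equations (3)-(4) nests the same scheme one level deeper.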

Fig. 1. The top-level architecture of the proposed data-path.

The Main Processing Logic stage consists of four 8-bit Processing Elements (PEs) linked as depicted in Fig.3. The multiplexers between the PEs allow the carry propagation path to be broken when operations on 8-bit or 16-bit data are executed, thus ensuring an efficient application of the SIMD paradigm.
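The carry-link multiplexers can be modelled behaviourally. The sketch below assumes simple ripple behaviour between the four 8-bit lanes and is illustrative only:

```python
def simd_add(a: int, b: int, mode: str) -> list[int]:
    """Model the four carry-linked 8-bit PEs. In IM ('8') every carry link
    is broken (four independent lanes); in CLM the multiplexers forward the
    carry-out of one PE to the next, giving two 16-bit lanes ('16') or one
    32-bit lane ('32')."""
    lane_bits = {'8': 8, '16': 16, '32': 32}[mode]
    results, carry = [], 0
    for i in range(4):                    # PE1..PE4, 8 bits each
        if (i * 8) % lane_bits == 0:      # lane boundary: break the carry
            carry = 0
        s = ((a >> i * 8) & 0xFF) + ((b >> i * 8) & 0xFF) + carry
        carry = s >> 8                    # carry-out toward the next PE
        results.append(s & 0xFF)
    # repack the 8-bit partial sums into lane-sized results, low lane first
    packed = sum(r << (i * 8) for i, r in enumerate(results))
    mask = (1 << lane_bits) - 1
    return [(packed >> j) & mask for j in range(0, 32, lane_bits)]
```

For example, adding 0xFF and 0x01 in the lowest lane wraps to 0 in 8-bit mode but ripples into the next PE in 16- and 32-bit modes.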

Fig. 2. The Input Stage.

Fig. 3. The Main Processing Logic stage.

Fig. 4. The Secondary Processing Logic stage.
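The Normalizer's role (count leading zeros with the ZLC, left-shift the significand and adjust the exponent) can be sketched as follows; the helper is hypothetical, not the paper's exact circuit:

```python
def normalize(significand: int, exponent: int) -> tuple[int, int]:
    """Normalize a 24-bit significand: the ZLC counts leading zeros, the
    Left Shifter moves the leading 1 into the hidden-bit position, and the
    Exponent Subtractor decrements the exponent by the shift amount."""
    if significand == 0:
        return 0, 0                        # zero result: nothing to normalize
    shift = 24 - significand.bit_length()  # leading-zero count over 24 bits
    return (significand << shift) & 0xFFFFFF, exponent - shift
```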

To efficiently support the floating-point and SIMD operations listed in Table 2, each PE of the Main Processing Logic stage has been structured as illustrated in Fig.5. It can be seen that a limited amount of hardware resources is required: just two 4-bit multipliers, MULT1 and MULT2, four 4-bit ripple-carry adders, RCA1, RCA2, RCA3 and RCA4, and some auxiliary logic are used.

Fig. 5. The PE architecture.

3.1 INTEGER ARITHMETIC OPERATIONS

For the execution of integer operations, the 32-bit operands received by the data-path are dispatched to the PEs as packed data. When 8-bit and 16-bit integer operations are required, each input data is treated as a group of four independent 8-bit sub-words or two parallel 16-bit sub-words, respectively. In performing 8-bit operations, the PEs run in IM and operate independently on the 8-bit sub-words of the operands; four parallel 8-bit results are output. When the 8-bit addition is performed, each PE calculates a[7:4]x0001 and b[7:4]x0001 by the multipliers MULT1 and MULT2, respectively. RCA1 and RCA2 compute a[3:0]+b[3:0]+0 and a[7:4]x0001+b[7:4]x0001+co1, respectively, and the 8-bit result is provided through the output lines PR[7:0]. The 8-bit subtraction is performed in a similar way, but the operand b[7:0] is 2's complemented and the signal cin is forced to 1.

The 8-bit multiplication requires two steps. In the first step, each PE computes the partial products a[3:0]xb[7:4] and a[7:4]xb[3:0] and adds them by RCA3 and RCA2; the latter receives a carry-in co1 forced to 0. The 8-bit result obtained in this way is stored in the 4-bit registers REGISTER2 and REGISTER3, and the carry-out co3 is stored in the flip-flop FF2. In the second step, the partial products a[3:0]xb[3:0] and a[7:4]xb[7:4] are computed and accumulated to the previously stored result. Each PE thus provides one 16-bit product through its output lines PR[15:0].

Computing the 8-bit absolute value of the sum or of the difference also requires two steps. In the first one, each PE uses the adders RCA1 and RCA2 to compute a[7:0]+b[7:0]+1 or a[7:0]+b[7:0]+0, and the obtained partial result is stored in the registers REGISTER1 and REGISTER2. In the second step, the previously stored partial result is 2's complemented if it is negative (i.e. MSB = '1'). Each PE generates one 8-bit result through the output lines PR[7:0].

During the execution of 16-bit integer addition-based operations the data-path is in CLM: PE1 is carry-linked to PE2, whereas PE3 is carry-linked to PE4. Among the supported 16-bit integer operations, only the multiplication is described here, since it is the most complex. The basic operations needed to calculate two parallel 16-bit multiplications by using the 8-bit PEs are shown in (1), where PP1, PP2, PP3, PP4, PP5, PP6, PP7 and PP8 are the partial products defined in (2); the link operator indicates a simple concatenation and sh8l is a left shift by 8 bits.

A[15:0]xB[15:0] = (PP4) link (PP1) + sh8l(PP2+PP3)
A[31:16]xB[31:16] = (PP8) link (PP5) + sh8l(PP6+PP7)          (1)

PP1 = A[7:0]xB[7:0];     PP5 = A[23:16]xB[23:16];
PP2 = A[15:8]xB[7:0];    PP6 = A[31:24]xB[23:16];
PP3 = A[7:0]xB[15:8];    PP7 = A[23:16]xB[31:24];             (2)
PP4 = A[15:8]xB[15:8];   PP8 = A[31:24]xB[31:24];

From (1) and (2), it can easily be seen that the execution of the two 16-bit multiplications requires three steps. In the first step, the PEs compute the 16-bit partial products PP2, PP3, PP6 and PP7. In the second step, the SIMD adder in the Secondary Processing Logic stage sums PP2 and PP3 and, independently, PP6 and PP7; at the same time, the PEs compute the 16-bit partial products PP1, PP4, PP5 and PP8. In the third step, the Secondary Processing Logic stage links PP1 to PP4 and, independently, PP5 to PP8, and accumulates them to the results obtained by the additions performed in the previous step, thus generating the two parallel final 32-bit results.

For the execution of 32-bit integer addition-based operations, the PEs are all carry-linked. Among the supported 32-bit operations, the multiplication is again the most complex. It is executed as shown in (3), where the sixteen 16-bit partial products defined in (2) and in (4) are used. The steps needed to perform the 32-bit integer multiplication can easily be derived in a way similar to the 16-bit multiplication discussed above.

A[31:0]xB[31:0] = [(PP8) link (PP5) + sh8l(PP6+PP7)] link [(PP4) link (PP1) + sh8l(PP2+PP3)] +
                  + sh16l[(PP12) link (PP9) + sh8l(PP10+PP11) + (PP16) link (PP13) + sh8l(PP14+PP15)]    (3)

PP9 = A[23:16]xB[7:0];    PP13 = A[7:0]xB[23:16];
PP10 = A[31:24]xB[7:0];   PP14 = A[15:8]xB[23:16];            (4)
PP11 = A[23:16]xB[15:8];  PP15 = A[7:0]xB[31:24];
PP12 = A[31:24]xB[15:8];  PP16 = A[15:8]xB[31:24];

3.2 FLOATING-POINT ARITHMETIC OPERATIONS

For the execution of floating-point operations, the Input Stage extends the 23-bit fraction of the received 32-bit operands to the corresponding 24-bit significands and stores the operand signs for later use in evaluating the correct sign of the result. Floating-point addition is executed by the proposed data-path following the conventional algorithm described in [13]. PE4 operates on the exponents of the two operands, while PE1, PE2 and PE3 are carry-linked to operate on the significands. If the operands have different signs, significand(A)-significand(B) is executed; otherwise significand(A)+significand(B) is calculated. The Secondary Processing Logic stage then converts the 2's complement result into the sign-magnitude format and normalizes the 32-bit result according to the IEEE-754 compliant format.

In performing a floating-point multiplication, PE4 adds the exponents while PE1, PE2 and PE3 multiply the 24-bit significands. To execute this 24-bit multiplication, an approach similar to the 32-bit integer multiplication described in Section 3.1 is applied: the PEs sequentially compute the proper number of 16-bit partial products, which are then combined by the Secondary Processing Logic stage. The 48-bit word resulting from the multiplication is truncated down to 24 bits and then normalized to the IEEE-754 compliant format.

4. RESULTS

The proposed 32-bit fully reconfigurable data-path has been realized as a three-stage pipelined circuit using the XILINX Integrated Software Environment (ISE) 5.2 and the XILINX Virtex-II XC2V500-6 device. However, it can be efficiently implemented in any other FPGA family. The new circuit has been optimized by using both clock and placement constraints; in this way, all carry-chains of arithmetic circuits exploit the fast routing resources available in Virtex-II FPGAs [14]. The obtained layout is illustrated in Fig.6.

Fig. 6. The optimized layout of the proposed data-path.

Post-layout characterizations demonstrated that the computational capability and flexibility of the proposed data-path are achieved while occupying only 765 slices. A running frequency of about 83MHz is reached with an average energy dissipation of just 11.2mW/MHz. It is worth pointing out that the achieved running frequency is adequate for the chosen FPGA platform and guarantees sufficiently high computational capability. In fact, as shown in Table 2, the new multimedia data-path

can execute up to 332 million 8-bit additions/subtractions per second and 166 million 8-bit multiplications per second. However, the main innovation introduced by the proposed data-path is its ability to perform several kinds of SIMD and floating-point operations on different data types using very limited resources. This can be better appreciated by observing that if the lowest-cost modules referenced in Table 1 were used to realize a single circuit supporting the same set of floating-point and integer operations as the new data-path, at least 1600 slices and 11 embedded multipliers would be occupied.

5. CONCLUSIONS

A new fully reconfigurable data-path for accelerating multimedia applications has been presented. The new FPGA-based architecture operates in SIMD fashion on 8-, 16- and 32-bit data, and supports all the integer and floating-point operations typically required in multimedia applications. The new architecture is appropriate for realizing low-cost and low-power massively parallel reconfigurable architectures for data-intensive applications. Moreover, the presented data-path requires only one clock cycle to be reconfigured. The proposed circuit occupies only 765 slices and dissipates about 11mW/MHz. It can execute up to 332 million 8-bit additions/subtractions per second and 166 million 8-bit multiplications per second.

REFERENCES

[1] T. Miyamori, K. Olukotun, "REMARC: Reconfigurable Multimedia Array Coprocessor", in Proc. ACM/SIGDA FPGA '98, Monterey, CA, February 1998.
[2] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, et al., "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications", IEEE Transactions on Computers, Vol. 49, No. 5, May 2000.
[3] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R. Taylor, "PipeRench: A Reconfigurable Architecture and Compiler", IEEE Computer, Vol. 33, No. 4, April 2000.
[4] S. McBader, P. Lee, "A programmable image signal processing architecture for embedded vision systems", in Proc. 14th Int. Conf. on Digital Signal Processing, Santorini, Greece, July 2002.
[5] S. Wong, S. Cotofana, S. Vassiliadis, "Coarse reconfigurable multimedia unit extension", in Proc. 9th Euromicro Workshop on Parallel and Distributed Processing, Mantova, Italy, February 2001.
[6] F. Barat, M. Jayapala, A. T. Vander, G. Deconinck, R. Lauwereins, H. Corporaal, "Low Power Coarse-Grained Reconfigurable Instruction Set Processor", in Proc. 13th International Conference on Field Programmable Logic and Applications (FPL), Lisbon, Portugal, September 2003.
[7] J. Resano, D. Verkest, D. Mozos, S. Vernalde, F. Catthoor, "A hybrid design-time/run-time scheduling flow to minimise the reconfiguration overhead of FPGAs", Microprocessors and Microsystems, Vol. 28, Issues 5-6, pp. 291-301, August 2004.
[8] The Institute of Electrical and Electronics Engineers, Inc., "IEEE Standard for Binary Floating-Point Arithmetic", ANSI/IEEE Std 754-1985, New York, August 1985.
[9] J. Fritts, W. Wolf, B. Liu, "Understanding multimedia application characteristics for designing programmable media processors", SPIE Photonics West, Media Processors '99, San Jose, CA, pp. 2-13, January 1999.
[10] S. Nazareth, R. Asokan, "Processor Architectures for Multimedia", academic paper, http://www.cs.dartmouth.edu/~nazareth/academic/CS107.pdf, November 2001.
[11] E. Roesler, B. Nelson, "Novel Optimizations for Hardware Floating-Point Units in a Modern FPGA Architecture", in Proc. 12th International Conference on Field Programmable Logic and Applications (FPL), Montpellier, France, September 2002.
[12] S. Perri, P. Corsonello, M. A. Iachino, M. Lanuzza, G. Cocorullo, "Variable Precision Arithmetic Circuits for FPGA-Based Multimedia Processors", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 9, pp. 995-999, September 2004.
[13] J. J. F. Cavanagh, "Digital Computer Arithmetic: Design and Implementation", McGraw-Hill, 1984.
[14] Xilinx Inc., "Constraints Guide, ISE 5", 2002.
