A Real-Time and Parametric Parallel Video Compression Architecture Using FPGA

June 6, 2017 | Autor: F. Magalhães Frei... | Categoria: Video Compression, Discrete Cosine Transform, Real Time, High Definition, Parallel Architecture

Share Embed

Denunciar este link

Descrição do Produto

A Real-Time and Parametric Parallel Video Compression Architecture Using FPGA C´assio A. Carneiro1 , Francisco M. P. Garcia1 , Fl´avia M. Freitas1 , Z´elia M. A. Peixoto1 , Amanda R. M. Diniz1 , and Abraham Alcaim2 1

2

Pontifical Catholic University of Minas Gerais, Av. Dom Jos´e Gaspar 500, 30535-610, Belo Horizonte, MG, Brazil, [email protected], {fgarcia, flaviamagfreitas, assiszmp}@pucminas.br, [email protected] Pontifical Catholic University of Rio de Janeiro, Rua Marquˆes de S˜ ao Vicente 225, 22453-900, Rio de Janeiro, RJ, Brazil, [email protected]

Abstract. This paper presents a novel parallel architecture which performs a streamed-based processing of the two-dimensional Discrete Cosine Transform (2D-DCT) for real time video compression applications. This proposal consists in using a programmable device, such as FPGA, to implement kernels of one-dimensional DCT (1D-DCT), referred to as DCT-kernels, which can be instantiated, so many as necessary, to attend the required pixel rate for a specific purpose. The implementation of the architecture proposed for the DCT-kernel also presents some interesting features that represent an advantage over the classical architectures for 1D-DCT available in the literature, mainly when a parallel architecture is supposed to use some of them. Two different applications, standard definition television (SDTV) and high definition television (HDTV), have employed the proposed parallel architecture using different number of DCT-kernels in order to show the potential of its use and real possibilities of enlarging the set of candidate applications.

1

Introduction

Image and video compression has been an area of fast scientific and technological development because of its large employment in many fields of Engineering. Compression techniques offer the possibility to store or to transmit the vast amount of data necessary to represent still image and video in an efficient and robust way. Some of the most important applications are Standard Television (SDTV) [1, 2], High Definition Television (HDTV) [3], Interactive Television and 3D-TV [4] and High Definition Digital Video Disc (HD-DVD), among others. Unfortunately, encoding algorithms with higher performance have come at the price of extraordinarily huge computational complexity and memory access requirement. This makes it difficult to design hardwired encoders for real-time applications. In some cases, the amount of data to be processed may also be too high and so, the bandwidth requirement increases. A typical example is

2

the highly interactive and recreation multimedia (HDTV and HD-DVD), which demands much higher compression ratio and quality for video content. In order to achieve the necessary throughput, pipelining and parallel processing techniques have then been employed whenever the required processing is neither sequential nor data-dependent [3, 6]. One of the basic strategies for coding still images and video is the Transform Domain Coding, which has become a very popular method for lossy still image and video coding [7]. It consists in quantizing and encoding decorrelated transform coefficients rather than the original pixels of the images. The most popular and well-established transform technique is the Discrete Cosine Transform (DCT) used in the lossy JPEG, MPEG family and H.264 coder standards [8]. The DCT is applied strictly as a block-based approach on blocks of size N × N pixels (usually, N = 8) and it is employed to reduce spatial redundancy at either intraframe or interframe encoding [7]. In this latter case, it is accomplished by motion estimation and compensation techniques to reduce temporal redundancy [7]. Recently, hardwired solutions to implement the DCT have been the focus of research effort. Several evolutions of the original algorithm [8] and implementations have been developed for real-time applications, aiming at reducing costs, improving performance and increasing the throughput [3]. The main motivation of this paper is to present a parametric parallel architecture to compute a block-based DCT scheme for real-time applications. The proposed architecture uses a kernel cell to process the 1D-DCT over a vector of N components. Depending on the application, the required throughput may vary and a different number of these kernel structures may be used (the number of kernels to be used is a parameter of the proposed architecture). The reduction of the hardware resources needed to implement this kernel cell, if compared to recently published works [9, 10, 6], enables lower circuit area. It is important to note that in block-based DCT scheme there are no data dependencies between neighboring blocks, pointing to the possibility of using parallel processing.

2

The Discrete Cosine Transform

For the block-based DCT approach, the input image are split into disjoint blocks U of N × N pixels. In general, a linear, separable and unitary 2D-transform can be represented as a matricial operation on each block U in the form C = AU AT , in which C is the N × N transformed coefficients block, A is a N × N transform matrix and AT denotes the transpose of A. A unitary transform is reversible, since the original block U can be reconstructed using a linear and separable inverse transform, doing U = AT CA. However, in a practical coding scheme, the coefficients will be quantized and the original image block will be only approximated, resulting a lossy compression. Today, block-based DCT schemes are widely used in most image and video coding due to their high decorrelation performance [7] and the availability of fast DCT algorithms suitable for hardware implementations in real-time applications. These algorithms use the separability property of the 2D-transform that allows

3

its implementation by two 1D-transform operations. This reduces the computational effort and the necessary area for the hardware implementation. In this sense, the block of coefficients may be computed doing: C T = A(AU )T = AU T AT = (AU T )AT = GAT = (AGT )T

(1)

That is, first the rows of U are processed to produce the 1D-transformed matrix G = AU T . In the following, the rows of the matrix G are processed to obtain the 2D-transformed matrix C T = (AGT )T . Finally, C T must be transposed in order to produce the final transformed image block C. Using the separability principle described by Equation 1 in the computation of the 2D-DCT, the 1D-DCT processing over the xth row of U (Ux ), x = [0, · · · , N − 1] will produce the xth transformed row of G(Gx ), whose components are given by [7], Gxy = ky

N −1 X j=0

Uxj cos

(2j + 1)yπ , y = [0, · · · , N − 1] 2N

(2)

1 if y = 0 and ky = 12 if y 6= 0. The expression given by Equation 2 with ky = 2√ 2 is the basis of the fast algorithms proposed in the recent literature to implement the 1D-DCT.

3

Related Works Concerning the 1D-DCT

Several fast algorithms concerning the implementation of the 1D-DCT have been proposed in the literature [11, 12, 2, 6]. The work presented by [5] compares the required number of additions and multiplications in a few of them. In order to reduce the number of multiplications, which is the most expensive operation, all of them use the separability property, while some of them also apply the scaling principle. It consists in excluding the multiplications by constant factors from the DCT algorithm and embedding them in the quantization step. Table 1 compares the number of multiplications and additions used by some fast 1D-DCT algorithms. It can be observed that the use of the principles of separability and scaling enables the reduction of the required multiplications and, in this sense, the one introduced in [2] was the most efficient. Table 1. Fast 1D-DCT algorithms and the required number of operations [5] Algorithm Multiplication Chen [11] ∗ 16 Lee [12] ∗ 12 Lee [12] ∗∗ 11 Chen [11] ∗∗ 8 Arai [2] ∗∗ 5 ∗ use separability ∗∗ use separability and scaling

Addition 26 29 28 26 29

4

The algorithm proposed by [2] has been later improved by [9], which reduced from 16 to 12 the number of required two’s complement operations, while keeping the same number of additions and multiplications. The algorithm proposed in [9] has six completely independent stages, which enables the use of parallelism. However, preliminary simulations of the algorithm [9], performed by [10], have pointed to a sintax error in one of the variables. The algorithm introduced in [10] corrected this initial problem and has also proposed simple modifications in the previous architecture, in order to have its latency reduced. The final six stages are showed in the following and the diagram of the proposed architecture [10] is presented in Fig. 1. Stage A b0 = a0 + a7 b1 = a1 + a6 b2 = a3 − a4 b3 = a1 − a6 b4 = a2 + a5 b5 = a3 + a4 b6 = a2 − a5 b7 = a0 − a7

Stage B c0 = b0 + b5 c1 = b1 − b4 c2 = b2 + b6 c3 = b1 + b4 c4 = b0 − b5 c5 = b3 + b7 c6 = b3 + b6 c7 = b7

Stage C d0 = c0 + c3 d1 = c0 − c3 d2 = c2 d3 = c1 + c4 d4 = c2 − c5 d5 = c4 d6 = c5 d7 = c6 d8 = c7

Stage D e0 = d0 e1 = d1 e2 = m3 ∗ d2 e3 = m1 ∗ d7 e4 = m4 ∗ d6 e5 = d5 e6 = m1 ∗ d3 e7 = m2 ∗ d4 e8 = d8

Stage E f 0 = e0 f 1 = e1 f 2 = e5 + e6 f 3 = e5 − e6 f 4 = e3 + e8 f 5 = e8 − e3 f 6 = e2 + e7 f 7 = e4 + e7

Stage F S0 = f 0 S1 = f 4 + f 7 S2 = f 2 S3 = f 5 − f 6 S4 = f 1 S5 = f 5 + f 6 S6 = f 3 S7 = f 4 − f 7

Fig. 1. Diagram of the architecture proposed in [10]

The proposed architecture [10] is basically composed of ping pong buffers and arithmetic operators. The ping pong buffer holds the data so that the operations can be performed. It has two columns of registers: the first one accepts serial

5

data input (ping) and the second one provides parallel data output (pong). In this way, a new pixel is filled in the first buffer (ping) in each clock cycle. After a set of eight pixels has been stored, this set is transferred to the second buffer (pong). In the stages “C” and “D”, nine positions are necessary in the respective buffers. Additions and subtractions as previously described for each stage are simultaneously performed in one clock cycle. The five multiplications required to calculate the 1D-DCT over a segment of N elements are computed during six clock cycles. The algorithm proposed by [10] has been chosen as the basis for the implementation of the 1D-DCT in this paper. It will be modified in order to design a kernel structure for the parametric parallel architecture to be proposed. This modified 1D-DCT kernel structure is going to be introduced in the next section and it will be referred to as DCT-kernel.

4

The DCT-kernel Architecture

The proposed DCT-kernel aims at using a lower number of flip-flops if compared to [10]. To achieve this, the ping-pong buffers are replaced by simple ping buffers. Due to the elimination of the pong buffers, the data in each stage are not placed in a fixed position, but they move dynamically while the ping buffer is being filled in. To perform the arithmetic operations, it is necessary to properly modify the addresses of the multiplexes. Some stages also had the number of buffers reduced due to the elimination of the unnecessary data storage (i.e. d0 = e0 = f 0). The proposed DCT-kernel also offers a speedup in relation to the architecture presented in [10]. This was possible by including a pipelining stage in the arithmetic operators that enabled a maximum simulated operation frequency equal to 80 MHz. This will make easier the parallel use of this kernel structure at video compression schemes for HDTV purposes. The increasing of the number of required flip-flops for the pipelining in the arithmetic operators has been compensated by the elimination of the pong buffers. In this way, the DCT-kernel demands only 41.66% of the flip-flops used in [10]. Moreover, the DCT-kernel introduces a shorter period of latency than the architecture proposed by [10], since the calculations in the N th stage can start before the calculations in the (N − 1)th stage have finished (which is not possible in [10]). Figure 2 presents the proposed architecture for the DCT-kernel, where it is possible to observe the reduction of the number of buffers. For example, considering pixels represented with 8 bits, stage “D” in the implementation of [10] requires 198 flip-flops (18 buffers of 11 bits) and, in the DCT-kernel, only 26 flipflops are necessary (1 buffer of 12 bits plus 14 bits of adder-subtracter pipelining). Table 2 compares the quantities of flip-flops used by the architecture proposed in [10] and by the DCT-kernel, in which the reduction of 41.66% can be observed. The DCT-kernel has also reduced a single multiplex circuit in stage “E”, as showed by Table 3.

6

Fig. 2. Diagram of the proposed DCT-kernel

The proposed DCT-kernel consists in a state machine composed of sixteen different sub-states. In each of them, some shift registers are enabled and the related operations of the algorithm are performed. It uses a clock with twice the frequency of the input data in order to multiplex the required addition and subtraction operations (each of them occurs during one clock cycle). Table 3. Number of required multiplexes Table 2. Number of required flip-flops Stage Arch. [10] DCT-kernel A 128 73 B 144 90 C 160 88 D 198 26 E 198 78 F 192 70 Total 1020 425

5

Multiplex Arch. [10] DCT-kernel Mux0-A 4x1 4x1 Mux1-A 4x1 2x1 Mux0-B 4x1 4x1 Mux1-B 4x1 4x1 Mux0-C 3x1 2x1 Mux1-C 3x1 2x1 Mux0-D 4x1 4x1 Mux1-D 5x1 4x1 Mux0-E 5x1 4x1 Mux1-E 4x1 Mux0-F 2x1 4x1 Mux1-F 2x1 4x1

The Parametric 2D-DCT Architecture

To implement the 2D-DCT using 1D-DCT kernels, one or more DCT-kernels are reserved to process the rows, while another set with the same number of kernels is dedicated to process the columns. The first ones operate with 8 bits input

7

data and the second one, with 12 bits. The amount of bits is a programable parameter in the kernel structure. A memory is necessary to save the result of the 1D processing along 8 consecutive rows of pixels in a frame. These data constitutes the input for the second 1D processing (along the columns). This memory is commonly referred to as transposing buffer. In many applications, in which the whole frame to be coded is previously stored, this transposing buffer can be just sufficient to store the block of 8 × 8 pixels that is being coded in that moment. However, as the goal of the presented architecture is performing a streamed-based processing for real time applications, the storage of the whole frame does not occur. The parametric architecture offers the possibility of using as many DCT-kernels as necessary to achieve the required throughput for a given application. In the following, topologies resulted from the choice of two different parameters are presented. The first one uses 2 DCT-kernels to attend the standard definition television (SDTV) purposes and the second one employs 4 DCT-kernels to compress video signal in the high definition television (HDTV). In both cases, only the Y component (luminance) of video signal is going to be processed (there are also the chrominance components, Cb and Cr). The architecture for the HDTV can be used in both 720p (progressive scan) and 1080i (interlaced scan). Table 4 shows the main features of SDTV and HDTV standards. Table 4. The mains features of SDTV and HDTV standards Standards SD 720x480i HD 1920x1080i HD 1280x720p

5.1

frames/ second 30/1.001 30 60

sampling 4:2:2 4:2:2 4:2:2

samples of Y/ second 13,500,000 74,250,000 74,250,000

samples of Cb Cr/ second 6,750,000 37,125,000 37,125,000

An Architecture that Attends SDTV Purposes

The first 8 rows of the frame are consecutively processed (in sets of 8 pixels) by just one DCT-kernel. Only when this first step had finished, two 1D processing are simultaneously done, each of them executed by a different DCT-kernel. The first kernel performs the horizontal processing over the following set of 8 rows and the second kernel performs the processing over the columns resultant of the latter horizontal processing. Then, two buffers of size equivalent to 8 complete rows are necessary. These buffers are internal to the FPGA, which means that no external memory is used. The first buffer is needed for the storage of the horizontal coefficients resultant of the 1D processing of the first set of 8 rows. This buffer will save the horizontal coefficients that will be read during the second step (when they will be processed in the vertical direction). The second buffer, by its time, will be filled in (from this second step on) with the coefficients resulted from the horizontal processing of the second set of 8 rows. In this manner, the simultaneous use of the 2 buffers

8

is such that when one of them is being read, the other one is being written. In each set of 8 incoming rows of pixels, these buffers have their functions inverted. Figure 3 shows the main parts of this proposal for implementing the 2D-DCT for SDTV.

Fig. 3. Parallel 2D-DCT block diagram for SDTV

The maximum clock frequency achieved by each DCT-kernel has been 80MHz (12.5ns), as described in Section 4. This limitation is mainly due to the propagation delays of the multiplexes, adders and subtractors. Since both DCT-kernels operate in parallel (there is only a difference of 8 rows to start the processing by the second one), the frequency limit for the parallel architecture using the both kernels keeps the same as that of a particular kernel. As soon as the incoming pixel rate is half this value (as mentioned in the presentation of the DCT-kernel in Section 4), the limit of operation is 40 megapixels/s. This is more than the necessary to attend standard definition video signals (SDTV), for which the pixel rate is 13,5 megapixels/s.

5.2

An Architecture that Attends HDTV Purposes

The rate of 40 megapixels/s is not enough to process high definition video signal, neither in the 720p nor in the 1080i standard. In this case, the required rate is 74,5 megapixels/s. In order to attend HDTV purposes, the present architecture may explore its parametric characteristic, that is, 4 DCT-kernels (instead of 2) may be used in a parallel way. Figure 4 shows the block diagram for HDTV. Two DCT-kernels are utilized to process, each of them, half the pixels in a particular row. They operate alternatively in sets of 8 consecutive pixels, as shown in Fig. 5. In the same manner, the other two DCT-kernels conjunctly perform the vertical processing in the 8 rows of pixels that have been processed in the horizontal direction in the previous step. In this case, the transpose buffer used by each DCT-kernel has been resized to support only half the pixels of 8 consecutive rows (or columns). For this reason, regardless of the use of 4 DCT-kernels (which is twice the quantity used in the previous architecture), the total memory need for the architecture has been kept unchanged.

9

Fig. 4. Parallel 2D-DCT block diagram for HDTV

Fig. 5. 2D-DCT parallel processing data stream for HDTV

6

Verification Results

To verify the proposed architecture, the language VHDL (VHSIC - Hardware Description Language) has been employed in the design and the software Altera Quartus II has been used for simulation and synthesis. The target device was the CYCLONE II of Altera. It has been observed a significative reduction in the number of logic elements in the proposed architecture, if it is compared to other schemes reported in the literature. Figure 6 shows the simulation results of the first unity of DCT-kernel used in the 2D-DCT for the SDTV. The state machine referred to as ‘smp cnt’ changes in the rising edge of the 80 MHz clock signal ‘clk ’. However, a new pixel ‘input’ arrives at each two periods of clock and the first coefficient ‘output’ is only available after the thirty-second clock cycle. In this figure, the set of 8 pixels [0, 3, 6, 9, 12, 15, 18, 21] has generated the set of 8 1D-DCT coefficients [84, 0 , 0, 0, -77, 1, -6, -2]. For HDTV purpose, a control system has been created to synchronize the job division between the 2 DCT-kernels that alternatively process the same set of 8 rows (or columns). Figure 7 shows the control signals ‘in ena 1 ’ and ‘in ena 2 ’ that enable the data input for the first set of two DCT-kernels. The set of 8 pixels [0, 3, 6, 9, 12, 15, 18, 21] is processed by one of the DCT-kernels and generates the set of 8 1D-DCT coefficients [84, 0 , 0, 0, -77, 1, -6, -2], as shown in the sig-

10

Fig. 6. Simulation results for SDTV

nal ‘output 1 ’. The subsequent set of 8 pixels [24, 27, 30, 33, 36, 39, 42, 45] is processed by another DCT-kernel and generates the coefficients [276, 0 , 0, 0,...] (only the first set of four coefficients are shown in Figure 7). The signals ‘output 1 ’ and ‘output 2 ’ are the input data for the two transposing buffers shown in Figure 4. The access to the data to be processed by each of them must also be controlled. This task is done by a demultiplex that distributes sets of 8 consecutive pixels between the 2 DCT-kernels. Since each DCT-kernel processes a set of 8 pixels in the same row, there is no additional delay (latency) due to the parallel architecture. A speedup of 2 has been obtained.

Fig. 7. Simulation results for HDTV

7

Conclusion

The parametric architecture proposed in this paper is expected to be an efficient alternative to provide the pixel rate required by the most of video compression applications. To achieve this goal, it uses as many DCT-kernels as necessary, in a parallel way. The proposed DCT-kernel is a hardware implementation of the 1D-DCT that presents some advantages over classical implementations in the literature, which are a significative reduction of the number of required flip-flops and the use of a smaller transposing buffer. The need for a buffer with lower size in the DCT-kernel avoids the use of extended memory while enables the use of a buffer internal to the FPGA. This reduces the circuit area and also prevents from some hardware faults. When two

11

or more DCT-kernels are used to increase the pixel rate, the capacity required for each individual buffer decreases and the total memory keeps unchanged. It is an important to consider that the internal memory available in the FPGAs is restricted and relatively low if compared to the number of equivalent gates provided by them. This paper has presented two different uses of the proposed parallel architecture. They attend for standard definition television (SDTV) and high definition television (HDTV) purposes, using two and four DCT-kernels, respectively. However, due to its parametric characteristic, this architecture can also provide solutions for other applications that demand higher pixel rates, without changing the hardware device. The number of kernels to operate in parallel is limited to the number of logic elements provided by the device. Then, this architecture promises to be especially useful to the transform approach for video compression in HDTV applications, which are, nowadays, one of the fields with the most expressive appeal for research and development.

References 1. Recommendation ITU-R BT.656-3: Interfaces For Digital Component Video Signals in 525-line and 625-line Television Systems Operating at the 4:2:2 Level of Recommendation ITU-R BT.601 (Part A). (1995) 2. Arai, Y.; Agui, T.; Nakajima, M.: A Fast DCT-SQ Scheme for Images. Transactions of IEICE. V. E71, N. 11, pp 1095-1097, November 1988. 3. Chen, T.C: Analysis and Architecture Design of an HDTV720p 30 Frames/s H.264/AVC Encoder. IEEE Transactions on Circuits and Systems for Video Technology. V. 16, N. 6, pp 673-688, June 2006. 4. Fehn, C.; de La Barr´e, R.; Pastoor, S.: Interactive 3-DTV - Concepts and Key Technologies. Proceedings of the IEEE. V. 94, N. 3, pp 524-538, March 2006. 5. Bhaskaran, V.; Konstantinides, K.: Image and Video Compression Standards Algorithms and Architectures. Kluwer Academic Publishers. Massachusetts, 1999. 14 6. Porto, R.; Agostini, L.: Project Space Exploration on the 2D-DCT Architecture of a JPEG Compressor Directed to FPGA Implementation. Design, Automation and Test in Europe Conference and Exhibition, pp 224-229, February 2004. 7. Jain, A.: Fundamentals of Digital Image Processing. Prentice Hall. USA, 1989. 8. Ahmed, N.; Natrajan, T.; Rao, K.R.: Discrete Cosine Transform. IEEE Trans. Comput. V. C-23, N. 1, pp 90-93, December 1984. 9. Kovac, M.; Ranganathan, N.: JAGUAR: A Fully Pipelined VLSI Architecture for JPEG Image Compression Standard. Proceedings of the IEEE. V. 32, N. 6, pp 247-258, February 1995. 10. Agostini, L.: Information Technology Digital Compression and Coding of Continuous-tone still Images Requirements and Guidelines. M. Sc. Dissertation. Computer Science Graduate Program - Federal University of Rio Grande do Sul, 2002. 11. Chen, W.; Smith, C.; Fralick, S.: A Fast Computational Algorithm for the Discrete Cosine Transform. IEEE Transactions on Communications. V. 25, N. 9, pp 1004-1009, September 1977. 12. Lee, B.: A New Algorithm to Compute the Discrete Cossine Transform. IEEE Transactions on ASSP. V. 32, N. 6, pp 1243-1245, December 1984.

Lihat lebih banyak...

A Real-Time and Parametric Parallel Video Compression Architecture Using FPGA

Descrição do Produto

Comentários