A new architecture to compute the discrete cosine transform using the quadratic residue number system
Descrição do Produto
ISCAS 2000 - IEEE International Symposium on Circuits and Systems, May 28-31, 2000, Geneva, Switzerland
A NEW ARCHITECTURE TO COMPUTE THE DISCRETE COSINE TRANSFORM USING THE QUADRATIC RESIDUE NUMBER SYSTEM J. Ramírez (1), A. García (1), P. G. Fernández (2), L. Parrilla (1), A. Lloris (1) (1)
(2)
Department of Electronics and Computer Technology University of Granada, Spain
ABSTRACT A new methodology to compute the N-point DCT (Discrete Cosine Transform) in the QRNS (Quadratic Residue Number System) is presented, with a significant improvement in complexity and speed compared to the corresponding binary version. This reduction in the total number of arithmetic adders and multipliers is up to 21% and 26% for an 8−point and a 16−point DCT, respectively. In addition, large and slow binary adders and multipliers with a long carry propagation delay are replaced by high-speed small wordlength modular adders and LUT (Look−Up Table) multipliers. When a Field Programmable Logic (FPL) implementation using Altera FLEX10KE devices of the proposed architecture for the 8−point DCT is considered, throughput is three times better than that obtained with the corresponding fixed point 2’s complement binary implementation.
1. INTRODUCTION The introduction of the DCT (Discrete Cosine Transform) in 1974 by Ahmed [1] has created great interest in its applications in digital image and speech coding. This interest derives from the fact that the DCT approaches the statistically optimal Karhunen-Loeve Transform (KLT) while providing a much more efficient computational framework. For this reason, it has been used in most of the coding standards for image compression, including CCITT H.261/H.263, JPEG (Joint Photographic Expert Group) and MPEG (Motion Pictures Experts Group). The DCT of an N−point sequence {x(0), x(1), ..., x(N−1)} is defined by: X (m ) =
2 (2n + 1)mπ Km x(n )cos 2N N n=0 N −1
∑
m = 0, 1, ..., N - 1 K0 =
1 2
(1)
K 1 = K 2 = .... = K N −1 = 1
The existence of fast algorithms to compute the DFT (Discrete Fourier Transform) is the motivation to compute the DCT through the DFT, thus obtaining a reduction in hardware complexity and an increase in speed. Ahmed [1] introduced the first fast algorithm to compute the N-point DCT through the computation of the real part of a 2N-point DFT scaled by a complex exponential constant. The N-length input sequence is padded with zeros to 2N points and, thus, the N-point DCT of the original sequence may be computed through the calculation of a 2N-point DFT with a fast transform algorithm, as derived from:
Department of Electrical Engineering University of Jaén, Spain
X ( m) =
W2 N =
2π −j e 2N
− j mπ 2 N −1 2 K m Re e 2 N x(n) W 2mn N N n=0
∑
x( n) = 0
(2)
n = N , N + 1, ..., 2 N − 1
After this first proposal many new algorithms [2] have been developed for the efficient computation of the DCT with a reduced number of real multiplications and additions. Another possible 2Npoint sequence to compute the DCT through a scaled FFT is the specular reflection of the original N-point sequence. Thus, the Npoint DCT can be computed with 2N·log2(2N) complex MAC operations. Haralick [3] shows that the N-point DCT can be computed with two N-point FFTs obtaining a reduction of 2N complex MACs. The procedure is based on the fact that the FFT of a real even data sequence of length 2N can be obtained by computing two FFTs of N-point data sequences. Tseng and Miller [4] have shown an algorithm with only N/2·(log2N+1) complex operations which calculate only the real part of the FFT of a 2Npoint real specular-reflection sequence. Narasimha and Peterson [5] have proposed a more efficient algorithm to compute the N-point DCT with a rearrangement of the input sequence and a radix-2 algorithm for the DFT. With this algorithm, the resulting number of complex multiplications is (N·log2N−N+1)/4, which leads to an implementation that is twice as fast as the previous proposals. This latter algorithm is described in detail in section 3 and is the objective of the subsequent binary and QRNS implementations. The use of complex arithmetic to compute the DCT makes use of the QRNS (Quadratic Residue Number System) [6] as an enabling tool due to its high efficiency in computing complex multiplications with only two modular multiplications and no additions at all. On the other hand, RNS [7] has been shown to be an efficient arithmetic system for the implementation of DSP applications [6]. The advent of new FPL (Field Programmable Logic) device families, such as Altera FLEX10K [8], has introduced new possibilities for the fast implementation of RNS-based systems. The synergy between FPL devices and RNS has been shown with the realization of a special class of digital filters [9]. Moreover, a fast implementation of the DCT using RNS and FPL devices has recently been reported [10], with improved results over its binary counterpart. In this paper, a regular architecture composed of parallel and independent QRNS channels for the computation of the DCT is presented, showing a significant reduction in hardware complexity over the traditional binary approach. This is materialized in a reduction in the number of adders and multipliers required. In addition, an implementation on FPL devices (previously modelled at structural level using VHDL) of the 8-point transform is
0-7803-5482-6/99/$10.00 ©2000 IEEE
V-321
included both for the QRNS-based system and the binary alternative. The throughput for the QRNS-based system is three times better than that of the corresponding fixed-point 2’s complement binary implementation.
(a + , b − ) a= 2
−1
(a
+
-1
QRNS →
+b
−
)
a + jb
m
(
; b = (2r ) a + − b − −1
)
(7) m
Different techniques have been examined to solve the performance degradation in the conversion stage. The RNS-to-binary conversion is usually the main throughput limitation of RNS-based systems. As in image coding the DCT need to be coded, the use of a special class of auto-scaling RNS to binary converters [11] is examined in this paper to maintain the high performance of the proposed QRNS-based DCT architecture.
According to (6), addition and multiplication in the QRNS are computed with only two modular operations. In this way, DSP algorithms involving a great number of complex multiplications, such as the DFT, are more efficiently computed over a QRNS system.
2. QRNS BACKGROUND
The procedure for the QRNS-based DCT computation is as follows. Initially, the N-point input sequence {x(0), x(1), ..., x(N−1)} is reordered in the sequence {y(0), y(1), ..., y(N−1)} defined by:
An RNS (Residue Number System) [6, 7] is defined by a set of positive pairwise relatively prime integers {m1, m2, ..., mL} called moduli. Any integer X ∈ [0, M−1] is represented by the L−tuple [x1, x2, ..., xL], where x i = x m (i= 1, 2, ..., L) is the residue of X
3. QRNS-BASED DCT COMPUTATION
y (n ) = x (2n )
modulo mi, while M, the dynamic range of the RNS system, is given by: L
M =
mi ∏ =
(3)
i 1
M
=
[ x ◊y
1 m , ..., 1
1
xL ◊ y L
mL
]
(4)
Z (n ) = H n Y (n ) =
→
a + = a + rb
b − = a − rb
m
+
)(
(a (c
(5)
b−
, b− ◊ c+ , d − = a + ◊ c +
−
m
, b ◊d
− m
(6)
where ◊ represents addition, subtraction or multiplication. The inverse QRNS mapping is given by:
+
, b−
+
,d−
) )
2π 4N
⇔
(9)
(10)
(g , h ) (i , j ) +
+
(e a+
m
)
-j
is used, it is only necessary to compute N/2+1 values of Z(n). Thus, the N/2+1 set of points {Z(0), Z(1), ..., Z(N/4), Z(N/2), Z(N/2+1), ..., Z(3N/4−1)} is proposed to minimize the number of operations in the N-point DCT computation. With this set, N−3 fewer adders
Arithmetic in the QRNS is defined as follows:
(a
W4 N = e
Re[Z (N - n )] = −Im[Z (n )]
QRNS exists if the prime factor decomposition of m has only primes of the form 4K+1 or, equivalently, if the equation x2+1 can be factorized as (x−r)(x+r) with r∈{0, 1, ..., m−1}. Let a+jb be a complex number with 0 ≤ a, b < m and r a root of the equation x2+1= 0 in the ring of integers modulo m. Thus, QRNS is defined by the isomorphic mapping:
a + jb
2 K n W 4nN Y (n ) N
Z (N − n ) = − jZ * (n )
An extension of RNS is the QRNS [6], which is defined by an isomorphic mapping between the ring of complex integers modulo m and the set of pairs of elements in the ring of integers modulo m. Complex integer multiplication is efficiently computed in the QRNS with only two parallel modular multiplications. In this way, algorithms involving intensive complex multiplications, such as the FFT, are efficiently implemented.
a + , b −
(8)
If the following property:
where ◊ represents either addition, subtraction or multiplication.
QRNS
N −1 2
Let {Y(0), Y(1), ..., Y(N−1)} be the DFT of the sequence {y(0), y(1), ..., y(N−1)}. It can be shown that the DCT {X(0), X(1), ..., X(N−1)} of the original sequence is obtained [5] through the real part of Z(n), defined as follows:
Arithmetic in the RNS is defined over the ring of integers modulo M. For 0≤X, Y, Z< M: Z = X ◊Y
n = 0, ...,
y (N − n − 1) = x (2n + 1)
i
− +
,f
−
)
+
∗
−
−
+m
g+
+m
h−
c+
−m
×e+
d−
−m
×f
m
− m
i+
j−
Figure.1. A QRNS butterfly for a radix-2 FFT.
V-322
and N/2−2 fewer multipliers are required. In this way, using this set and taking into account the property given in equation (10), the Npoint DCT of the sequence {x(0), x(1), ..., x(N−1)} is given by {Re[Z(0)], Re[Z(1)], ..., Re[Z(N/4)], −Im[Z(3N/4−1)], −Im[Z(3N/4−2)], ..., −Im[Z(N/2+1)], Re[Z(N/2)], Re[Z(N/2+1)], ..., Re[Z(3N/4−1)], −Im[Z(N/4)], −Im[Z(N/4−1)], ..., −Im[Z(1)]}.
y(0) y(1) y(2) y(3)
In a QRNS implementation of the previous procedure, complex binary adders and multipliers are replaced by QRNS adders and multipliers in the computation of the scaled radix-2 DIF (Decimation In Frequency) FFT given in (9). According to (6), each QRNS adder and each QRNS multiplier consists of two modular adders and two modular multipliers, respectively. However, the fact that the input sequence is real reduces each QRNS adder with real inputs to only one modular adder. A normal QRNS butterfly for the computation of a radix-2 DIF DFT is shown in Figure 1. It consists of a QRNS adder (two modular adders), a QRNS subtractor (two modular subtractors) and a QRNS multiplier (two LUT modular multipliers). The architecture for a prearranged dynamic range is composed of a sufficient number of parallel channels with pairwise relatively prime moduli satisfying the QRNS mapping requirements. For an N-point DCT-QRNS implementation, the number of real input QRNS adders required is: A=
log 2 N −1
∑ = i 0
N 2i
(11)
Thus, a direct calculation of the modular adders (MA) and LUT modular multipliers (MM) required leads to: MA = 2 N (log 2 N − 1) − A + 6 MM = N (log 2 N − 1) + 5
(12)
Table 1 shows the binary adders and multipliers required for the binary 8-point and 16-point DCT, as well as the modular adders and multipliers required for the QRNS alternative. Reduction is up to 21% and 26% in the total number of arithmetic adders and multipliers for the computation of an 8−point and a 16−point QRNS DCT, respectively. Furthermore, a speed increase is obtained, since the QRNS alternative does not require large multipliers or adders with long carry propagation delay, as does the binary solution, and a LUT is used for each modular multiplier and high speed parallel modular adders replace binary adders in each QRNS channel. Binary and QRNS implementations of the 8-point DCT were carried out on FPL devices in order to compare hardware
+
+
+
+
−
+
− −
+
y(4)
−
y(5)
−
y(6)
−
y(7)
−
0
∗
2
∗
W8
W8
+
∗
+
+
0
∗
+
1
−
2
∗
3
∗
W8 W8 W8 W8
H0 H4 H2
H1 H5
∗
Z(0)
∗
Z(4)
∗
Z(2)
∗
Z(1)
∗
Z(5)
Figure. 2. Pipelined QRNS DCT implementation. complexity and speed performance. Altera FLEX10KE [8] devices were used for the practical implementation of both alternatives. First of all, a 2’s complement binary implementation of the radix-2 DIF FFT-based DCT was carried out. The 8-point input sequence {x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)} is initially represented with 8-bit precision, while each complex “twiddle” factor and each scale factor are represented with 10-bit precision. The precision is extended as required in each stage, obtaining a 32-bit dynamic range for the DCT computation. The necessary registers between arithmetic instances for a pipelined implementation are added. The resulting 2’s complement binary architecture to compute the DCT has six latency cycles.
Note, in the case of an 8-point sequence, only the set {Z(0), Z(1), Z(2), Z(4), Z(5)} is required for DCT computation. In this way, the 8-point radix-2 DIF QRNS FFT is simplified and five QRNS adders and two QRNS multipliers are saved. Figure 2 shows the algorithm used to compute the 8-point DCT. Each arithmetic instance is a QRNS component consisting of two modular adders, two modular subtractors or two LUT modular multipliers. The two modular adders normally present in each of the 14 real input QRNS adders are reduced to only one.
4. COMPARISON WITH THE BINARY IMPLEMENTATION
+
N
Binary adders
Binary multipliers
MA
MM
Averaged reduction
8
30
27
24
21
21%
16
94
75
72
53
26%
Table 1. Number of arithmetic modules for N-point DCT binary and QRNS implementations. For the DCT-QRNS implementation, the 32-bit dynamic range is obtained with four 8-bit moduli {221, 229, 233, 241}, while the roots for the isomorphic mapping are {47, 107, 89, 177}, respectively. 8-bit modulus parallel channels are chosen in order to ensure fast internal QRNS processing. On the other hand, it is previously necessary to convert the 8-bit real inputs {x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)} to the QRNS. According to (5), conversion of a real integer to the QRNS is easily simplified to just a residue computation only requiring an 8-bit comparator and an 8bit subtractor. Thus, conversion to QRNS with an 8-bit modulus set is easily accomplished with few resources. Altera FLEX10KE [8] devices were used to implement the QRNS and binary DCTs. These devices contain an embedded array to implement memory and specialized logic functions, and a logic array to implement general logic. The embedded array consists of a series of EABs (Embedded Array Blocks). Each EAB provides a 4K-bit memory. The logic array consists of LABs (Logic Array
V-323
Blocks). Each LAB includes eight LEs (Logic Elements). A LE provides a four input LUT, a programmable flip-flop and dedicated signal paths for carry and cascade functions. The results for both architectures are shown in Table 2, with throughput measured in millions of DCTs per second (MDCTPS). Several facts should be noted: •
•
•
Approximately the same number of LEs are required for both implementations, 2497 for the binary and 2308 for the fourchannel QRNS system. The limited number of LEs required by the binary implementation, despite the large number of binary multipliers, is due to the constant input to each binary multiplier. In the QRNS implementation, the memory available in the FPL devices is exploited to simplify and accelerate the task of multiplying two complex numbers. Thus, 14 EABs are required for each channel. QRNS implementation provides a higher throughput due to the efficiency of the QRNS multiplication through LUTs. Specifically, for grade −1, −2, and −3 speed FLEX10KE devices, QRNS performance is 3.42, 3.38 and 3.31 times better than binary, respectively. Binary implementation
QRNS implementation
# LE
2497
4×577
# EAB
0
4×14
−1
27.02
92.59
−2
23.64
80.00
−3
17.85
59.17
Throughput for several device speed grades (MDCTPS)
5. SUMMARY RNS has been revealed as a very useful arithmetic system for the development of high-performance DSP applications over FPL devices. In particular, residue arithmetic has been used to compute the Discrete Cosine Transform with a reduction in hardware complexity and with a significant increase in performance. This paper shows that the QRNS is an efficient tool for the development of high-performance DCT processors based on complex arithmetic DFTs. The computational algorithm considered leads to a substantial reduction in the number of arithmetic modules required for the DCT computation and, thus, a reduction in hardware complexity. It has been shown that the reduction in the total number of adders and multipliers is in the range 21-26% for 8-point and 16-point QRNS DCT implementations. The 8-point 1D-DCT processor and its binary counterpart were implemented on Altera FLEX10KE FPL devices, resulting in a throughput improvement for the QRNS approach of up to 242%.
6. ACKNOWLEDGEMENTS The authors were supported by the Dirección General de Enseñanza Superior (Spain) under project PB98-1354.
7. REFERENCES [1]
Table 2. Throughput and resource usage for a QRNS DCT implementation over FLEX10KE devices. The practical implementation of RNS-based systems encounters a difficulty in the conversion stage. Different solutions were addressed to overcome the conversion drawback. Binary-to-RNS and RNS-to-QRNS conversions are fast operations since the input and the selected moduli are 8 bit width. RNS-to-QRNS and QRNSto-RNS conversions only require one adder, one substractor and one LUT. However, conversion from RNS to binary implies the use of 32-bit word-length modular adders and large multipliers. Thus, a direct implementation of the Chinese Remainder Theorem (CRT) results in excessive hardware usage and slow performance. The auto-scaling RNS to binary converter (ε-CRT) proposed by Griffin et al in [11] allows to overcome these drawbacks using only LUTs and conventional binary adders. For a scaled n-bit binary output and a k-bit modulus set, this converter needs one 2k×n LUT per modulus and a binary n-bit adder tree to add the LUT outputs. Using pipeline in the binary adder tree the performance of the εCRT converter with 24, 16 and 8 bits output improves the performance of the QRNS architecture for several speed grade devices and particularly, this can be maintained with three pipeline stages in the adders for a 24-bit output. Thus, simulations showed that the high performance of the proposed QRNS-based DCT architecture is not degraded when converters are added.
N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform”, IEEE Trans. on Computers, vol. C-23, pp. 9093, Jan. 1974. [2] K. R. Rao, P. Yip, Discrete Cosine Transform. Algorithms, Advantages, Applications. Academic Press. Inc, 1990. [3] R. M. Haralick, “A Storage Efficient Way to Implement the Discrete Cosine Transform”, IEEE Trans. on Computers, vol. C-25, pp. 764-765, June 1976. [4] B. D. Tseng and W. C. Miller, “On Computing the Discrete Cosine Transform”, IEEE Trans. on Computers, vol. C-27, pp. 966-968, Oct. 1978. [5] M. J. Narasimha and A. M. Peterson, “On the Computation of the Discrete Cosine Transform”, IEEE Trans. on Communications, vol. COM-26, pp. 934-946, Jun. 1978. [6] M. Soderstrand, W. Jenkins, G. A. Jullien, and F. J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. IEEE Press Reprint Series. IEEE Press, 1986. [7] N. Szabo and R. Tanaka, Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, 1967. [8] Altera Corporation, “1998 Data Book”, Jan. 1998. [9] A. García, U. Meyer-Bäse and F. J. Taylor, “Pipelined Hogenauer CIC Filters Using Field-Programmable Logic and Residue Number System”, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Seattle, WA, vol.5, pp. 3085-3088, May 1998. [10] P. G. Fernández, A. García, J. Ramírez, L. Parrilla and A. Lloris, “A New Implementation of the Discrete Cosine Transform in the Residue Number System”, in 33rd Asilomar Conf. on Signals Syst. and Comp., CA, 1999. [11] M. Griffin, F. J. Taylor, and M. Sousa, “New scaling algorithms for the Chinese Remainder Theorem.”, in 22nd Asilomar Conf. on Signals, Syst .and Comp., CA, 1988.
V-324
Lihat lebih banyak...
Comentários