Novel high-radix residue number system architectures

June 8, 2017 | Autor: Vassilis Paliouras | Categoria: Residue Number System, Time Complexity, Electrical And Electronic Engineering, Complexity reduction

Share Embed

Denunciar este link

Descrição do Produto

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

1059

Novel High-Radix Residue Number System Architectures Vassilis Paliouras, Member, IEEE, and Thanos Stouraitis, Senior Member, IEEE

Abstract—Novel radix- moduloarithmetic units for residue number system (RNS)-based architectures are introduced in this paper. The proposed circuits are shown to require several times less area than previously reported architectures for particular moduli of operation, while also being preferable in the area time complexity sense. The complexity reduction is achieved by extending the carry-ignore property of modulo-2 operations to radices higher than two, which are not powers of two. The carry-ignore property is efficiently exploited by introducing simplified digit adders, instead of general radixadders. The proposed simplification of digit adders is possible, since the maximum values of certain intermediate digits produced in the architecture are found to be less than 1. Detailed area and time complexity models are derived for the arithmetic units. The proposed radix- architectures include multipliers, adders, and merged multipliers-adders. In addition, efficient radixbinary-to-residue and residue-to-binary conversion techniques and architectures are introduced. Index Terms—Addition, multiplication, residue arithmetic, residue conversion, residue number system.

I. INTRODUCTION

A

MONG alternative digital representations, the residue number system (RNS) [1], [2] has attracted a lot of interest, since it features highly parallel carry-free addition and multiplication and borrow-free subtraction. RNS architectures are particularly suitable for the implementation of multiplication/addition-intensive algorithms, since the efficiency of RNS multiplication/addition may compensate the unavoidable conversions’ overhead. Additionally, RNS architectures are inherently fault isolating and allow for easy error detection and correction. However, being a nonweighted number representation, RNS suffers from complicated division, magnitude comparison, sign and overflow detection, which prohibit its wide-spread use in general-purpose computer systems. Various VLSI architectures have been proposed for RNS operations; they rely on memory table lookup, combinatorial logic, or combination of both [3]. Memory-based designs are regular and modular, hence they are amenable for VLSI implementation. While small moduli lookup table-based architectures are more efficient, the situation changes for large moduli [4]. Combinatorial RNS architectures are of interest, as they can become more efficient than table look-up architectures with increasing modulus size [5], [6]. Recently, a variety of adder-based architectures have been proposed in the literature Manuscript received April 1999; revised June 2000. This paper was recommended by Associate Editor N. Ranganathan. The authors are with the VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, 26500 Patras, Greece. Publisher Item Identifier S 1057-7130(00)09324-1.

[4]–[6]. Piestrak presented a technique for building residue generators and multi-operand adders, featuring the exploitation of the periodicity of the residues of power of two offered by certain numbers as moduli [7]. The architectures utilize radix-2 arithmetic. Stouraitis et al. [6] presented full adder-based architectures for RNS multiply–add operations. DiClaudio et al. introduced the pseudo-RNS representation, which enables building reprogrammable-modulus multipliers, allows for systolization, and simplifies the computation of digital signal processing (DSP) algorithms, such as finite impulse response (FIR) filtering, correlations, and discrete Fourier transform (DFT) [5]. High-radix techniques for the acceleration of multiplications have been proposed in the literature. Takagi presented a radix-4 modular multiplication hardware algorithm suitable for large-modulus operations [8]. Soderstrand and Escott have proposed merging of RNS with the multiple-valued logic paradigm, aiming at the VLSI implementation of FIR filters. In their approach, the selection of the number of levels to represent multiple-valued logic signals is consistent with the modulus of operation, providing a natural representation for and RNS [9]. Bases which include moduli of the form have been widely used in RNS architectures, as they provide time for low-complexity implementations [10]. The area efficiency of the corresponding arithmetic units stems from the short word length of the operands (as implied by RNS) and their resemblance of simple conventional binary arithmetic architectures. However, to comply with the RNS requirement for pair-wise relatively prime moduli in the base, no more than can be utilized. Therefore, not all one modulus of the form residue channels can exploit the efficiency of architectures for operations modulo . In this work, new architectures for multiplication, addition, merged multiplication-addition, and conversion are introduced, based on the extension of the carry-ignore property to radices , . It is shown that significant complexity reduction is achieved by modulo- multipliers, compared to previously reported combinatorial designs [5], [6], without adopting multiple-valued logic. The exploitation of the introduced architectures circumvents the restriction of using a single modulus channel for which the carry-ignore property is applicable. This , such that is achieved by using different moduli of the form (1) Hence, more than one carry-ignore channels can be included in an RNS architecture. Notice that the remainder of the moduli channels can be implemented using any of the conventional radix-2 architectures.

1057–7130/00$10.00 © 2000 IEEE

1060

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

To exploit the carry-ignore property, radix- digit adders of low complexity should be developed. In the proposed arithmetic units, low complexity becomes feasible, as it is shown that the digits of the intermediate results do not assume the full range of values possible for a radix- digit. In particular, the general radix- adders required in a straightforward radix- architecture are reduced to adders, which do not need to manipulate the full range of single-digit radix- values. Therefore, simpler hardware is required for their implementation. The derived maximum values do not depend on the modulus of operation and, in some cases, do not depend on the radix ; therefore, their applicability is extended beyond modulo- arithmetic units. To illustrate the impact of the proposed arithmetic units on a complete RNS-based architecture, design techniques for the derivation of radix- binary-to-residue and radix- residue-to-binary converters are also introduced. The binary-to-radix- residue converter architecture is based on specialized cells of low complexity, while the radix- residue-to-binary conversion can be performed by modifying an existing architecture [11]. However, it is shown that introduced adder-based structures can also be utilized in the radix- residue-to-binary conversion to improve area time efficiency. The selection of a particular architecture to implement a residue channel should minimize a performance measure such time complexity. To facilitate the as area, time, or area selection procedure, area and time complexity bounds for the proposed architectures are derived. The remainder of the paper is organized as follows. Some RNS and radix- arithmetic basics are offered in Section II. Section III introduces the modulo- radix- multiplier architecture, Section IV focuses on modulo- radix- addition, and Section V focuses on merged radix- modulo- multiplication-addition. Binary-to-radix- residue and radix- residue-tobinary conversion architectures are discussed in Section VI. Performance comparisons to previously published architectures are presented in Section VII. Finally, conclusions are discussed in Section VIII. II. REVIEW

OF

BASIC RNS CONCEPTS ARITHMETIC

AND

RADIX-

r> n p

Fig. 1. The architecture of the proposed high-radix multiplier ( 3, = 5). Links from the preprocessor to FAs, which correspond to digits , are not shown.

particular, let and be the RNS representations of the integers and , respectively. Then the RNS representation of the product, the sum, or the difference of the two tuples denoted , is given by as (5) (6) where denotes addition, subtraction, or multiplication. In this paper, the processing of integers modulo is investigated. Assume an integer of radix- word length , such that (7) where is

for

. Then the residue of

modulo

(8)

An RNS is defined through a set of integers (2) called the base. The integers i.e.,

are pair-wise relatively prime, (3)

for that every integer , sented as an -tuple

. Restriction (3) is necessary to assure , can be uniquely repre(4)

denotes the residue of modulo . The main benwhere efit of adopting RNS to perform arithmetic operations is that it allows for the totally parallel addition, subtraction, and multiplication of operands expressed as -tuples of the form (4). In

, when and , for . since , do not contribute to the residue Therefore, the digits , . It is the combined exploitation of the carry-ignore property and the low-complexity radix- reduced digit adders that leads to the substantial performance improvement for processing of residues modulo . III. THE PROPOSED RADIX- MULTIPLIER MODULO An organization for the proposed high-radix multiplier is shown in Fig. 1. Let the multiplier and (HRM) modulo of two integers and the multiplicand be residues modulo . They are given via (8) as (9)

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

1061

(10) are radix- digits, . where , Then, the function of the HRM, i.e., the computation of the , can be described by product (11) of the digits of the multiplier and The products multiplicand, are two-digit radix- values

of the

(12) is the carry digit and is the sum digit. Since , for , it follows that the products for which , do not contribute to . Similarly, any or more, does not contribute other digit having a weight of to a residue modulo . Such digits are: ; 1) the carries produced by adding digits of weight of the products for 2) the most significant digit . which operation in (11) does not impose Therefore, the modulo any computational—and subsequently hardware—complexity on the HRM architecture. The HRM consists of two stages, namely the preprocessor, which generates the two-digit prod, and the adder array, which sums the outputs of ucts the preprocessor as dictated by the double summation in (11). Several types of digit adders are developed and used in this adder array. Their usage is justified in Section III-B.

where

A. The Preprocessor and is organized The preprocessor computes the products as a collection of cells, operating in parallel, each cell of which , given by (12). While may produces a two-digit result , the carry assume all radix- legitimate values, i.e., may only assume a limited set of values. Since the maxdigit is less than , imum value which can be assumed by the binary word length required to represent a carry digit is reduced; thus hardware complexity is reduced due to minimizing both wiring and processing circuitry complexity. The possible are computed by the folmaximum values of carry digits lowing Theorem. Theorem 1: The maximum value of the carry digit generated . at a preprocessor cell is Proof: Since the multiplier and the multiplicand are enof the product coded in radix , the maximum value in (12) is

Fig. 2. A possible layout of the a radix-5 modulo-125 multiplier, occupying 698 m 546 m.

2

TABLE I PERFORMANCE OF PREPROCESSOR CELL TYPES FOR VARIOUS RADICES. THE CELLS HAVE BEEN OPTIMIZED WITH MENTOR GRAPHICS’ AUTOLOGIC II, IN A 0.7-m STANDARD-CELL LIBRARY

where and are the carry and sum digits, it follows . that the maximum value of the carry digit is Hence, for a radix-3 multiplier preprocessor, it is obtained that the maximum value of the produced carry digits is , i.e., a single-bit value. Two types of preprocessor cells are required, since the cells which multiply the most significant digits of the multiplier and the multiplicand, need not produce a carry digit, due to the carry-ignore property of a modulooperation. Cells PP produce a carry, while cells of type CP do not produce a carry digit. The performance of the two types of and is summarized preprocessor cells for radices in Table I. The area complexity is quantified in terms of the area occupied by a 2-input NOR gate, denoted as , while the time complexity is normalized to the corresponding gate delay . Efficient preprocessor cells can be obtained by synthesizing and optimizing a VHDL description using commercial logic synthesis tools such as Mentor Graphics’ Autologic II [12]. B. The Radix- Digit Adders of the Adder Array

(13) Since

can be written in radix- form as (14)

generated by the preprocessor, are summed The products in the second stage of an HRM, which is an array of radix- digit adders, as shown in Fig. 1. The introduced array architecture is modeled after the carry-save adder array [13] and it consists of several types of digit adders organized into columns. As carry digits of different maximum value are produced in the HRM,

1062

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

the need for minimal hardware and time complexity dictates that several types of digit adders should be employed to compose the and are adder array. Digits with maximum values produced at the preprocessor due to Theorem 1; hence they are added by digit adders at the top of each column. These adders, the input digits to which have a maximum value of either or , are called full adders (FAs) and produce carries of maximum value 2, as proven in Theorem 2 below. Therefore, in the or , capable next most significant column, digit adders of processing carries of maximum value 2 are placed below the FAs. The particular digit adders produce a carry of maximum value one. This can be proven as in Theorem 2, and the proof or , which process single-bit is omitted. Finally, adders carries, are placed at the lowest part of the column. Hence, a column consists of adders, which process digits of maximum , , 2, or 1. All types of digit adders have at value least one input which can receive a digit spanning the range , to accommodate a sum digit and—in this way—to reflect the organization of the array into columns. Theorem 2: The maximum value of a carry generated by an , and one, if . FA cell is two, if Proof: The maximum two-digit result produced by a digit adder in the array is

TABLE II MAXIMUM VALUE OF THE CARRY DIGIT PRODUCED AT EACH ADDER TYPE. THE DASHES IN THE RADIX r = 3 COLUMN DENOTE THAT ACCORDING TO THEOREM 2, THE CORRESPONDING CASES CANNOT OCCUR

TABLE III AREA AND TIME COMPLEXITY OF DIGIT ADDER TYPES FOR VARIOUS RADICES. EACH DIGIT ADDER HAS BEEN OPTIMIZED WITH AUTOLOGIC II, IN A 0.7-m STANDARD-CELL LIBRARY. EMPHASIS HAS BEEN GIVEN ON REDUCED TIME COMPLEXITY

(15) then an integer exists, Two cases are distinguished. If . Hence, (15) can be written as such that (16) , the generated carry digit may assume a maximum Since then (15) gives value of two. If (17) Hence, in a radix-3 architecture the maximum value of a carry digit generated by an FA equals one. The maximum value assumed by the carry digits generated by each possible digit adder type, can be found in a manner similar to that of Theorem 2. The particular maximum values are shown in Table II. Determining the maximum value of the carry is important, since it allows for employing fewer bits to represent the carry digit than the general radix- digit. In this way, the related hardware complexity is minimized, because wiring is reduced and the circuits, which produce or use the carries, are simplified. The performance of the various types of digit adders is shown in Table III. The operation of each digit adder is described in the following. 1) FA: adds two radix- digits and a carry generated at the preprocessor. : adds a radix- digit and two carries generated by 2) FAs. : adds a radix- digit to a carry generated by an FA. 3) : adds a radix- digit with two carries generated by 4) , , , or digit adders. : adds a radix- digit with one carry generated by 5) , , , or digit adders.

Fig. 3. General organization of the various types of digit adders.

All types of digit adders can be further reduced, when located th column of the array, since no carry digit needs at the to be produced by adders of the particular column due to the carry ignore property. Reduced adders are denoted as the primed versions of the corresponding adder types and their complexity is displayed in Table III. The large variations in the performance of the various digit adder types noticed in Table III, reveal the practical importance of employing several types of adders to exploit the limited set of values assumed by the carry digits. The general organization of a digit adder requires two stages and is shown in Fig. 3. The first stage is a radix-2 three-operand adder, while the second one is a combinatorial logic structure, which translates the intermediate result produced by the first stage into a two-digit radix- legitimate result, composed of a radix- carry and a radix- sum digit. The two stages can be merged into a combinatorial logic implementation, especially for digit adders of lower complexity such as

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

adders. Both the first and the second stage can be reduced by exploiting don’t-care input combinations or by merging the two . For example a stages for simple types of adders, such as radix-5 FA receives a two-bit carry and two three-bit digits, the maximum value of which is 4. Therefore digit values 5–7 are don’t-care combinations and can be used in logic minimization. Furthermore, the maximum value of the intermediate result , which allows produced by the first stage, is the don’t-care input combinations 12–15 for the optimization of the second stage. The complexities shown in Table III are obtained by optimizing a gate-level netlist with Autologic II. C. Complexity of the High-Radix Multiplier Modulo In this section, the complexity of an HRM is quantified by deriving the number of the various adders and cells which it comprises. The first stage of the HRM is the preprocessor, which digit products, such that computes the . Equation (11) reveals that products correspond preprocessor cells are to the th digital position; therefore required at the particular digital position. The preprocessor area is the sum of the complexity of the PP complexity cells, which produce a carry digit, and the complexity of the CP cells at the th column, which do not produce a carry digit, i.e., (18) where

and

can be computed as (19)

and

1063

Therefore, the total number of FAs in an array of a moduloHRM is (22) th column, the most significant one of Notice that the FA digit adders. To derive the array, is composed of the number and type of the remainder of the digit adders which—along with FAs—compose a column, the number and type of carry digits produced at the th column, are determined in the following section. 1) Types of Carry Digits in the HRM- Array: Two types of carry digits are produced within the adder array, those that can assume a maximum value of two and those with maximum value of one. The particular maximum values do not depend on and denote the number of the adopted radix . Let carry digits with maximum value of one and two, respectively, which are added by the th column of adders. The carries of maximum value 2 to be added at the th column, are produced th column. Hence, from (21), it is only by the FAs of the obtained that FA

(23)

and . for Carries of maximum value one are produced by two types of digit adders: 1) the digit adders which process the carry digits (having a maximum input value of two) produced by FAs in the nd column; 2) the digit adders which process one-bit carries generated nd column, when . at the th column FA carries generated by In the one-bit carries are processed. Hence, FAs and can be computed in the following recursive way:

(20) , The CP cells are used because at the digital position of a product needs to be computed due to only digit the carry-ignore property. Furthermore, the products of weight are ignored. greater than The second stage of the HRM is, as expected, an adder columns, indexed by array. The adder array consists of . The th column adds digits of weight . least significant In particular, the th column adds the products of weight and the most significant digits of the products of weight . digits of the Hence, if a number FA of FAs are required and organized into the th column, the top adder of which column receives two digits, while each of the remaining adders in the column receives one, then the number FA can be evaluated as follows:

FA FA

(21)

(24) with the initial condition . By induction, the for following theorem is derived. Theorem 3: In the radix- carry-save adder array organized into columns, the number of carry digits with maximum value of one added in the th column, is given by (25) . Proof: The proof is offered in Appendix A. 2) Organization of the Adder Columns: The numbers and of the two types of carry digits added by the th column, given by (25) and (23) respectively, are utilized in the computation of the number and type of digit adders, which compose a column. Apart from the FAs, which manipulate digits from the preprocessor, the th column comprises adders for processing the two types of carry digits generated at the less signifth column. The number of digit adders icant for

1064

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

TABLE IV THE NUMBER OF REQUIRED DIGIT ADDERS IN THE ADDER ARRAY OF AN HRM-r . THE REDUCED ADDERS OF THE (n 1)TH COLUMN ARE NOT INCLUDED; HOWEVER THEIR NUMBER CAN BE OBTAINED BY (21) AND (26)–(29)

0

where the s and s denote the complexities of the various digit adders and preprocessor cells. 4) The HRM- Delay: The delay of the path from the innd column to the output of the th puts of the column is given as FA (31) while the delay along the most significant column is FA (32)

in the th column, which add two carries of maximum value two and a radix- digit, is given by (26) is odd, an digit adder is required to process In case is given by the remaining carry digit. Hence, (27) In a similar way, it can be found that the number digit adders in the th column is

of

(33) (28)

and the number of

where the s denote the preprocessor cell and digit adder delays. Delay values for digit adder and preprocessor cell implementations using a 0.7- m CMOS standard-cell library are shown in Tables I and III. Depending on the particular gate-level implementation of the various types of digit adders, the radix , and the power , the th maximum-delay critical path is defined by either the nd column to the column or the path from the top of the st one. Therefore, the longest path delay is output of the given by

IV. RADIX- MODULO-

ADDITION

Let and be residues modulo of two integers and , as defined in (9) and (10). Then the residue of their is sum modulo

digit adders is given from (29)

. where Equations (21) and (26)–(29) provide the number and the type of digit adders per digital position (digit adder column in the particular architecture). Since, by virtue of the carry-ignore property, the number of columns is known, exact complexity formulas can be derived by computing the sums of all adder column . The particular complexities over the range complexities are computed in Appendix B and the derived formulas are summarized in Table IV. 3) The HRM- Area Complexity: The total hardware comof the HRM- is the weighted sum of the plexity number of each digit adder type in all columns as well as the complexity of the preprocessor given by (18). Hence, by taking into consideration the complexities of the digit adders shown in Tables I and III, the following expression of the total HRMcomplexity is obtained:

(34) An architecture for the radix- modulo- adder is based on digit adders, named Complete Full Adders (CFAs). An CFA sums two radix- digits and a single-bit input carry to produce a full radix- sum digit and a single-bit carry out digit. The carry out digit is always a single-bit quantity, independently from the radix of operation , because the maximum two-digit result produced by a CFA is (35)

FA (30)

Let CFA be a CFA which does not generate a carry, and should be placed in the most significant column, due to the carry-ignore property. A radix- architecture for the implementation of (34) radix- CFAs and a reduced one, which does comprises not generate a carry digit, as shown in Fig. 4. Therefore, the area

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

Fig. 4.

r

A high-radix modulo-

1065

adder modeled after the ripple-carry adder.

Fig. 6.

n = 5).

r

Organization of a general radix- merged multiplier-adder (

r>

3,

The preprocessor complexity is identical to the complexity of a multiplier preprocessor, i.e., (40)

n

Fig. 5. The organization of a radix-3 merged multiplier-adder ( = 4).

complexity given as

of a two-operand radix- modulo-

The total number of FAs is given as

adder is FA The total number of

while the maximum delay of

(41) digit adders is given as

is (42) The total number of

digit adders is given as

V. MERGED RADIX- MULTIPLICATION-ADDITION (43)

A. The Radix-3 Case , , and denote residues modulo . A Let radix-3 merged multiplier-adder, which performs the operation

The complete area complexity of a radix-3 merged multiplieradder is given as

(36) , can be built using three types of digit adders, namely FA, , as shown in Fig. 5. and The required number of FAs in the th column is given by FA

(44)

(37)

, as, in addition to the preprocessor where digits, a digit from the addend is also considered. The number and digit adders per column are given as of (38) (39)

B. General Radix- Merged Multiplier Adder Merged Multiplier-Adder The general radix- modulo(MMA) can be derived from the corresponding multiplier architecture by introducing an additional CFA at each column. The organization of the radix- MMA is shown in Fig. 6. The complexity of the MMA can be computed in a manner , , similar to the HRM. In fact, the numbers of FA, , and digit adders for the MMA, are identical to those

1066

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

required for the HRM. However, the introduction of one CFA per column imposes an additional area complexity of (45) th column. The time complexity is and a CFA at the computed in a manner similar to the HRM case and is given by (46) where

and

are evaluated from (31) and (32).

VI. RADIX- BINARY-TO-RESIDUE AND RESIDUE-TO-BINARY CONVERSION

Fig. 7. The simplified digit adders used in the proposed high-radix binary-to-residue converters. The word lengths of the inputs and outputs of the adder are noted and n = log (r 1) + 1.

b

0 c

TABLE V DECOMPOSITION OF AN 8-BIT INPUT WORD INTO HIGH-RADIX RESIDUES OF THE BINARY WEIGHTS

In this section, efficient architectures for radix- modulobinary-to-radix- residue and radix- residue-to-binary conversions are introduced. A. High-Radix Binary-to-Residue Modulo-

Converters

An architecture for a high-radix binary-to-residue (HRBR) modulo- converter is proposed. The HRBR architecture directly converts a binary integer to a radix- residue. Consider an -bit integer written in binary form as (47) . The computation of the residues modulowhere of both sides of (47) allows the evaluation of the residue (48)

(49) with where . The modulo operation in (49), does not require any computation, due to the carry ignore property. Hence, the decomposition recursions used in [14] are avoided. In particular, the th column, carry digits, which could be produced at the should be ignored and, therefore, the particular carries are not generated; hence the corresponding hardware is eliminated. is performed The computation of the radix- residue by an array of special-purpose cells, called simplified digit adders (SAs). The HRBR converter can be conceived as a generalization of the binary-to-residue modulo- converters, and, which decompose the binary input into residues subsequently, conditionally add them, depending on the value In the proposed HRBR converter, each residue of , , is expressed in radix- format , the high-radix digits of which, are added by a suitable radix- SA array. The SAs do not perform general radix- addition. Instead, SAs placed at the th column add to their input, when the corresponding input constant values , for any for which bit is set, . Therefore, SAs can be used in implementing (49),

which implies that either or 0 is added to the th radix- digit column, depending on being 1 or 0, respectively. A three-character label is assigned to each SA to reveal its functionality as has three input described in the following. An SA labeled ports, namely , , and , as shown in Fig. 7(a). The value and assigned to port can be any radix- digit, while ports receive bits, which, when asserted, denote that the constants and/or respectively, should be added to the result. The values of (49). If the bits constants and correspond to the are assigned to the ports and , respectively, computes the radix- two-digit sum the SA of type (50) if if

and

if

and

(51) (52)

is the radix- digit computed by the SA placed above where . An SA placed at the top of a column, is labeled , as shown in Fig. 7(b), and it returns (53) are aswhere it is assumed that the input bits signed to the ports , , and , respectively. It should be noted and are used as building that, when SAs of types are either bits of the blocks of an array, the bits input , or carry bits from the less significant column. A carry bit can be assigned to any input port, which corresponds to an ad. dend The definition of a HRBR converter architecture is illustrated by means of an example. Assume a modulo-125 8-bit converter.

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

1067

Fig. 8. The high-radix binary-to-residue 8-bit modulo-125 converter architecture and a possible layout in a 0.7-m CMOS standard-cell library, occupying an area of 307 m 406 m. Variables A, B , and C of Fig. 7 have been replaced with their corresponding values.

2

In the particular case, it is , , and . The corvalues used in (49), are shown in Table V, while responding the converter architecture and a possible layout are shown in Fig. 8. The area and time performance of an 8-bit and a 12-bit modulo-125 HRBR converter are compared to previously reported conversion schemes [7], [14] in Table VI. ROM complexity is quantified by the model used in [5]. The particular examples demonstrate that the proposed conversion of a binary integer into a radix- residue can be less complicated than the conventional radix-2 residue conversion. Therefore, the performance improvement achieved by the high-radix architectures presented in previous sections is not compromised by binary-toradix- residue conversion overhead. In fact, further improvement becomes possible by means of the HRBR converter. B. High-Radix Residue-to-Binary Converters Chinese Remainder Theorem (CRT)-based techniques for residue-to-binary conversion are discussed. The total number required to represent a high-radix residue modulo of bits is given by (54) radix- digits, each of binary word length . However, the residue modulo , expressed in radix-2 format, has a word length of

TABLE VI AREA AND TIME COMPLEXITY OF THE PROPOSED HRBR CONVERTERS FOR 8- AND 12-BIT INPUT WORDS AND MODULO r = 5 = 125

partial sum generator, a partial sum adder, a range determinator, and a final converter. To accommodate a radix- residue, only the partial sum generator is altered. In particular, the residue bit partitioning, used to reduce the number of partial sums and maintain moderate lookup table size, can also be applied in radix- arithmetic, as shown below. Assume that the th residue of an RNS is

as there are

(55) , the high-radix representation imposes exWhen actly identical complexity to that of radix-2 residue decoding, assuming that the CRT-based residue decoding architecture introduced by Elleithy et al. [11] is employed, which consists of a

(56) where are the radix- digits, each of which is represented in binary form as (57)

1068

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

where and (57), it is obtained that

. From (56) and

TABLE VII BIT WEIGHTS FOR THE DEFINITION OF THE FULL ADDER-BASED p GENERATOR IN THE INVERSE CONVERSION EXAMPLE

(58) Following [11] and by taking into account the radix- representation (58) of , it follows that (59)

(60)

TABLE VIII AREA COMPLEXITIES OF THE HIGH-RADIX MULTIPLIER AND OF PREVIOUSLY REPORTED ARCHITECTURES. RADIX-3 DIGIT ADDERS OPTIMIZED FOR LOW TIME COMPLEXITY ARE ASSUMED

(61) , , and denotes where multiplicative inverse of . Notice that the the moduloequality of (60) to (59) is due to the particular relation of the with the product terms. Equation (61) reveals that modulus the partial sum generator proposed in [11] is also applicable in radix- arithmetic. , to prevent the increase in lookup table In the case size, an 1-bit radix-2 full adder-based architecture can be em. In ployed to evaluate the partial sum particular, can be computed by decomposing the residue into bits , each of weight , since it is obtained from (58) that

TABLE IX TIME COMPLEXITIES OF THE PROPOSED HIGH-RADIX MULTIPLIER AND OF PREVIOUSLY REPORTED ARCHITECTURES

(62) Subsequently, 1-bit full adder-based structures can be derived [6], [15], the efficiency of which depends on the choice of the and RNS base, since the base dictates the actual values of . As an example, consider the base . It follows that , , , , while the multiplicative inverses are , and , and . Notice that for it is . The partial sum in (60), which correis . sponds to are shown in Table VII for The weights of the bits . By assigning each to the digital columns dictated by the corresponding weight, a radix-2 full adder-based architecture for the evaluation of is obtained. has a weight of , it For example, as should be added to all columns, which correspond to a “1” in the binary expression of the weight [15], i.e., to the th signif. In the particular icant columns, where and example, an architecture with an area complexity of can be obtained. For comparison, a ROM-based a delay of requires a lookup table of architecture for the evaluation of

TABLE X

AREA

2 TIME COMPLEXITY OF VARIOUS RESIDUE MULTIPLIERS AND THE REDUCTION FACTOR ACHIEVED

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

1069

Fig. 9. Comparative complexities of various modulo-5 multipliers.

size bits, which has a complexity of at a delay . Although , in the particular example the proof posed full-adder-based generator is better in an area time sense than the lookup table-based generator. Therefore, even , the radix- residue decoding can be performed when efficiently.

VII. PERFORMANCE COMPARISONS The performance of the radix- modulo- HRM is compared to that of previously reported architectures [5], [6], [16], as well as to the performance of a straightforward ROM lookup table-based architecture. Comparison to [6] is performed by deriving an architecture as described in [6] for each modulus and evaluating its performance. Two models of ROM complexity are employed in the comparisons, one based on experimental performance data of ROM cells generated by a commercial tool [17] and the ROM complexity model utilized by DiClaudio et al. [5]. The architectures are compared in terms of area, time, and area time complexities, as shown in Tables VIII–X. By using the digit adder complexities shown in Table III, and the area and time complexities (31) and (33), area and time comand plexity bounds of the proposed radix- HRM for are obtained. Tables VIII–X show that radix- HRM reduce the area, time, and area time complexity in all cases but one by a larger than one factor, and, for some moduli, complexities are reduced several times. Reduction factor is the ratio of the complexity of an architecture over the corresponding complexity of the proposed architecture. In these tables, the experimental ROM performance model is assumed.

The relation between the radix- word length , i.e., the number of the required radix- digits, of a residue moduloand a corresponding radix-2 word length of a residue is modulo (63) Equation (63) is utilized in the area, time, area time, and area time square complexity comparisons of the HRM moduloto the published architectures [5], [16] and to a direct ROM lookup-table implementation, which are shown in Figs. 9 and 10. ROM performance in Figs. 9 and 10 occurs by assuming the complexity model utilized in [5]. Fig. 9 illustrates that radix-5 HRMs are better in the area, area time, and area time square sense, while they are slower for . Fig. 10 shows that radix-3 HRMs are slower, but they are of the smallest area, area time, and area time square complexity. The radix-3 digit adder complexities of Table III are used for the area time square complexity diagram in Fig. 10, while the remaining of the plots in Fig. 10, assume digit adders optimized for low area complexity. The complexity measures of the proposed HRMs utilized in the comparisons, are upper bounds of the actually achieved performance, since the HRMs can be further optimized by modern logic synthesis and CAD optimization tools. For example, by optimizing the modulo-125 radix-5 HRM architecture with Autologic II, an time complexity of 13 650 is obtained, which is area obtained from Table X. less than The area time complexity of optimized radix-5 moduloadders is summarized in Table XI, where it is shown that they are comparable to offset residue adders [18]. Therefore, the radix-

1070

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

Fig. 10.

Comparative complexities of various modulo-3 multipliers.

TABLE XI

AREA

2 TIME COMPLEXITY OF THE PROPOSED RADIX-r ADDER

approach provides an area time efficient solution for a residue channel, which comprises conversions, multiplication, and addition. VIII. CONCLUSION Combinatorial arithmetic units based on radix- arithmetic to implement modulo- operations, which include multiplication, addition, and merged multiplication-addition have been introduced. Furthermore, efficient radix- residue-to-binary and binary-to-radix- residue techniques have been described. A comparative performance evaluation has illustrated that area time complexity can be reduced several times for moduli of time efficiency stems the form . The demonstrated area from the exploitation of the carry-ignore property, which is inherent in modulo- operations, and from the adoption of low-complexity digit adders. The complexity reduction of the latter is due to the exploitation of the incomplete set of values which can be assumed by interim digits in the architectures.

The actual maximum values are identified and are expressed as a function of the radix , in case they are not constants. The maximum digit values do not depend on the adopted modulus of operation ; hence, they could be exploited in RNS architectures independently of the form of the modulus. However, the performance of such architectures is not dealt with in this paper. Finally, to facilitate the evaluation of the proposed arithmetic units, when exploring the area-time design space of an RNS-based VLSI DSP system, complexity bounds have been offered. The interest in accelerating computations and reducing the corresponding area requirements has recently received a new dimension; the reduction of power dissipation. Coarsely and assuming a constant activity ratio, area reduction reflects switching capacitance reduction, while acceleration can be exploited to reduce the supply voltage, potentially leading to power dissipation savings [19]. Therefore, the RNS concept and the proposed architectures are worth investigating as a means of reducing power dissipation in signal processing. The proposed radix- arithmetic units may prove an interesting alternative for adoption in RNS DSPs. APPENDIX A PROOF OF THEOREM 3 , i.e., Initially, it is noted that the claim (25) holds for , as no one-bit carries are generated at columns for , while a single carry of maximum value two is genwhich . Subsequently, assume that (25) holds erated at column , i.e., for a particular (64)

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

1071

It is now proven that if (64) is true, then (25) holds when , i.e., that is true. By applying (24) for , it follows that

(75) Hence (73) is true. Similarly, by expanding the sum in (74), it is obtained that

(65) Two cases may be distinguished, depending on being an even for an or an odd integer. Assume that is even, i.e., integer . In this case, the two terms of the sum in (65) can be written as (66) (67) Hence, by replacing (66) and (67) into (65) it is obtained that, for even

(76) Hence (74) is true. A. The Number of

Digit Adders

From (26), the total number of the is given by columns

digit adders in the

(68) for an integer . Then Assume that is odd, i.e., the two terms of the sum in (65) can be written as

(77) Let be defined as

. Then (77) can be re-written as

(69)

(70) Hence, by using (69) and (70), (65) for odd

can be written as

(78) and because are distinguished depending on is odd, 1) when give

. Two cases being even or odd: , (78) and (74) for

(71) Therefore, from (68) and (71), it follows that for any value, even can be expressed as or odd, of

2) when

is even, give

, (78) and (73) for

(72) Therefore the claim (25) is true. Therefore, the total number of

APPENDIX B PROOFS OF THE COMPLEXITY BOUNDS

cells is odd

The proofs of the complexity measures in Table IV are based on the following proposition. Proposition 1: For any integer , it holds that

even when

(73)

.

B. The Number of

Digit Adders

From (28), the number of the

digit adders is given by

(74) (79) Proof: By expanding the left hand part of (73), it is obtained that

By the change of variable

, (79) is re-written as (80)

1072

as

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 10, OCTOBER 2000

and . Depending on being even or odd, two cases are distinguished: is even, i.e., , (80) and (73) for 1) when give

Therefore, the total number of

cells is even odd

when 2) when is odd, i.e., and (74) for

,

.

integer, (80) Digit Adders

D. The Number of

give

From (27) it follows that the number of least significant columns is the Therefore, the number of

digit adders in

cells is odd even

.

when

C. The Number of

(86) Digit Adders

From (29), the number of by

digit adders can be computed

applied to (86) gives

The change of variable

(87)

(81) has been applied and where the change of variable spans the range . Depending on being odd or even, two cases are distinguished: is odd, i.e., for an integer , 1) when , due to (74), it holds that

Depending on guished: 1) when integer that

2) When

being even or odd, two cases are distinis odd, i.e., , where is an from (87) and (74), it is obtained

is even, i.e., , where is an integer , from (87) and (73), it is obtained that

(82) From (81) and (82), it is obtained that

Therefore, the total number of

cells is odd

(83) even 2) When

is even, i.e., , (73) leads to

for an integer , when (84)

From (81) and (84), it is obtained that (85)

. REFERENCES

[1] N. Szabó and R. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York: McGraw-Hill, 1967. [2] F. Taylor, “Residue arithmetic: A tutorial with examples,” IEEE Comput., pp. 50–62, May 1984. [3] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, “A VLSI implementation of residue adders,” IEEE Trans. Circuits Syst., vol. CAS-34, pp. 284–288, Mar. 1987.

PALIOURAS AND STOURAITIS: NOVEL HIGH-RADIX RESIDUE NUMBER SYSTEM ARCHITECTURES

[4] K. M. Elleithy and M. A. Bayoumi, “A systolic architecture for modulo multiplication,” IEEE Trans. Circuits Syst. II, vol. 42, pp. 725–729, Nov. 1995. [5] E. D. DiClaudio, F. Piazza, and G. Orlandi, “Fast combinatorial RNS processors for DSP applications,” IEEE Trans. Comput., vol. 44, pp. 624–633, May 1995. [6] T. Stouraitis, S. W. Kim, and A. Skavantzos, “Full adder-based arithmetic units for finite integer rings,” IEEE Trans. Circuits Syst. II, vol. 40, pp. 740–744, Nov. 1993. [7] S. J. Piestrak, “Design of residue generators and multioperand modular adders using carry-save adders,” IEEE Trans. Comput., vol. 43, pp. 68–77, Jan. 1994. [8] N. Takagi, “A radix-4 modular multiplication hardware algorithm for modular exponentiation,” IEEE Trans. Comput., vol. 41, pp. 949–956, Aug. 1992. [9] M. A. Soderstrand and R. A. Escott, “VLSI implementation in multiple-valued logic of an FIR digital filter using Residue Number System arithmetic,” IEEE Trans. Circuits Syst., vol. 31, pp. 5–25, Jan. 1986. [10] I. Koren, Computer Arithmetic Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1993. [11] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, “(log n) architectures for RNS arithmetic decoding,” in Proc. 9th Symp. Computer Arithmetic, Santa Monica, CA, 1989. [12] M. Graphics, Autologic II Users’ Guide: Mentor Graphics, 1994. [13] K. Hwang, Computer Arithmetic. New York: Wiley, 1979. [14] T. Stouraitis, “Efficient convertors for residue and quadratic-residue number systems,” Proc. Inst. Elect. Eng..—G, vol. 139, pp. 626–634, Dec. 1992. [15] V. Paliouras, D. Soudris, and T. Stouraitis, “Systematic derivation of the processing element of a systolic array based on Residue Number System,” in Proc. IEEE Int. Symp. Circuits and Systems, San Diego, CA, 1992. [16] M. Taheri, G. A. Jullien, and W. C. Miller, “High-speed signal processing using systolic arrays over finite fields,” IEEE J. Select. Areas Commun., vol. 6, pp. 504–512, Apr. 1988. [17] ES2, European Silicon Structures, Technical Manual. Rousset, France: ECPD07 CMOS Digital Library, 1992. [18] J. C. Smith and F. J. Taylor, “A fault-tolerant GEQRNS processing element for linear systolic array DSP applications,” IEEE Trans. Comput., vol. 44, pp. 1121–1130, Sept. 1995. [19] A. P. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473–484, Apr. 1992.

1073

Vassilis Paliouras (S’90–M’95) was born in Thessaloniki, Greece, in 1970. He received the Diploma in electrical engineering in 1992 and the Ph.D. degree in electrical engineering in 1999, both from the Electrical and Computer Engineering Department, University of Patras, Patras, Greece. He is currently a Researcher with the VLSI Design Laboratory, University of Patras. His interests include computer arithmetic algorithms and circuits and digital signal processor architecture, areas in which he has published more than 25 conference and journal articles. Dr. Paliouras received the MEDCHIP VLSI Design Award in 1997. He is also the recipient of the 2000 IEEE Circuits and Systems Society Guillemin-Cauer Award. He is a Member of ACM, SIAM, and the Technical Chamber of Greece.

Thanos Stouraitis (S’82–M’86–SM’97) received the B.S. degree in physics and the M.S. degree in electronic automation from the University of Athens, Athens, Greece, in 1979 and 1981, respectively, the M.S. degree in electrical engineering from the University of Cincinnati, Cincinnati, OH, in 1983, and the Ph.D. degree from the University of Florida, Gainesville, in 1986. He is a Professor of Electrical and Computer Engineering at the University of Patras, Patras, Greece. He has served on the faculty of the University of Florida and the Ohio State University. He has published two books, several book chapters, and about 30 journal and 70 conference papers in the areas of computer architecture, computer arithmetic, VLSI signal and image processing, and lowpower processing. Dr. Stouraitis is currently the Chair of the IEEE Circuits and Systems (CAS) Society’s Technical Committee on VLSI Systems and Applications, while serving on the Digital Signal Processing and Multimedia Systems committees. He has served as the General Chair of the IEEE 3rd International Conference on Electronics, Circuits, and Systems (ICECS’96), Co-Program Chair for EUSIPCO’98, and regularly serves on the Program Committee of various CAS and SP conferences. He was awarded the Outstanding Ph.D. Dissertation Award of the University of Florida, a Certificate of Appreciation by the IEEE CAS Society in 1997, and the 2000 IEEE CAS Society Guillemin-Cauer Award.

Lihat lebih banyak...

Novel high-radix residue number system architectures

Descrição do Produto

Comentários