A Real/Complex Logarithmic Number System ALU


IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 2, FEBRUARY 2011

A Real/Complex Logarithmic Number System ALU Mark G. Arnold, Member, IEEE, and Sylvain Collange Abstract—The real Logarithmic Number System (LNS) offers fast multiplication but uses more expensive addition. Cotransformation and higher order table methods allow real LNS ALUs with reasonable precision on Field-Programmable Gate Arrays (FPGAs). The Complex LNS (CLNS) is a generalization of LNS, which represents complex values in log-polar form. CLNS is a more compact representation than traditional rectangular methods, reducing bus and memory cost in the FFT; however, prior CLNS implementations were either slow CORDIC-based or expensive 2D-table-based approaches. Instead, we reuse real LNS hardware for CLNS, with specialized hardware (including a novel logsin that overcomes singularity problems) that is smaller than the real-valued LNS ALU to which it is attached. All units were derived from the Floating-Point-Cores (FloPoCo) library. FPGA synthesis shows our CLNS ALU is smaller than prior fast CLNS units. We also compare the accuracy of prior and proposed CLNS implementations. The most accurate of the proposed methods increases the error in radix-two FFTs by less than half a bit, and a more economical FloPoCo-based implementation increases the error by only one bit. Index Terms—Complex arithmetic, logarithmic number system, hardware function evaluation, FPGA, fast Fourier transform, VHDL.

1

INTRODUCTION

The usual approach to complex arithmetic works with pairs of real numbers that represent points in a rectangular coordinate system. To multiply two complex numbers, denoted in this paper by upper-case variables, X and Y, using rectangular coordinates involves four real multiplications:
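The contrast that motivates CLNS, four real multiplies in rectangular form versus plain additions in log-polar form, can be sketched in double precision. This is an illustrative sketch of ours (the class and method names are not from the paper, and the actual ALU operates on fixed-point log-polar words, not doubles):

```java
// Illustrative sketch (ours, double precision): rectangular complex
// multiplication costs four real multiplies and two adds, while the log-polar
// (CLNS-style) product needs only two additions on (log-magnitude, angle) pairs.
public class LogPolarMul {
    // Rectangular: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    public static double[] mulRect(double a, double b, double c, double d) {
        return new double[] { a * c - b * d, a * d + b * c };
    }
    // Log-polar form of X: (log2|X|, arg X)
    public static double[] toLogPolar(double re, double im) {
        return new double[] { Math.log(Math.hypot(re, im)) / Math.log(2),
                              Math.atan2(im, re) };
    }
    // Multiplication is componentwise addition: no multiplier required.
    public static double[] mulLogPolar(double[] x, double[] y) {
        return new double[] { x[0] + y[0], x[1] + y[1] };
    }
    public static double[] fromLogPolar(double[] p) {
        double mag = Math.pow(2, p[0]);
        return new double[] { mag * Math.cos(p[1]), mag * Math.sin(p[1]) };
    }
    public static void main(String[] args) {
        double[] viaRect = mulRect(1, 2, 3, -1);
        double[] viaLog = fromLogPolar(mulLogPolar(toLogPolar(1, 2), toLogPolar(3, -1)));
        System.out.printf("(%f, %f) vs (%f, %f)%n", viaRect[0], viaRect[1], viaLog[0], viaLog[1]);
    }
}
```

Both paths return (1 + 2i)(3 − i) = 5 + 5i; the point of CLNS is that the second path replaces multipliers with adders, at the price of a harder complex addition.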

c_2(x) =
  { c_2(−x),                 x < 0;
    logcos(π/4 · x),         0 ≤ x < 1;
    d_2(2 · c_2(2 − x))/2,   1 ≤ x < 2;
    −2^(w_ez + 1),           x = 2;
    d_2(2 · c_2(x − 2))/2,   2 < x < 3;
    c_2(4 − x),              3 ≤ x < 4;
    c_2(x − 8),              x ≥ 4.                    (18)

We can reuse (18) to yield logsin(π/4 · x) = c_2(x + 2) as well as logcos(π/4 · x) = c_2(x). Instead of −∞, the output at the singularity, −2^(w_ez + 1), is large enough to trigger essential zero in the later units, but is small enough to avoid overflow.
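A double-precision software model of the range reduction (18) makes the folding explicit. This is a sketch under our own naming; WEZ is an illustrative stand-in for w_ez, and the real unit works on quantized fixed-point values:

```java
// Double-precision model (ours) of the range reduction (18): every argument is
// folded into 0 <= x <= 1, where logcos(pi/4 * x) is well behaved.
public class C2Reduce {
    static final int WEZ = 8;                       // illustrative value
    static double log2(double v) { return Math.log(v) / Math.log(2); }
    static double d2(double z) { return log2(Math.abs(1 - Math.pow(2, z))); }
    // Base case: only ever evaluated for 0 <= x <= 1 (a table in hardware).
    static double logcosTable(double x) { return log2(Math.cos(Math.PI / 4 * x)); }
    public static double c2(double x) {
        if (x < 0)   return c2(-x);                 // cosine is even
        if (x <= 1)  return logcosTable(x);
        if (x == 2)  return -Math.pow(2, WEZ + 1);  // singularity -> essential zero
        if (x < 2)   return d2(2 * c2(2 - x)) / 2;  // |cos| = sqrt(1 - cos^2) fold
        if (x < 3)   return d2(2 * c2(x - 2)) / 2;
        if (x < 4)   return c2(4 - x);
        return c2(x - 8);                           // periodicity (avoid exact x = 4)
    }
    // Reusing (18): logsin(pi/4 * x) = c2(x + 2)
    public static double logsin(double x) { return c2(x + 2); }
    public static void main(String[] args) {
        System.out.println(c2(1.5) + " vs " + log2(Math.cos(Math.PI * 1.5 / 4)));
    }
}
```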


Fig. 1. CLNS ALU with complex and real modes.

derived in the Appendix. The operation of the sign logic unit is summarized in Table 1, which takes into account the special cases described in the Appendix. The sign logic unit controls the Four-Quadrant (4Q) Correction Unit, the s_b/d_b unit, and the output Multiplexor (Mux). This logic operates in two modes: complex mode, which is the focus of this paper, and real mode (which uses the s_b/d_b unit to perform traditional LNS arithmetic). In real mode, Z_θ = 0 is used to represent a positive sign, and Z_θ = π is used to represent a negative sign. Given the angular quantization, the LNS sign bit corresponds to the most significant bit of Ẑ_θ, with the other bits masked to zero.

In order to minimize the cost of the CLNS hardware implementation, we exploit several techniques to transform the arguments of the function-approximation units into limited ranges where approximation is affordable. Assuming b = 2, the reduction for the addition logarithm is typical of real LNS implementations:

s_2(x) =
  { 0,                x < −2^(w_ez);
    log_2(1 + 2^x),   −2^(w_ez) ≤ x ≤ 0;
    x + s_2(−x),      0 < x < 2^(w_ez);
    x,                x ≥ 2^(w_ez),                    (16)

where w_ez ≤ k_L describes the word size needed to reach the "essential zero" [16], the least negative value that causes log_2(1 + 2^x) to be quantized to zero. We use similar range reduction for the subtraction logarithm when z is far from zero, together with cotransformation [21] for z near zero.
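The reduction (16) can be sketched in software as follows. This is our own illustrative model: the middle case would be a ROM or HOTBM unit in hardware, and the WEZ value here is arbitrary:

```java
// Sketch of the addition-logarithm range reduction (16) with b = 2. Outside
// [-2^wez, 0] no table is needed: below the essential zero s2 rounds to 0, and
// positive arguments fold back into the table range via s2(x) = x + s2(-x).
public class S2Reduce {
    static final int WEZ = 5;                  // illustrative; real wez depends on f
    static double table(double x) {            // only consulted for -2^wez <= x <= 0
        return Math.log(1 + Math.pow(2, x)) / Math.log(2);
    }
    public static double s2(double x) {
        double ez = Math.pow(2, WEZ);
        if (x < -ez) return 0;                 // essential zero
        if (x <= 0)  return table(x);
        if (x < ez)  return x + s2(-x);        // fold positive arguments
        return x;                              // s2(x) ~ x for very large x
    }
    public static void main(String[] args) {
        // log2(2^3 + 2^1) computed as 1 + s2(3 - 1)
        System.out.println(1 + s2(2));
    }
}
```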

6 DIRECT FUNCTION LOOKUP

There are several possible hardware realizations for (26) and (34), which appear in the Appendix. When f_L and f_θ are small, it is possible to use a direct table lookup for ŝ_b, d̂_b, â_b, and ĉ_b. This implementation allows round-to-nearest results for each functional unit, reducing roundoff errors in (26) and (34). Fig. 2a shows the errors in computing the real part (26) as a function of Z_L and Z_θ with such round-to-nearest tables when f_L = f_θ = 7. These errors have a Root-Mean-Squared (RMS) value around 0.003. Fig. 2b shows errors for the imaginary part (34), with an RMS value around 0.002. The errors in these figures spike slightly near Z_L = 0, Z_θ = π because of the inherent singularity in S_b(Z). In contrast to later figures, here the only noticeable error spike occurs near the singularity. The memory requirement for each of the ŝ_b, d̂_b, and â_b direct lookup tables (with a Ẑ_L input) is 2^(w_ez + f_L) words, assuming reduction rules like (16) and (17), which allow the interval spanned by these tables to be as small as (−2^(w_ez), 0). The trigonometric tables, with Ẑ_θ inputs, require 2^(2 + f_θ) words because Z_θ is reduced to 0 ≤ Ẑ_θ < 4 (equivalent to 0 ≤ Z_θ < π) and two integer bits are sufficient to cover this dynamic range. Instead of the more involved (18) rule used in later sections, c_b(x) = c_b(−x) suffices for direct lookup. Since there are two instances of trigonometric tables and four instances of other tables in Fig. 1, the total number of words required for naïve direct lookup is 2^(f + w_ez + 2) + 2^(f + 3), assuming f = f_L = f_θ. Table 2 shows the number of words required by prior 2D direct lookup methods [3], [19] compared to our proposed 1D direct lookup implementation of (26) and (34). Savings occur for f ≥ 5, and grow exponentially more beneficial as f increases. For f ≥ 8, the cost of direct implementation grows exponentially, although at a slower rate than the prior methods. One approach to reduce table size is to use interpolation on a smaller table.

Fig. 2. Error in S_2(Z) using direct lookup with f_L = f_θ = 7. (a) Real part. (b) Imaginary part.

The cost of a simple linear interpolation unit is related to the size of the region of the interpolation and to the absolute maximum value of the second derivative in that region. Table 3 gives the range and the derivatives for the required functions. The d_b function is not considered, as it may be computed via cotransformation [21]. It is clear that logsin has a singularity similar to d_b, which is the reason we avoid evaluating it directly and, instead, have chosen to use the relationship cos(x − π/2) = sin(x) = √(1 − cos²(x)) in (18). Because of the quantization of ĉ_2, this approximation is imperfect very near the singularity. In Section 7.7, we introduce a novel approach to deal with the singularity of logsin, but for the moment we will use (18). The absolute maximum value (1.78) of the second derivative for logcos in its restricted range is large; however, the fact that the range is restricted to 0 ≤ x ≤ 1 mitigates this. It is well known [12] that the absolute maximum value (0.17) of the second derivative of s_b is modest; however, this table needs to cover a much larger range than logcos. The absolute maximum value (0.15) of the arctangent-exponential function is about the same as s_2, and its range is equally large. These values predict that interpolation from a uniformly spaced table will be the best choice for c_b, but would require significant memory for s_b and a_b. Instead, it is more effective to split the s_b and a_b approximations into subdomains, as explained in the next section. Such subdomains are effective for s_b and a_b (because their absolute second derivatives span many binades, from 0.0 to over 0.1), but are not effective for c_b (because its absolute second derivative only varies one binade, from π²/(16 ln(b)) to π²/(8 ln(b))).
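The link between the quoted second-derivative maxima and table size follows from the standard linear-interpolation error bound, max|f''| · h²/8 for step width h. A small sketch of ours (not from the paper) evaluating that bound:

```java
// Standard linear-interpolation error bound: on an interval of width h,
// |error| <= max|f''| * h^2 / 8. With the second-derivative maxima quoted from
// Table 3, this predicts how finely each function's table must be spaced.
public class InterpError {
    public static double bound(double maxSecondDeriv, double h) {
        return maxSecondDeriv * h * h / 8;
    }
    public static void main(String[] args) {
        double h = 1.0 / 128;                    // 2^-7 spacing, i.e., f = 7
        System.out.println("logcos (1.78): " + bound(1.78, h));
        System.out.println("s2     (0.17): " + bound(0.17, h));
    }
}
```

Halving h quarters the bound, which is why functions whose second derivative varies over many binades benefit from nonuniform subdomains.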

7

OPTIMIZED FUNCTION UNIT SYNTHESIS

The novel CLNS algorithm requires several special function units, which should be optimized to minimize the cost of the units, especially as the precision, f, increases. These units were generated using enhanced versions of existing tools to simplify generating and optimizing synthesizable VHDL for these units.

TABLE 2 Memory for Prior and Novel Direct Methods

7.1 FPLibrary

FPLibrary is an implementation of floating-point and LNS operators for FPGAs developed by Detrey and De Dinechin [8]. It provides addition, multiplication, division, and square-root operators for both systems, with a user-defined precision scaling up to IEEE single precision. LNS addition and subtraction are implemented using multipartite tables, generated in VHDL for every combination of parameters. It was later extended by Vouzis et al. to perform the LNS subtraction using cotransformation [21].

7.2 HOTBM

A method for function evaluation using tables and polynomial interpolation, called Higher-Order Table-Based Methods (HOTBM), was developed by Detrey and De Dinechin [9]. The input domain of the function to be evaluated is split into regular intervals, and the function is approximated by a minimax polynomial inside each interval. Polynomials are evaluated as a sum of terms, each term being computed either by a powering unit and a multiplier or by a table. The size of each component (table, multiplier, and powering unit) is tuned to balance the rounding error and avoid unnecessary computation. As an implementation of this method, the authors also provide a tool written in C++ which generates hardware function evaluators in synthesizable VHDL. An exhaustive search of the parameter space is performed during generation to estimate the most area- and delay-efficient design.

7.3 FloPoCo

FloPoCo, for Floating-Point Cores, is intended to become a superset of both FPLibrary and HOTBM. Like HOTBM, it consists of a C++ program generating VHDL code [10]. In addition to the usual floating-point and fixed-point arithmetic operators, it can generate more exotic operators, such as long accumulators [11] or constant multipliers [6]. It aims at taking advantage of FPGA flexibility in order to increase efficiency and accuracy compared to a naïve translation of a software algorithm to an FPGA circuit.
Although it includes an implementation of HOTBM for fixed-point function evaluation and a whole range of floating-point operators, the LNS part of FPLibrary was not ported to FloPoCo by its developers.
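The flavor of HOTBM's piecewise evaluation can be conveyed by a greatly simplified software sketch of ours: a regular split of the domain with one degree-1 polynomial per interval (interpolating the interval endpoints rather than computing a true minimax fit, and with none of HOTBM's per-term hardware tuning):

```java
// Greatly simplified, software-only sketch of table-based piecewise evaluation:
// split [lo, hi) into 2^k regular intervals and store degree-1 coefficients
// (a_i, b_i) per interval; eval() is one table access, one multiply, one add.
import java.util.function.DoubleUnaryOperator;

public class PiecewiseEval {
    final double lo, hi, step;
    final double[] a, b;                            // per-interval a + b * t
    public PiecewiseEval(DoubleUnaryOperator f, double lo, double hi, int k) {
        int n = 1 << k;
        this.lo = lo; this.hi = hi; this.step = (hi - lo) / n;
        a = new double[n]; b = new double[n];
        for (int i = 0; i < n; i++) {
            double x0 = lo + i * step, x1 = x0 + step;
            a[i] = f.applyAsDouble(x0);             // endpoint fit, not minimax
            b[i] = (f.applyAsDouble(x1) - f.applyAsDouble(x0)) / step;
        }
    }
    public double eval(double x) {
        int i = Math.min(a.length - 1, (int) ((x - lo) / step));
        return a[i] + b[i] * (x - (lo + i * step));
    }
    public static void main(String[] args) {
        DoubleUnaryOperator s2 = x -> Math.log(1 + Math.pow(2, x)) / Math.log(2);
        PiecewiseEval p = new PiecewiseEval(s2, -8, 0, 8);   // 256 intervals
        System.out.println(Math.abs(p.eval(-1.3) - s2.applyAsDouble(-1.3)));
    }
}
```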

ARNOLD AND COLLANGE: A REAL/COMPLEX LOGARITHMIC NUMBER SYSTEM ALU


TABLE 3 Table of Functions and Derivatives

7.4 A CLNS Coprocessor

We extended the FloPoCo library with real LNS arithmetic. To perform the LNS addition, the input domain of s_b is split into three subdomains: [−2^(w_ez), −8), [−8, −4), and [−4, 0), where w_ez is such that s_b evaluates to zero in (−∞, −2^(w_ez)). This is done to account for the exponential nature of s_b(z) for smaller values of z. The same partitioning was used in FPLibrary. We then use an HOTBM operator to evaluate s_b inside each interval. This partitioning scheme allows significant area improvements compared to the regular partitioning performed by HOTBM. Likewise, the domain of d_b is split into intervals. The lower range [−2^(w_ez), −t) is evaluated using several HOTBM operators, just like s_b. As we get closer to the singularity of d_b near 0, the function becomes harder to evaluate using polynomial approximations. The upper range [−t, 0) is therefore evaluated using cotransformation. This approach is similar to the one used by Vouzis et al. in FPLibrary [21]. We used this extended FloPoCo library to generate the VHDL code for the functional units in a CLNS ALU at various precisions. One component was generated for each function involved in the CLNS addition. The real LNS functions s_b and d_b are generated by the extended FloPoCo. The function logcos(π/4 · x) is implemented as an HOTBM component, and (4/π) arctan(2^x) is evaluated using several HOTBM components with the same input-domain decomposition as s_b. Given the complexity of HOTBM and the influence of synthesizer optimizations, it is difficult to estimate precisely the best values of the d_b threshold t, the cotransformation parameter j, and the HOTBM order using simple formulas. We synthesized the individual units for a Virtex-4 LX25 with Xilinx ISE 10.1 using various parameters. For each precision tested, the combination of parameters yielding the smallest area on this configuration was determined. Table 4 summarizes the parameters obtained. We plan to automate this search in the future using heuristics that rely on the latency/area estimation framework integrated in FloPoCo.

TABLE 4 Parameters Used for Synthesis

7.5 Time/Area Trade-Off

The synthesis results are listed in detail in Tables 5 and 6, where α_s, α_{s/d}, α_c, and α_a are the areas of the respective units, and τ_s, τ_{s/d}, τ_c, and τ_a are the corresponding delays. One implementation option would be to follow the combinational architecture in Fig. 1, in which case the area is roughly α_s + 2α_{s/d} + α_c + α_a, because one d_b together with a_b implements both logcos and logsin. (We neglect the insignificant area for four fixed-point adders, a mux, and the control logic.) The delay is 2τ_{s/d} + τ_c + max(τ_a, τ_s). An alternative, which saves area (α_{s/d} + α_c + α_a), uses three cycles to complete the computation (so that the same hybrid unit may be reused for all three s_b/d_b usages). The total area is less than double the area α_{s/d} of a stand-alone real LNS unit. The delay is roughly 3τ_{s/d}, since the hybrid unit has the longest delay. Despite using only about half the area of the combinational alternative, the three-cycle approach produces the result in about the same time as the combinational option. It is hard to make direct comparisons, but the CORDIC implementation of CLNS reported in [13] uses 10 stages. Some of those stages have relatively large multipliers, similar to what is generated for our circuit by FloPoCo. If such CORDIC stages have a delay longer than 0.3 of our proposed clock cycle, our approach would be as fast or faster than [13].

7.6 Accuracy

Fig. 3a shows the errors in computing the real part (26) as a function of Z_L and Z_θ using the f = 7 function units generated by FloPoCo. These errors have an RMS value around 0.007. This is larger than the errors when computing the direct lookup with the round-to-nearest method shown in Fig. 2a, but overall the errors appear uniform except, as before, near the singularity. As such, this may be acceptable for some applications. In contrast, Fig. 3b shows errors for the imaginary part (34), which are significantly larger than the errors for direct lookup round-to-nearest shown in Fig. 2b. These errors have an RMS value around 0.008. To determine the cause of the much larger error, simulations were run in which each of the function units generated by FloPoCo was replaced with round-to-nearest units. When both logsin and logcos are computed with round-to-nearest, the imaginary result has smaller errors, which are much closer to those of Fig. 2b. Using sin(x) = √(1 − cos²(x)) contributes the extra errors, which are then magnified by the slightly larger roundoff errors from all the other FloPoCo units.

7.7 Novel Sine Approximation

The source of the additional error visible in Fig. 3b is the quantization of logcos as ĉ_b in (18). Extra guard bits could


TABLE 5 Area of Function Approximation Units (Slices) on Xilinx Virtex-4

TABLE 6 Latency of Function Approximation Units (ns) on Xilinx Virtex-4

Fig. 3. Error in S_2(Z) using FloPoCo with Pythagorean (18) and f_L = f_θ = 7. (a) Real part. (b) Imaginary part.

help, but this, in turn, would require extra guard bits in all the approximation units, including the very costly one for d̂_b. We consider the cost of such a solution to be prohibitive. Although inexpensive methods [1] for dealing with logsin(2^x) have been proposed, they are not applicable here. Instead, we need an alternative way to compute logsin(x) for x near 0. The novel approach, which we propose, is based on the simple observation that, for small x,

sin(x) ≈ x.    (19)

The values of x that are considered small enough to be approximated this way will depend on the precision, f. The CLNS hardware uses a quantized angle, which means (19) can be restated as:

logsin(π/4 · x) ≈ log_b(x) + log_b(π/4).    (20)

We wish to implement this without adding any additional tables, delay, or complexity to the hardware (aside from a tiny bit of extra logic), which means we rule out evaluating log_b(x) directly. Instead, we use a novel logarithm approximation [4] that only needs the d_b hardware plus a few trivial adders, and which is moderately accurate in cases like this, where x is known to be near zero:

log_b(x) ≈ d_b(−x) − log_b(ln(b)) + x/2.    (21)

The error is on the order of x²/24. Substituting (21) into (20), we have

logsin(π/4 · x) ≈ d_b(−x) + x/2 + (log_b(π/4) − log_b(ln(b))).    (22)

The binary value of log_2(π/4) − log_2(ln(2)) = 0.0010111000100₂… suggests that 1/8, 3/16, or 23/128 are low-cost approximations for this constant. Fig. 4 shows the error of approximating logsin with the Pythagorean approach, √(1 − cos²(π/4 · x)) (implemented as d_2(2 · c_2(x))/2), versus the novel approach (22) for f = 7. For values of x near zero, the novel approach is consistently better than the Pythagorean approach. Because of quantization, both plots are noisy, which makes them cross each other several times in the middle. For f = 7, x = 0.375 is a reasonable choice for the point at which the approximation switches from the novel approach to the Pythagorean one.
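The behavior in Fig. 4 can be reproduced qualitatively with a small model of ours (not the paper's code) that quantizes the ĉ_2 output to f = 7 fractional bits: very near x = 0, the quantized logcos rounds to zero and the Pythagorean d_2(2 · ĉ_2(x))/2 diverges, while the novel (22) stays close:

```java
// Compares the two logsin(pi/4 * x) evaluations near the x = 0 singularity,
// with the logcos and d2 table outputs quantized to 2^-7 (f = 7). Near zero the
// quantized logcos collapses to 0, so the Pythagorean path d2(0) blows up.
public class NovelLogsin {
    static double log2(double v) { return Math.log(v) / Math.log(2); }
    static double q(double v) { return Math.round(v * 128) / 128.0; }   // f = 7
    static double d2(double z) { return log2(Math.abs(1 - Math.pow(2, z))); }
    static double c2q(double x) { return q(log2(Math.cos(Math.PI / 4 * x))); }
    public static double pythagorean(double x) { return d2(2 * c2q(x)) / 2; }
    public static double novel(double x) {                              // eq. (22)
        double C = log2(Math.PI / 4) - log2(Math.log(2));
        return q(d2(-x)) + x / 2 + C;
    }
    public static double exact(double x) { return log2(Math.sin(Math.PI / 4 * x)); }
    public static void main(String[] args) {
        for (double x : new double[] { 0.05, 0.2, 0.375 })
            System.out.printf("x=%.3f  novel err=%.2e  pyth err=%.2e%n",
                x, novel(x) - exact(x), pythagorean(x) - exact(x));
    }
}
```

At x = 0.05 the Pythagorean path is unusable (its quantized logcos is exactly zero), while by x = 0.375 the two approaches are comparable, consistent with the switchover point chosen above.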


Fig. 4. Errors for Pythagorean and novel logsin(π/4 · x) for 0 < x ≤ 1.

Replacing these cases in (18), we have

c_2(x) =
  { c_2(−x),                                               x < 0;
    logcos(π/4 · x),                                       0 ≤ x < 1;
    d_2(2 · c_2(2 − x))/2,                                 1 ≤ x < 1.625;
    d_2(x − 2) + (2 − x)/2 + (log_2(π/4) − log_2(ln(2))),  1.625 ≤ x < 2;
    −2^(w_ez + 1),                                         x = 2;
    d_2(2 − x) + (x − 2)/2 + (log_2(π/4) − log_2(ln(2))),  2 < x < 2.375;
    d_2(2 · c_2(x − 2))/2,                                 2.375 ≤ x < 3;
    c_2(4 − x),                                            3 ≤ x < 4;
    c_2(x − 8),                                            x ≥ 4.                    (23)

The same function approximation units and pipeline schedule described earlier can implement (23) as easily as (18), with only two additional adders (and possibly a register to make x available at the proper time). Fig. 5a shows the error for the real part of the approximate S_2(Z) using (23) for range reduction and otherwise using f = 7 FloPoCo-generated approximation units. Fig. 5b shows the errors for the corresponding imaginary part. In both figures, especially for the imaginary part, the errors are much closer to the amount seen in round-to-nearest. The RMS errors (0.006 for the real part and 0.005 for the imaginary part) are in between those for round-to-nearest and those for Pythagorean FloPoCo.

8

OBJECT-ORIENTED CLNS SIMULATION

In order to test the proposed techniques in a complex-number application, like the FFT, we would like to substitute different arithmetic implementations without significant change to the application code. The polymorphism of an object-oriented language like Java makes such experimentation easy. Although operator overloading in languages like C++ would make this slightly nicer, Java's polymorphic method-call syntax makes it fairly easy to describe the complex FFT butterfly succinctly:

t = w[j].mul(x[m + i]);
x[m + i] = x[m].sub(t);
x[m] = x[m].add(t);

where the variables in the application are of an abstract class, which defines four abstract methods (PLUSI(), inc(), recip(), and mul(CmplxAbs y)) that return references to such a CmplxAbs object and two abstract methods (real() and imag()) that return doubles. In addition, CmplxAbs defines several concrete methods that can be used in the application:

public CmplxAbs MINUS1(){
  return(PLUSI().mul(PLUSI()));}
public CmplxAbs div(CmplxAbs y){
  return(this.mul(y.recip()));}
public CmplxAbs add(CmplxAbs y){
  return(this.mul((y.div(this)).inc()));}
public CmplxAbs neg(){
  return(this.mul(MINUS1()));}
public CmplxAbs sub(CmplxAbs y){
  return(this.add(y.neg()));}

Surprisingly, this is sufficient to define both polar and rectangular complex arithmetic implementations. Only three arithmetic methods (mul, recip, and inc) need to be provided for each derived class. Also, the derived class needs to provide three utility methods (the accessors for the constant i and the rectangular parts of this). Although best suited for CLNS implementations, this abstract class also works with a rectangular class using two floating-point instance variables manipulated by (1) and (2) together with the rectangular definitions of reciprocal and incrementation. This rather unusual factoring of complex arithmetic has the advantage that application code can be tested with the rectangular implementation to isolate whether there are any CLNS-related bugs in the application or in the class definition itself. Two CLNS classes derived from the abstract class store Ẑ_L and Ẑ_θ as integers. The first CLNS class implements ideal, round-to-nearest 2D lookup of

Fig. 5. Error in S_2(Z) using FloPoCo with novel (23) and f_L = f_θ = 7. (a) Real part. (b) Imaginary part.

S_b(Z) as in the prior literature, as given by (11) and (12). The second class, which is derived from the first CLNS class, implements the proposed approach, (26) and (34), using tables that can simulate any of the design alternatives discussed earlier. The first CLNS class defines all six methods; the second CLNS class inherits everything except its inc method. As the CLNS formulas are involved and prone to implementation error, this object-oriented approach reduces duplication of untested formulas. In other words, we test add, sub, etc., using simple rectangular definitions; we test CLNS methods with the more straightforward (11) and (12). Only when we are certain these are correct do we test with the novel inc method in the grandchild class. Having the two CLNS classes allows us to reuse the same application code to see what effect the proposed ALU will have compared to ideal CLNS arithmetic. In this case, the application is a 64-point radix-two FFT whose input is a real-valued 25-percent-duty-cycle square wave plus complex white noise. This is rerun 100 times with different pseudorandom noise. On each run, several FFTs are computed using the same data: nearly exact rectangular double-precision arithmetic, ideal (2D table lookup) f = 7 CLNS, and variations (direct, FloPoCo-Pythagorean, FloPoCo-novel-sine) of the proposed f = 7 CLNS. The CLNS results from each run are compared against the double-precision rectangular FFT on the same data. The ideal CLNS FFT has an RMS error of 0.00025 and a maximum error of 0.0033. The proposed direct lookup approach has about 50 percent higher RMS error (about 0.00035) and similar maximum error (0.0029). The FloPoCo-Pythagorean approach did much worse (RMS error of 0.00073 and maximum error of 0.01). The FloPoCo-novel-sine approach is better (RMS error of 0.00056 and maximum error of 0.0064).
In other words, the best FloPoCo implementation loses about one bit of accuracy compared to the best possible (but unaffordable) f = 7 CLNS implementation, and the FloPoCo-Pythagorean implementation loses more. To put these errors in perspective, the quantization step for Ẑ_L is 2^(−7) ≈ 0.0078, and on that scale perhaps even the errors observed in the FloPoCo-Pythagorean approach may be acceptable for some applications.
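A minimal rectangular subclass illustrates the factoring described above. The abstract base is transcribed from the text; the class and field names of the concrete rectangular implementation are our own, since the paper does not list its source:

```java
// CmplxAbs transcribed from the text: add/sub/div/neg are concrete in the base,
// so a derived class only supplies mul, recip, inc, PLUSI, real, and imag.
abstract class CmplxAbs {
    public abstract CmplxAbs PLUSI();             // the constant i
    public abstract CmplxAbs inc();               // this + 1
    public abstract CmplxAbs recip();             // 1 / this
    public abstract CmplxAbs mul(CmplxAbs y);
    public abstract double real();
    public abstract double imag();
    public CmplxAbs MINUS1() { return PLUSI().mul(PLUSI()); }       // i*i = -1
    public CmplxAbs div(CmplxAbs y) { return this.mul(y.recip()); }
    public CmplxAbs add(CmplxAbs y) { return this.mul(y.div(this).inc()); }
    public CmplxAbs neg() { return this.mul(MINUS1()); }
    public CmplxAbs sub(CmplxAbs y) { return this.add(y.neg()); }
}

// Our own hypothetical rectangular implementation (names not from the paper).
class CmplxRect extends CmplxAbs {
    final double re, im;
    CmplxRect(double re, double im) { this.re = re; this.im = im; }
    public CmplxAbs PLUSI() { return new CmplxRect(0, 1); }
    public CmplxAbs inc() { return new CmplxRect(re + 1, im); }
    public CmplxAbs recip() {
        double d = re * re + im * im;
        return new CmplxRect(re / d, -im / d);
    }
    public CmplxAbs mul(CmplxAbs y) {
        return new CmplxRect(re * y.real() - im * y.imag(),
                             re * y.imag() + im * y.real());
    }
    public double real() { return re; }
    public double imag() { return im; }
}

public class CmplxDemo {
    public static void main(String[] args) {
        CmplxAbs s = new CmplxRect(1, 2).add(new CmplxRect(3, -1));
        // Addition routed through mul, recip, and inc only (fails for this = 0,
        // a caveat shared with the factoring in the text).
        System.out.printf("%.3f %.3f%n", s.real(), s.imag());
    }
}
```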


The novel logsin approximation near zero uses the same hardware as the Pythagorean-only approach used in our earlier research [5]. We compared these implementations of our novel CLNS algorithm to all known prior approaches. Our novel method has speed comparable to CORDIC, and uses orders-of-magnitude less area than prior 2D table lookup approaches. As such, our approach provides a good compromise between speed and area. Considering that CLNS offers smaller bus widths than the conventional rectangular representation of complex numbers, our proposed algorithm makes the use of this rather unusual number system in practical algorithms, like the FFTs in OFDM, more feasible. Using an object-oriented simulation, we have observed the additional errors introduced by our proposed methods in an FFT simulation. With our most accurate direct lookup approach, this is on the order of half of a bit. With the FloPoCo-novel-sine method, the additional error is about one bit. Given that CLNS offers many bit savings compared to rectangular arithmetic, this level of additional error seems a reasonable trade-off in exchange for the huge memory savings our proposed CLNS ALU offers. In the course of implementing the CLNS ALU, we uncovered and corrected several bugs in the HOTBM implementation of FloPoCo. We also made it easier to use by allowing the user to select the input range and scale of the target function, instead of having to map its ranges manually to [0, 1) → [−1, 1). Hence, our work contributes to the maturity of the FloPoCo tool beyond the field of LNS.

APPENDIX

In Case 1 (−π/2 < Z_θ < π/2), the real part of (14) can be described as (24). There is a similar derivation

ℜ(S_b(Z)) =
  { logsin(Z_θ) + Z_L + s_b(2(s_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))/2,
        −π/2 < Z_θ < 0 or 0 < Z_θ < π/2;
    s_b(logcos(Z_θ) + Z_L),        Z_θ = 0;
    logsin(Z_θ) + Z_L + s_b(2(d_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))/2,
        −π < Z_θ < −π/2 or Z_θ > π/2;
    d_b(logcos(Z_θ) + Z_L),        Z_θ = π.                    (26)

from (15) for Case 2 (Z_θ < −π/2 or Z_θ > π/2), shown as (25). The subexpression (logsin(Z_θ) + Z_L) in (24) and (25) becomes −∞ when Z_θ = 0 or Z_θ = π. The former case, in fact, is trivially the real s_b(Z_L), and the latter the real d_b(Z_L), as reflected in (26). The four-quadrant arctangent (27) needed for the imaginary part is conventionally defined recursively, with the special values −π/2 for y < 0, x = 0, and π/2 for y > 0, x = 0.

Eliminating the recursion in (27), and noting that the value passed to the one-argument arctangent is now always nonnegative, we have

  { −π + arctan(|y|/|x|),   y < 0, x < 0;
    −arctan(|y|/|x|),       y < 0, x > 0;
    π − arctan(|y|/|x|),    y ≥ 0, x < 0;
    arctan(|y|/|x|),        y ≥ 0, x > 0;
    −π/2,                   y < 0, x = 0;
    π/2,                    y > 0, x = 0.                    (28)
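The non-recursive cases (28) can be transcribed directly and checked against a built-in four-quadrant arctangent, a sanity check of ours that is valid away from y = x = 0:

```java
// Direct transcription of the non-recursive four-quadrant cases (28); checking
// it against Java's Math.atan2 confirms the case analysis is exhaustive
// whenever x != 0 or y != 0.
public class FourQuadrant {
    public static double atan4q(double y, double x) {
        double t = Math.atan(Math.abs(y) / Math.abs(x)); // one-argument arctan
        if (y < 0 && x < 0) return -Math.PI + t;
        if (y < 0 && x > 0) return -t;
        if (y >= 0 && x < 0) return Math.PI - t;
        if (y >= 0 && x > 0) return t;
        return (y < 0) ? -Math.PI / 2 : Math.PI / 2;     // x == 0 cases
    }
    public static void main(String[] args) {
        System.out.println(atan4q(1, -1) + " vs " + Math.atan2(1, -1));
    }
}
```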


The result produced for |sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)| in LNS depends on whether s_b or d_b is required to produce 1 + cos(Z_θ)b^(Z_L), which again depends on the sign of the cosine (either positive, with −π/2 < Z_θ < π/2, or negative, with Z_θ < −π/2 or Z_θ > π/2). It is obvious for the range of Z_θ involved that y = sin(Z_θ)b^(Z_L) < 0 when Z_θ < 0. The condition when x = 1 + cos(Z_θ)b^(Z_L) < 0 is slightly more complicated, involving cos(Z_θ)b^(Z_L) < −1, a condition that can only happen when sgncos(Z_θ) < 0 (i.e., Z_θ < −π/2 or Z_θ > π/2) and |cos(Z_θ)b^(Z_L)| > 1. In LNS, the latter condition is equivalent to logcos(Z_θ) + Z_L > 0. The opposite condition, x > 0, happens when the cosine is positive (−π/2 < Z_θ < π/2), or when cos(Z_θ)b^(Z_L) > −1. In LNS, the latter condition is equivalent to logcos(Z_θ) + Z_L < 0. Combining these conditions with (14), (15), and (28), we have (29). Simplifying the conditions, we get (30). Note that the intermediate expression in (31) is the negative of that used for the real part in the CLNS ALU derived above. In two cases, we need to expand the arctangent derivation into s_b and d_b subcases so that we can substitute the intermediate expression into the arctangent. Again, eliminating impossible conditions yields (33). As in (26), there is a similar problem with the singularity of logsin(Z_θ) in the case of Z_θ = 0 or Z_θ = π. Again, this is trivial, since ℑ[S_b(Z)] = 0 or ℑ[S_b(Z)] = π. Taking these into account, we can substitute the LNS computation of the quotient into the arctangent to form (34).

ℑ(S_b(Z)) =
  { −π + arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ < 0 and (Z_θ < −π/2 or Z_θ > π/2) and logcos(Z_θ) + Z_L > 0;
    −arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ < 0 and ((−π/2 < Z_θ < π/2) or logcos(Z_θ) + Z_L < 0);
    π − arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≥ 0 and (Z_θ < −π/2 or Z_θ > π/2) and logcos(Z_θ) + Z_L > 0;
    arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≥ 0 and ((−π/2 < Z_θ < π/2) or logcos(Z_θ) + Z_L < 0);
    −π/2,  Z_θ < 0 and (Z_θ < −π/2 or Z_θ > π/2) and logcos(Z_θ) + Z_L = 0;
    π/2,   Z_θ ≥ 0 and (Z_θ < −π/2 or Z_θ > π/2) and logcos(Z_θ) + Z_L = 0.    (29)

ℑ(S_b(Z)) =
  { −π + arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ < −π/2 and logcos(Z_θ) + Z_L > 0;
    −arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ < 0 and (Z_θ > −π/2 or logcos(Z_θ) + Z_L < 0);
    π − arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ > π/2 and logcos(Z_θ) + Z_L > 0;
    arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≥ 0 and (Z_θ < π/2 or logcos(Z_θ) + Z_L < 0);
    −π/2,  Z_θ < −π/2 and logcos(Z_θ) + Z_L = 0;
    π/2,   Z_θ > π/2 and logcos(Z_θ) + Z_L = 0.    (30)

log_b(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|) =
  { −(s_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)),   −π/2 < Z_θ < π/2;
    −(d_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)),   Z_θ < −π/2 or Z_θ > π/2.    (31)
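The identity (31), and the s_b branch of the real part that mirrors it, can be checked numerically in double precision. This is a sketch of ours; b = 2 and the test point are arbitrary, chosen so that −π/2 < Z_θ < π/2:

```java
// Numerical check of the appendix algebra for b = 2: the LNS quotient (31) and
// the corresponding s_b branch of the real part of S_b(Z), at a point where
// 1 + cos(Ztheta) * 2^ZL > 0 so the s_b (not d_b) subcase applies.
public class AppendixCheck {
    static double log2(double v) { return Math.log(v) / Math.log(2); }
    static double s2(double z) { return log2(1 + Math.pow(2, z)); }
    public static double quotient31(double zl, double zt) {
        double u = s2(log2(Math.cos(zt)) + zl);      // log2|1 + cos * 2^zl|
        double v = log2(Math.sin(zt)) + zl;          // log2|sin * 2^zl|
        return -(u - v);
    }
    public static double realPart(double zl, double zt) {
        double u = s2(log2(Math.cos(zt)) + zl);
        double v = log2(Math.sin(zt)) + zl;
        return v + s2(2 * (u - v)) / 2;              // = log2|1 + 2^zl e^(j zt)|
    }
    public static void main(String[] args) {
        double zl = 0.5, zt = 1.0;
        double x = 1 + Math.cos(zt) * Math.pow(2, zl), y = Math.sin(zt) * Math.pow(2, zl);
        System.out.println(quotient31(zl, zt) + " vs " + log2(y / x));
        System.out.println(realPart(zl, zt) + " vs " + log2(Math.hypot(x, y)));
    }
}
```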

ℑ(S_b(Z)) =
  { −π + arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ < −π/2 and logcos(Z_θ) + Z_L > 0;
    −arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≤ −π/2 and (Z_θ > −π/2 or logcos(Z_θ) + Z_L < 0);
    −arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        −π/2 < Z_θ < 0 and (Z_θ > −π/2 or logcos(Z_θ) + Z_L < 0);
    π − arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ > π/2 and logcos(Z_θ) + Z_L > 0;
    arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≥ π/2 and (Z_θ < π/2 or logcos(Z_θ) + Z_L < 0);
    arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        0 < Z_θ ≤ π/2;
    −π/2,  Z_θ < −π/2 and logcos(Z_θ) + Z_L = 0;
    π/2,   Z_θ > π/2 and logcos(Z_θ) + Z_L = 0.    (32)

ℑ(S_b(Z)) =
  { −π + arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ < −π/2 and logcos(Z_θ) + Z_L > 0;
    −arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≤ −π/2 and logcos(Z_θ) + Z_L < 0;
    −arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        −π/2 < Z_θ < 0;
    π − arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ > π/2 and logcos(Z_θ) + Z_L > 0;
    arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        Z_θ ≥ π/2 and logcos(Z_θ) + Z_L < 0;
    arctan(|sin(Z_θ)b^(Z_L)| / |1 + cos(Z_θ)b^(Z_L)|),
        0 < Z_θ ≤ π/2;
    −π/2,  Z_θ < −π/2 and logcos(Z_θ) + Z_L = 0;
    π/2,   Z_θ > π/2 and logcos(Z_θ) + Z_L = 0.    (33)

ℑ(S_b(Z)) =
  { 0,    Z_θ = π and logcos(Z_θ) + Z_L < 0;
    π,    Z_θ = π and logcos(Z_θ) + Z_L > 0;
    −π + arctan(b^(−(d_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))),
        −π < Z_θ ≤ −π/2 and logcos(Z_θ) + Z_L > 0;
    −arctan(b^(−(d_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))),
        Z_θ ≤ −π/2 and logcos(Z_θ) + Z_L < 0;
    −arctan(b^(−(s_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))),
        −π/2 < Z_θ < 0;
    0,    Z_θ = 0;
    π − arctan(b^(−(d_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))),
        Z_θ > π/2 and logcos(Z_θ) + Z_L > 0;
    arctan(b^(−(d_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))),
        Z_θ ≥ π/2 and logcos(Z_θ) + Z_L < 0;
    arctan(b^(−(s_b(logcos(Z_θ) + Z_L) − (logsin(Z_θ) + Z_L)))),
        0 < Z_θ ≤ π/2;
    −π/2,  Z_θ < −π/2 and logcos(Z_θ) + Z_L = 0;
    π/2,   Z_θ > π/2 and logcos(Z_θ) + Z_L = 0.    (34)

REFERENCES

[1] M.G. Arnold, "Approximating Trigonometric Functions with the Laws of Sines and Cosines Using the Logarithmic Number System," Proc. Eighth EuroMicro Conf. Digital System Design, pp. 48-55, 2005.
[2] M.G. Arnold, T.A. Bailey, J.R. Cowles, and C. Walter, "Analysis of Complex LNS FFTs," Signal Processing Systems SIPS 2001: Design and Implementation, F. Catthoor and M. Moonen, eds., pp. 58-69, IEEE Press, 2001.
[3] M.G. Arnold, T.A. Bailey, J.R. Cowles, and C. Walter, "Fast Fourier Transforms Using the Complex Logarithm Number System," J. VLSI Signal Processing, vol. 33, no. 3, pp. 325-335, 2003.
[4] M.G. Arnold, T.A. Bailey, J.R. Cowles, and M.D. Winkel, "Arithmetic Co-Transformations in the Real and Complex Logarithmic Number Systems," IEEE Trans. Computers, vol. 47, no. 7, pp. 777-786, July 1998.
[5] M.G. Arnold and S. Collange, "A Dual-Purpose Real/Complex Logarithmic Number System ALU," Proc. 19th IEEE Symp. Computer Arithmetic, pp. 15-24, June 2009.
[6] N. Brisebarre, F. de Dinechin, and J.-M. Muller, "Integer and Floating-Point Constant Multipliers for FPGAs," Proc. Int'l Conf. Application-Specific Systems, Architectures and Processors, pp. 239-244, 2008.
[7] N. Burgess, "Scaled and Unscaled Residue Number System to Binary Conversion Techniques Using the Residue Number System," Proc. 13th Symp. Computer Arithmetic (ARITH '97), pp. 250-257, Aug. 1997.
[8] J. Detrey and F. de Dinechin, "A Tool for Unbiased Comparison between Logarithmic and Floating-Point Arithmetic," J. VLSI Signal Processing, vol. 49, no. 1, pp. 161-175, 2007.
[9] J. Detrey and F. de Dinechin, "Table-Based Polynomials for Fast Hardware Function Evaluation," Proc. IEEE Int'l Conf. Application-Specific Systems, Architecture and Processors, pp. 328-333, July 2005.
[10] F. de Dinechin, C. Klein, and B. Pasca, "Generating High-Performance Custom Floating-Point Pipelines," Proc. Int'l Conf. Field-Programmable Logic, Aug. 2009.
[11] F. de Dinechin, B. Pasca, O. Creț, and R. Tudoran, "An FPGA-Specific Approach to Floating-Point Accumulation and Sum-of-Products," Field-Programmable Technology, pp. 33-40, IEEE Press, 2008.
[12] D.M. Lewis, "An Architecture for Addition and Subtraction of Long Word Length Numbers in the Logarithmic Number System," IEEE Trans. Computers, vol. 39, no. 11, pp. 1325-1336, Nov. 1990.
[13] D.M. Lewis, "Complex Logarithmic Number System Arithmetic Using High Radix Redundant CORDIC Algorithms," Proc. 14th IEEE Symp. Computer Arithmetic, pp. 194-203, Apr. 1999.




Mark G. Arnold received the BS and MS degrees from the University of Wyoming, and the PhD degree from the University of Manchester Institute of Science and Technology (UMIST), United Kingdom. From 1982 to 2000, he was on the faculty of the University of Wyoming, Laramie. From 2000 to 2002, he was a lecturer at UMIST, United Kingdom. In 2002, he joined the faculty of Lehigh University, Bethlehem, Pennsylvania. In 1976, he codeveloped SCELBAL, the first open-source floating-point high-level language for personal computers. In 1997, he received the Best Paper Award from Open Verilog International for describing the Verilog Implicit To One-hot (VITO) tool that he codeveloped. In 2007, he received the Best Paper Award from the Application-Specific Systems, Architectures and Processors (ASAP) Conference. He is the author of Verilog Digital Computer Design. His current research interests include computer arithmetic, hardware description languages, microrobotics and embedded control, and multimedia and application-specific systems. He is a member of the IEEE.

Sylvain Collange received the master's degree in computer science from the École Normale Supérieure de Lyon, France, in 2007. He is currently in the final year of a PhD program at the University of Perpignan, France. In 2006, he worked as a research intern at Lehigh University in Pennsylvania. He joined NVIDIA in Santa Clara, California, for an internship in 2010. His current research focuses on parallel computer architectures. His other interests include computer arithmetic and general-purpose computing on graphics processing units.

. For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
