A reconfigurable fully parallel associative processor


JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 6, 69-89 (1989)

A Reconfigurable Fully Parallel Associative Processor*

ISAAC D. SCHERSON† AND SENER ILGEN

Department of Electrical and Computer Engineering, University of California, Santa Barbara, California 93106

Received November 20, 1986

An associative parallel processing system using a novel VLSI associative memory is described. By optimizing the VLSI associative memory architecture, the restrictions that are generally ascribed to classical associative memory systems no longer exist. High performance is achieved without sacrificing memory density due to enhancements in the peripheral circuitry rather than in the storage cell. Using the chip as a basic building block, a dynamically reconfigurable associative processor is proposed. A shifter interconnection network provides the interchip communication required to emulate different networks with a constant routing time for processor data transfers. It also allows for fast communication with a host computer. Parallel arithmetic algorithms are given, together with the application of the Associative Processor to diverse areas such as computer graphics, image processing, and list processing. Results are summarized to show the power of the proposed associative machine, which proves to be a general-purpose parallel processor with a broad range of applications. © 1989 Academic Press, Inc.

* This work was supported in part by the National Science Foundation under Grants ECS-8404627 and MIP86-03319.
† Present affiliation: Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.

0743-7315/89 $3.00 Copyright © 1989 by Academic Press, Inc. All rights of reproduction in any form reserved.

I. INTRODUCTION

The inherent limitations of conventional computer architectures in solving computationally massive problems led to the development of several approaches to overcome speed constraints. Parallel processing emerged as one of these techniques, and the cost-effective implementation and application of parallel computers became an important field of study. Parallel architectures were proposed with a few, but very powerful, processing elements (PEs) as well as systems with large numbers of simple processors. Microprocessors are widely used in parallel machines, due mostly to their low cost and


availability; however, other techniques exist that may provide a better solution in some cases. This paper deals with the implementation of associative memory and its incorporation in associative processors, which constitute a different approach to parallel computing. Special attention is given to the implementation of a VLSI associative memory chip suitable for a reconfigurable fully parallel associative processing architecture. Using a novel interconnection network, linear or mesh-connected computers can be emulated with constant diameters (routing steps performed in a constant number of processor cycles).

The first associative memory was introduced by Slade and McMahon in 1956 [14]. Data are retrieved by matching part, or all, of their contents to a known pattern. In the classical approach to associative memory, comparison takes place at each storage element and hence is performed simultaneously for all queried cells. This property can be exploited in algorithms for the parallel execution of arithmetic and logical operations among data items stored in the memory. Algorithms were developed in the early 1960s [4, 5], and an excellent account is given by Foster [6]. Fast searching and parallel arithmetic made associative memory an attractive device for which a wide variety of applications have been reported in the literature [11, 16]. However, even with the fast progress of VLSI technology, little effort has been invested in developing suitable components for the implementation of sizable associative memory systems. Present RAM circuits have reached a density of 256K bits per chip, while only 16-bit associative memory chips are currently commercially available (INTEL 3104, SIGNETICS 10155). Attempts have been made to overcome this limitation by suggesting associative memory cells suitable for large-scale integration [9, 10]. Unfortunately, none of these developed into a commercially available device.
A number of machine architectures were suggested for emulation of the associative storage [11, 16]. The most successful attempt is found in STARAN [1], the only associative computer commercially available. It processes data in a serial-by-bit, parallel-by-word fashion using a linear array of 1-bit processing elements attached to a multidimensional access memory through a scrambling/unscrambling network. MPP, a 128 × 128-node mesh-connected massively parallel processor, also emulates associative operations using its 1-bit processing elements [2, 3]. This emulation is bit-serial in all operations, which is inferior to a fully parallel associative memory that would compare, read, or write all selected bits of a word in a single memory cycle.

In this paper we present a novel and practical architectural approach that points toward the feasibility of truly associative processors. Based on an NMOS, 12-transistor associative memory cell [12, 13], a fully parallel associative memory chip was designed and fabricated. Using a 3-μm technology, a 32-word × 136-bit content-addressable array and peripheral registers and logic were successfully integrated on a single 7900 × 9200-μm chip. The main


contribution of the work reported here is the chip's internal architecture. It was optimized to enhance the performance of parallel algorithms for arithmetic and logic operations in the word-oriented memory [7, 8]. Using the chip as a basic building block, a reconfigurable fully parallel associative processor is proposed. It consists of an associative processing unit, a microprogrammed controller, which is also responsible for handling the communications with a host computer, and an interconnection network to enhance data transfer operations between associative words as well as global communications with the host. This interconnection network also enables dynamic reconfiguration of the processor array, broadening the range of applications of the Associative Processor. Image processing, pattern recognition, database management, and computer graphics are just a few of the problems that are efficiently solved by exploiting the parallelism of the associative processor [7].

II. CLASSICAL ASSOCIATIVE MEMORY

An associative memory can be defined as a memory system in which stored data words are identified according to their contents. It accepts as inputs a comparand (or operand) word and a mask word, searches all stored data locations simultaneously for a match between the masked bits of the comparand and the corresponding bits in all the stored words, and identifies matching data words (responders) by setting a marker or tag in a tag register. Similarly, masked bits of the comparand may be simultaneously written into all tagged data locations. Figure 1 shows our definition of a basic classical associative memory block. It consists of an L × K array (A) of content-addressable storage cells which are accessed through the comparand (C) and mask (M) registers. Results of comparisons are recorded in the tag (T) register.
The associative memory model is capable of performing the following primitive operations, for all k = 0, 1, ..., K−1 and l = 0, 1, ..., L−1:

(1) Set tag register: T_l := 1
(2) Shift tag register: T_l := T_{l+1}
(3) First select: T_l := 1, for l = first responder location; T_l := 0, elsewhere
(4) Count responders: count := Σ_l T_l
(5) Load mask: M_k := 0 or M_k := 1 or M_k := I_k
(6) Load comparand: C_k := 0 or C_k := 1 or C_k := I_k
(7) Compare: T_l := T_l · s̄_l, where s_l = Σ_k M_k(A_lk ⊕ C_k)
(8) Write: A_lk := T̄_l A_lk + T_l(M_k C_k + M̄_k A_lk)
(9) Read: O_k := Π_{l: T_l = 1} A_lk

The symbols Σ and Π stand for the Boolean sum and product, respectively, + and ⊕ denote OR and Exclusive-OR, and z̄ denotes the complement of z.
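A minimal software sketch of these primitives may make their semantics concrete. The class and method names below are illustrative (not from the paper), and the loops model in sequence what the hardware does for all words in one cycle.

```python
# Illustrative software model of the classical associative memory primitives:
# an L x K cell array A, a tag register T, and compare/write/count operations.
# The hardware performs each loop body for all words simultaneously.

class ClassicalAM:
    def __init__(self, L, K):
        self.L, self.K = L, K
        self.A = [[0] * K for _ in range(L)]   # storage array
        self.T = [1] * L                       # tag register

    def set_tags(self):                        # (1) T_l := 1
        self.T = [1] * self.L

    def compare(self, C, M):
        # (7) a tagged word stays tagged only if all masked bits match
        for l in range(self.L):
            mismatch = any(M[k] and (self.A[l][k] != C[k]) for k in range(self.K))
            self.T[l] = self.T[l] and not mismatch

    def write(self, C, M):
        # (8) masked comparand bits are written into every tagged word
        for l in range(self.L):
            if self.T[l]:
                for k in range(self.K):
                    if M[k]:
                        self.A[l][k] = C[k]

    def count(self):                           # (4) number of responders
        return sum(self.T)

m = ClassicalAM(L=4, K=4)
m.A = [[0, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1]]
m.set_tags()
m.compare(C=[0, 1, 0, 0], M=[1, 1, 0, 0])   # match all words beginning 01..
print(m.count())                             # 3 responders
m.write(C=[0, 0, 0, 1], M=[0, 0, 0, 1])     # set last bit of all responders
```

Note how masked-off columns (M_k = 0) are ignored by both compare and write, which is exactly what makes the field-oriented arithmetic algorithms below possible.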


FIG. 1. The classical associative memory model.

It is seen that both write and compare operations act only on masked-on bits (M_k = 1) of words (rows) that are tagged (T_l = 1). Hence the set tag register operation is required initially to make all words in memory eligible for processing. The "first select" operation is essential to break ties among multiple responders and enable sequential access to them. The count responders operation generates the sum of all the T register bits (the number of responders). It is useful for gathering statistics on stored data, or for generating the sum of selected data elements stored in memory.

Associative Parallel Arithmetic

It has been claimed that the model above is capable of performing parallel logic and arithmetic among data items stored in the associative array. To illustrate this capability, let us consider addition into an accumulator. Assume that selected words have been logically partitioned into two N-bit fields, say A and B. The add-fields algorithm produces the sum A ← A + B in all the selected words of the associative storage. Using a table-driven compare-modify algorithm, the operation is carried out sequentially by bit, starting from the least significant. As will become apparent, it is convenient to define an operator which generates a binary vector of length K. This operator is denoted by δ and is defined as

δ(k_1, k_2, ..., k_n) → δ_m = 1 for all m = k_1, ..., k_n
                        δ_m = 0 for all m ≠ k_1, ..., k_n.
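In software the δ operator is simply a one-hot-style vector builder; the function below is an illustrative rendering, not part of the original design.

```python
# Sketch of the delta operator: build a K-bit binary vector whose ones are
# exactly at the listed bit positions.

def delta(K, positions):
    v = [0] * K
    for m in positions:
        v[m] = 1
    return v

print(delta(8, [0, 3, 7]))   # [1, 0, 0, 1, 0, 0, 0, 1]
```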


TABLE I
PARALLEL ADDITION IN ASSOCIATIVE MEMORY

Addition Truth Table

A   B   Carry  |  A   Carry
0   0   0      |  0   0       No action needed
0   0   1      |  1   0
0   1   0      |  1   0
0   1   1      |  0   1       No action needed
1   0   0      |  1   0       No action needed
1   0   1      |  0   1
1   1   0      |  0   1
1   1   1      |  1   1       No action needed

Addition Algorithm

STEP   MEMORY                                                   CONTROL
0      C, M := δ(2N+1); T := 1; COMPARE;                        CNT := 0
1      M := δ(2N); WRITE;                                       /* Carry = 0 */
2      M := δ(2N+1, 2N, N+CNT, CNT); C := δ(2N+1, 2N);
       T := 1; COMPARE;                                         /* Is ABC = 001? */
3      C := δ(2N+1, CNT); WRITE;                                /* Yes. AC := 10 */
4      C := δ(2N+1, 2N, CNT); T := 1; COMPARE;                  /* Is ABC = 101? */
5      C := δ(2N+1, 2N); WRITE;                                 /* Yes. AC := 01 */
6      C := δ(2N+1, N+CNT, CNT); T := 1; COMPARE;               /* Is ABC = 110? */
7      C := δ(2N+1, 2N, N+CNT); WRITE;                          /* Yes. AC := 01 */
8      C := δ(2N+1, N+CNT); T := 1; COMPARE;                    /* Is ABC = 010? */
9      C := δ(2N+1, N+CNT, CNT); WRITE;                         /* Yes. AC := 10 */
10                                                              CNT := CNT + 1; IF CNT < N GOTO 2
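The Table I loop can be checked functionally. The sketch below models each of the four compare/write pairs as "select all words matching the pattern (A_i, B_i, carry), then rewrite the A bit and the carry flag"; the dict-of-fields word representation is ours, not the hardware layout.

```python
# Functional sketch of the Table I add-fields loop: bit-serially, four
# compare/write pairs rewrite the A bit and carry flag of every word whose
# (A_i, B_i, carry) pattern requires action (the other four patterns are
# already correct and need no write).

def add_fields(words, N):
    # words: list of {'A': int, 'B': int, 'c': 0/1}; computes A := A + B mod 2**N
    patterns = [                      # (A_i, B_i, c) -> (new A_i, new c)
        ((0, 0, 1), (1, 0)),          # ABC = 001 -> AC := 10
        ((1, 0, 1), (0, 1)),          # ABC = 101 -> AC := 01
        ((1, 1, 0), (0, 1)),          # ABC = 110 -> AC := 01
        ((0, 1, 0), (1, 0)),          # ABC = 010 -> AC := 10
    ]
    for i in range(N):                # least significant bit first
        for (a, b, c), (na, nc) in patterns:
            for w in words:           # COMPARE selects, WRITE updates
                if (w['A'] >> i) & 1 == a and (w['B'] >> i) & 1 == b and w['c'] == c:
                    w['A'] = (w['A'] & ~(1 << i)) | (na << i)
                    w['c'] = nc
    return words

ws = [{'A': 5, 'B': 9, 'c': 0}, {'A': 13, 'B': 3, 'c': 0}]
add_fields(ws, N=4)
print([w['A'] for w in ws])   # [14, 0]  (13 + 3 = 16 wraps mod 16, carry = 1)
```

The pattern order matters: a word rewritten by one compare/write pair never matches a later pattern in the same bit iteration, so no word is processed twice.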

A similar operator, γ, restricted to two parameters, is defined as

γ(k_1, k_2) → γ_m = 1 for all m = k_1, k_2
              γ_m = 0 for all m ≠ k_1, k_2.

The γ operator has only two input parameters, since most associative arithmetic operations are bit-serial and handle at most two operands per word. It is implemented using two decoders to transform two inputs of log₂ K bits each into two vectors of length K, each having only one bit set. Clearly, the K bits required by the δ operator are reduced to 2 log₂ K bits for the γ operator, which allows for encoding and efficient use of the available I/O leads. To further decrease the pin count of the chip, the data bus is time-multiplexed with the γ operator-encoded input data. A bus input operation and a γ load operation cannot occur concurrently.

Associative algorithms depend on the heavy use of flag bits (such as a candidate flag to denote words participating in an operation, or a carry flag to hold the partial carry in arithmetic operations). Multiplexing the data bus between I/O and γ data would necessitate sequential loading of the registers, requiring many cycles. This restriction led us to define a field of the registers as a "flag field" with direct connections to the pins of the chip. The flag field and the data field of the registers can be loaded with different data concurrently, thereby decreasing the delay caused by the need to change the flag bits of the registers every memory cycle.

To enhance execution times of some very basic arithmetic and logical operations, the tag register structure was changed to include two separate registers which allow Boolean operations between bit columns (also called bit slices) of the associative memory array. These registers are the word select (WS) register (to select the words that will participate in array operations) and the


search results (SR) register (to store the result of a compare operation). The Boolean operations logically combine the contents of either the search results register or the word select register with the match data from a current compare operation. The two match registers also allow for selective writing of data to the memory words that are denoted by "true" or "complement" values in the SR register.

Figure 2 shows the block diagram of the new associative memory chip, which we shall describe using realistic numbers for L and K rather than theoretical parameterized dimensions. It consists of an associative memory array (AR) with 32 words of 136 bits each, and a set of registers, namely mask (MK), comparand (CD), word select (WS), and search results (SR). The comparand register contains pattern data for compare or write instructions, while the MK register enables or disables certain bits of the CD register, hence masking the operation on or off at given columns of the memory array. Word selection is achieved by the WS register. Comparison results are recorded in a number of different ways in either the SR or the WS register. Array read operations generate the bit-wise "AND" of the selected words on the data I/O pins. The number of responders to a compare operation is available at the count pins of the chip. It is obtained by a tree adder network which uses 1-bit full adder blocks with the minimum number of levels, enabling the on-chip count operation to finish in one clock cycle.

FIG. 2. Proposed VLSI associative memory architecture.

The MK and CD registers have been divided into data and flag fields. The data field has 128 bits, while the flag field has 8 bits. In turn, the data field has four segments of 32 bits each. These segments are multiplexed to the chip's data I/O pins (D0-D31). Only 32 bits can be loaded at any one time into the segment selected by the field select pins (S0-S1). The flag field (MF and CF) is directly connected to the flag pins of the chip (F0-F7) to facilitate access to frequently used bits in bit-serial algorithms (candidate flag, carry flag, . . .). The MK and CD registers have instructions to SET or CLEAR the whole data register (128 bits) or the bits pointed to by the γ generator. The γ generator input parameters (two 7-bit pointers) are fed through the data I/O pins.

The WS register is used in memory operations to select specific words to be operated upon. If WS_l contains a ZERO, no operation is done on the lth word. It can be set, cleared, or used to store the result of a compare operation. Its contents can also be ANDed with the current compare result (to select words that compared true) or its complement (to select words that compared false). The WS register contents can also be loaded with the SR register value ORed with either the result of the current comparison or its complement.

The result of a compare operation can be stored into the SR register in four modes: the SR register contents can be ANDed or ORed with the current compare result (true or negated). The AND operation "deletes" words from the SR register when they compare false, while the OR operation "adds" words to the SR register if they compare true. The contents of the SR register can be shifted up or down by 1, and pins are provided for chaining multiple chips (H0, H31): a means of bit-serial communication between words in memory. SR register contents are read out through the data I/O pins every cycle except when array data are output. This facility makes possible fast transfers between memory chips. Conversely, the SR register can be loaded through the data I/O pins. The count of ONES in the SR register is seen on the count pins of the chip (C0-C5). This count is the number of responders to a particular compare operation.
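The single-cycle responder count comes from a minimum-depth tree of adders over the 32 SR bits (5 levels for 32 inputs, producing the 6-bit count). The following is only an illustrative software rendering of that reduction, with a hypothetical function name.

```python
# Sketch of the on-chip responder count: the 32 SR bits are summed by a
# minimum-depth binary tree of adders (5 pairwise levels for 32 inputs),
# so the 6-bit count settles within one cycle.

def tree_count(bits):
    level = list(bits)
    while len(level) > 1:
        if len(level) % 2:                 # pad odd-sized levels with zero
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

sr = [1, 0, 1, 1] + [0] * 27 + [1]         # 32 SR bits, four responders
print(tree_count(sr))                      # 4
```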
The first select operation selects the first responder in the SR register. This operation is controlled by the inhibit in (II) signal and generates an inhibit out (IO) signal to denote responder status, using modular look-ahead inhibit generators with minimum delay. These signals can be used in chaining multiple chips together. It should also be noted that pins are provided for instructions to independently control the different registers and the cell array. Thus, flexibility is not sacrificed and maximum concurrency is allowed by providing an interface to a fully horizontal microprogramming scheme.

From the preceding discussion, it can be seen that the proposed associative memory is capable of performing the following primitive operations, for all k = 0, 1, ..., K−1 and l = 0, 1, ..., L−1, with s_l = Σ_k MK_k(AR_lk ⊕ CD_k) denoting a mismatch and s̄_l its complement (a true match):

(1) Set word select register: WS_l := 1
(2) Clear word select register: WS_l := 0
(3) Write word select register: WS := D
(4) Set search results register: SR_l := 1
(5) Clear search results register: SR_l := 0
(6) Shift search results register: SR_l := SR_{l+1}
(7) First select: SR_l := 1, for l = first responder location; SR_l := 0, elsewhere
(8) Count responders: count := Σ_l SR_l
(9) Read search results register: D := SR
(10) Write search results register: SR := D
(11) Load mask register: MK_k := D_k
(12) Clear mask register: MK_k := 0
(13) Set mask register: MK_k := 1
(14) Load mask register flags: MF := F
(15) Load comparand register: CD_k := D_k
(16) Clear comparand register: CD_k := 0
(17) Set comparand register: CD_k := 1
(18) Load comparand register flags: CF := F
(19) Set mask γ bits: MK_m := 1, for the bits m selected by the γ generator
(20) Clear mask γ bits: MK_m := 0, for the bits m selected by the γ generator
(21) Set comparand γ bits: CD_m := 1, for the bits m selected by the γ generator
(22) Clear comparand γ bits: CD_m := 0, for the bits m selected by the γ generator
(23) Compare: AND true into SR: SR_l := SR_l · s̄_l, if WS_l = 1
(24) Compare: OR true into SR: SR_l := SR_l + s̄_l, if WS_l = 1
(25) Compare: AND true into WS: WS_l := WS_l · s̄_l, if WS_l = 1
(26) Compare: AND complement into WS: WS_l := WS_l · s_l, if WS_l = 1
(27) Compare: AND complement into SR: SR_l := SR_l · s_l, if WS_l = 1
(28) Compare: OR complement into SR: SR_l := SR_l + s_l, if WS_l = 1
(29) Compare: true OR SR into WS: WS_l := SR_l + s̄_l, if WS_l = 1
(30) Compare: complement OR SR into WS: WS_l := SR_l + s_l, if WS_l = 1
(31) Write if true: AR_lk := MK_k CD_k + M̄K_k AR_lk, if WS_l = 1 and SR_l = 1
(32) Write if false: AR_lk := MK_k CD_k + M̄K_k AR_lk, if WS_l = 1 and SR_l = 0
(33) Read array: D_k := Π_{l: WS_l = 1} AR_lk

To show the increased performance of the new chip architecture, let us again consider the example of addition into an accumulator. We will again use Table I for the addition truth table and will now take advantage of the y generator and the two match registers. The algorithm, which is given in Table II, first finds words with BC = 01 or 10, and then writes the new value AC = 10 or 01 according to the previous A value. The new algorithm takes 5N cycles to terminate, as opposed to 9N in the classical memory, which is almost a twofold speedup.
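The per-bit work of the new algorithm can be sketched semantically (this models what the Table II microcode achieves, not the microcode itself; the word representation is ours): only words whose (B_i, carry) pattern is 01 or 10 need a write, and for those the A bit is complemented while the carry becomes the old A bit.

```python
# Semantic sketch of the two-register addition of Table II: per bit, words
# with (B_i, carry) = 01 or 10 are selected by the SR/WS compares; for each,
# A_i := not A_i and carry := old A_i. Patterns 00 and 11 already hold the
# correct A bit and carry, so they need no write.

def add_fields_new(words, N):
    for i in range(N):
        for w in words:
            b = (w['B'] >> i) & 1
            if b != w['c']:                   # the x01 / x10 cases
                a = (w['A'] >> i) & 1
                w['A'] ^= (1 << i)            # A_i := not A_i
                w['c'] = a                    # carry := old A_i
    return words

ws = [{'A': 5, 'B': 9, 'c': 0}, {'A': 13, 'B': 3, 'c': 0}]
add_fields_new(ws, N=4)
print([w['A'] for w in ws])   # [14, 0]
```

The selection needs fewer compare/write pairs per bit than Table I, which is where the 5N-versus-9N cycle count comes from.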

IV. FULLY PARALLEL ASSOCIATIVE PROCESSOR (FPAP)

The proposed associative memory can be effectively incorporated in the design of a bit-parallel, word-parallel associative processor. In this section we illustrate this idea by means of example. Consider a 16K-word fully parallel


TABLE II
ADDITION ALGORITHM (NEW)

STEP   MEMORY                                                           CONTROL
0      CD, MK := 0; WS, SR := 1; CF, MF := γ(2N+1); CSAT;               CNT := 0
1      MF := γ(2N+1, 2N); WRITE;                                        /* Carry = 0 */
2      MK := γ(CNT); CD := 0; WS, SR := 1;
       CF, MF := γ(2N+1, 2N); CSAT;                                     /* x01? */
3      CD := γ(CNT); CF := γ(2N+1); CWST;
4      CD, MK := γ(N+CNT); SR := 1; CF, MF := γ(2N+1); CSAT;            /* x10? */
5      CD := 0; CF, MF := γ(2N+1, 2N); WRITE;
6      CD := γ(N+CNT); CF := γ(2N+1); WRITE COMPLEMENT;                 /* 1xx? */
                                                                        CNT := CNT + 1; IF CNT < N GOTO 2

Note. CSAT is "compare into SR register, AND of TRUE match" and CWST is "compare into WS register, OR of SR register and TRUE match."

associative processor consisting of a processing section and a control section. Referring to Fig. 3, the processing section consists of 512 associative memory chips interconnected via a high-speed bus which has two separate paths into the memory. One common bus links all data I/O pins to external data, while the other links the chips to a shifter interconnection network. An optional adder network can be used to integrate the count of responders to a given compare operation. If multiple-response resolution is necessary, a simple resolver network handles the first responder selection between associative memory chips.

FIG. 3. The fully parallel associative processor (FPAP).

For the associative processor, a clock frequency of 10 MHz has been chosen. This means a cycle time of 100 nsec for the associative memory, which can be attained using current VLSI technology.

The shifter network shown can be thought of as a collection of 32 × 512-bit circular shift registers. Each associative memory chip has connections through the data I/O pins to a bit slice of the shift register. In one clock cycle, every memory chip can load/store 32 bits of data from/to the shifter network registers. The shifter network data can be shifted up or down for external loading/storing or between-chip transfers. Loading the 16K words of memory with 32-bit data values would be accomplished by transferring 512 32-bit data items from the external source to the shifter network with shift operations and then transferring the data from the shifter network to the 512 memory chips in one clock cycle. This process has to be repeated 32 times to load all the associative words with 32-bit data from the external source, providing an order-of-magnitude improvement in data transfer speeds over a "common bus" scheme (up to 320 Mbytes/sec). Aside from its use in loading or storing of external data, the shifter network serves also as the communication link between the associative memory chips. It has the ability to shift the data in its registers by 1, 2, 3, 4, 5, 6, 7, 8, 16, 32, and 64 places in one clock cycle and can be reconfigured for a different count every clock cycle. The shifter network can be implemented in VLSI using only simple shift registers and multiplexers. A 64-word × 1-bit shifter chip will require eight 8-bit shift registers, eight 4-to-1 multiplexers, and 86 pins. Only 256 such chips will be required for the shifter network.
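A small scheduling model may help here. It assumes a greedy decomposition of a displacement into the single-cycle shift counts listed above, plus two cycles of transfer overhead per between-chip route; the overhead figure is our assumption, chosen because it reproduces the per-bit routing costs quoted below (25 cycles for sd = 10,000, and the 7- and 9-cycle 128 × 128 examples). Both function names are illustrative.

```python
# Hypothetical model of shifter-network scheduling: a word displacement is
# decomposed greedily into the single-cycle shift counts {1..8, 16, 32, 64};
# a between-chip route costs 2 overhead cycles plus one cycle per step, and
# (sd mod 32) cycles of SR-register shifting finish the route, per bit.

ALLOWED = [64, 32, 16] + list(range(8, 0, -1))

def shift_schedule(words):
    d, steps = words % 512, []
    while d:
        s = next(s for s in ALLOWED if s <= d)   # largest usable count
        steps.append(s)
        d -= s
    return steps

def routing_cycles_per_bit(sd):
    intra = sd % 32                    # within-chip shifts via the SR register
    chip_words = sd // 32              # whole-word distance for the network
    inter = 2 + len(shift_schedule(chip_words)) if chip_words else 0
    return intra + inter

print(routing_cycles_per_bit(10_000))   # 25
print(routing_cycles_per_bit(100))      # 7
```

Under this greedy schedule no displacement in the 512-word network needs more than 11 elementary shifts, comfortably inside the 2 + (512/64) + 4 = 14-cycle worst-case bound stated in the text.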
Using the shift capabilities of the shifter network and the SR register, the Associative Processor can be configured to operate as either a linear or a rectangular array of processors. The rectangular array configurations are obtained in row-major order. The simulated processors are mapped onto FPAP as shown in Fig. 4. The shift operations within a memory chip are handled by the SR register, and the shift operations between memory chips are handled by the shifter network.

FIG. 4. Rectangular array configurations mapped onto FPAP.

Let sd denote the desired shift distance in a routing operation which transfers data from all processors to the processors a distance sd apart. For a linear array configuration, if sd is less than 32, only the SR register is used for routing. If sd is greater than 32, the shifter network is used to transfer data from all processors to those processors that are a multiple of 32 units apart; the SR register is used to complete the data routing within a chip. For rectangular array configurations, parallel row transfers are handled by the shifter network, again shifting all the words in parallel by the required amount. Column transfers may require the use of the shifter network if sd is greater than 32. The SR register again is used to shift the data within a chip, as in the linear array configuration. Since an associative memory chip has 32 words, linear shifts within a chip will take at most (sd modulo 32) cycles per bit (circular shifts can be done in at most 3 + (sd modulo 16) cycles). The shifter network, which has 512 words, will take at most 2 + (512/64) + 4 = 14 cycles to shift linearly by any number of words. For any array configuration, the data can be routed in parallel between two processors in at most (sd modulo 32) + 14 cycles per bit. For a linear array configuration, a shift operation with sd = 10,000 will take 25 cycles per bit. For a 128 × 128 rectangular array configuration, all columns can be routed horizontally 100 processor units in 7 cycles per bit. Note that it takes 9 cycles per bit to shift all rows the same distance in the vertical direction.

The reconfiguration of the Associative Processor is dynamic, in the sense that the array configuration (which is determined only by the SR register and shifter network shift counts) must be encoded in the algorithms. The same data can be processed with a 1 × 16,384, 32 × 512, 64 × 256, or 128 × 128 mesh-connected FPAP. Thus a 128 × 128 array configuration can emulate mesh-connected computers (such as MPP) with much faster routing times, considerably reducing the communications overhead. Also, the edge connections of the rectangular configurations can be programmed to generate


arrays that look like a plane, a cylinder, or a torus. The disadvantage of the reconfigurability is the loss of the location independence of data that is characteristic of an associative memory system.

Figure 3 shows that a tree adder network is used to count responders. Every memory chip has a 6-bit count output which will generate valid data before the end of the current compare cycle. These outputs can then be fed into nine adder stages to get the final 15-bit sum of the responders. This nine-level sum can be generated in one clock cycle with current technology, which means that the responder count operation will require two cycles to terminate. If the operation of an algorithm depends on the availability of the count, the algorithm should delay its operation one cycle for the sum to become valid. For some operations this wait is not necessary, since the count result can be received by the host computer independently of the operation of the associative processor. Image histogramming is such an example of a pipelined compare and count operation. The algorithm will compare the memory contents with different data every clock cycle without waiting for the count result, since the resulting count of responders will be picked up by the host with a one-cycle pipeline delay.

The first select operation is accomplished by using modular look-ahead inhibit signal generators. The inhibit out lines of all the chips are connected to the levels of the inhibit signal generators, which generate the inhibit in signals for all the chips in three gate delays. This operation will take one cycle after a compare operation, requiring a delay of one cycle to be inserted into the operation of the algorithm when the first select is taking place. The last inhibit out line, if active, indicates that there are no responders.
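The pipelined histogramming mentioned above can be sketched as a one-stage pipeline: a new gray level is compared every cycle while the host picks up the previous level's responder count. The function and timing model below are illustrative only.

```python
# Sketch of pipelined compare-and-count histogramming: each loop iteration
# models one cycle in which a new gray-level compare is issued while the
# host reads the responder count of the previous compare (one-cycle delay).

def histogram(pixels, levels):
    hist, pending = [0] * levels, None     # pending = (level, count) in flight
    for g in range(levels):
        if pending is not None:            # host reads last cycle's count
            lvl, cnt = pending
            hist[lvl] = cnt
        pending = (g, sum(1 for p in pixels if p == g))   # issue next compare
    lvl, cnt = pending                     # drain the pipeline
    hist[lvl] = cnt
    return hist

pix = [0, 2, 2, 1, 3, 2, 0]
print(histogram(pix, 4))   # [2, 1, 3, 1]
```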
To relieve the host of the burden of controlling the associative processor, a microprogrammed (fully horizontal) controller is utilized to sequence the associative processor and handle the interface with the host. The host interface consists of a set of registers, as seen in Fig. 3. The OPCODE register holds the opcode of the processor instruction that has been programmed into the microprogram unit. γA and γB are two 7-bit up/down counters which contain the γ operator pointer data. They constitute the dynamic parameters passed to the microprogram control to indicate the position of operand fields in the associative memories. The CNT register (a down counter) also holds a dynamic parameter, indicating the field length over which the macroinstruction will be executed. After these registers are loaded by the host with appropriate data for an operation, the microprogrammed controller will utilize them to control the associative processor. The SS (Segment Select) register selects the 32-bit segment in the 128-bit memory word and enables load and store operations on that segment. The INPUT register holds data to be stored into the memory, and the OUTPUT register holds data read from the associative memory. An RSP register is available to hold the response count.


The timing relationship of the different components of the subsystem has been carefully investigated to attain maximum throughput. The associative memory has been designed to facilitate the concurrent operation of the mask, comparand, search results, and word select registers with the storage array. When the clock is high, the register operations occur concurrently. When the clock is low, the memory operation is done with register data. This greatly increases the speed of operation.

V. COMPUTING ON THE ASSOCIATIVE PROCESSOR

Associative algorithms have been under investigation for the last 25 years. Logical and arithmetic algorithms, which include bitwise logical and comparison operations and simple integer operations (addition, subtraction, multiplication, and division), have been devised for different associative architectures. Although the inherent nature of the associative memory lends itself to parallel computations, to the best of our knowledge floating point algorithms have been reported only by Thurber and Patton in [15]. This might be the result of the difficulty of shifting the mantissae in the floating point addition and subtraction algorithms: different mantissae might require shifting by different amounts, turning the parallel application into a sequential one.

The algorithms for our associative processor architecture are implemented as macroinstructions by writing microprograms for the microcontroller unit. For logical operations, the extra functionality of the MK and CD registers is used to its full extent to achieve the speed required of a parallel processor. For integer operations, two tag registers are used to minimize the number of operations done for each bit. The addition algorithm (Table II) takes five memory cycles for each bit of the operands. Assuming a cycle time of 100 nsec, this comes to 8.2 μsec to add two 16-bit integers. If we use 16K words, we have a peak processing power of 2000 Million Operations Per Second (MOPS).

For floating point operations, a novel shift-normalization method exploits the parallelism of the associative memory and obtains reasonably fast running times. To multiply two floating point numbers, exponents are first added using the add-fields algorithm. The mantissae are then multiplied by repetitive application of add-fields, and the resulting product is finally normalized (at most a 1-bit shift is required).
This multiplication operation is straightforward in the sense that it is divided into simpler integer operations. Floating point addition, however, needs alignment of the fractions. The exponent’s difference is computed at all participating word locations to determine the alignment distance. Mantissae are then shifted in a bit-serial fashion. Alignment proceeds as follows:


The least significant bit of the difference is checked for a ONE. For all responder words, the 23-bit mantissa is shifted right by a 1-bit position, producing a 24-bit fraction. An N-bit field can be shifted in 2N cycles, using one compare and one write instruction per bit. The second bit of the difference is considered next in all the words in the memory. All the (24-bit) mantissae are shifted right by 2 bits if the difference bit is a ONE. The mantissa is now 26 bits. This operation is applied to all the bits of the difference (5 bits) and the alignment is finished. The aligned mantissae are then added in the usual integer fashion and normalized.

Floating point operations can use the standard floating point number representation with different numbers of bits for the mantissa and the exponent. For the following timing analysis, m and e denote the number of mantissa and exponent bits, respectively.

Floating point operation    Timing (clock cycles)
Addition                    39m + 23e - 100
Subtraction                 39m + 23e - 100
Multiplication              5m^2 + 5m + 8e + 4
Division                    7m^2 + 10m + 8e + 4
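The bit-by-bit alignment described above can be sketched directly: for each bit k of the exponent difference, every responder word shifts its mantissa by 2^k, so after all difference bits are examined each word has shifted by exactly its own distance, even though every word obeys the same instruction stream. The sketch below is illustrative: it discards shifted-out bits, whereas the algorithm above widens the mantissa field instead.

```python
# Sketch of shift-normalization alignment. exp_diffs[w] is the exponent
# difference already computed in word w; responders (words whose current
# difference bit is 1) shift right by 2**k in "parallel".
def align_mantissas(mantissas, exp_diffs, diff_bits=5):
    out = list(mantissas)
    for k in range(diff_bits):              # examine difference bit k
        for w in range(len(out)):           # conceptually parallel
            if (exp_diffs[w] >> k) & 1:     # responder words only
                out[w] >>= 2 ** k           # shift by 2**k positions
    return out

print(align_mantissas([0b110000, 0b110000, 0b110000], [0, 1, 3]))
# [48, 24, 6]: each word shifted by its own exponent difference
```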

For a floating point representation with 1 sign bit, 8 exponent bits, and 23 mantissa bits, floating point addition has been estimated to complete in 95 μsec. If we add 16K 32-bit floating point numbers in parallel, we obtain a peak processing power of 170 MFLOPS. The performance of selected arithmetic and logic algorithms that will eventually become the basis for the macrolanguage of the associative processor has been evaluated. A summary of results is given in Table III.
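The quoted figures can be checked against the timing formulas above. With m = 23 and e = 8, addition takes 39m + 23e - 100 = 981 cycles; at 100 nsec per cycle and 16,384 words in parallel this gives roughly 167 MFLOPS, consistent with the Table III entry and close to the 170 MFLOPS estimate.

```python
# Check of the floating point addition timing, assuming the formula
# 39m + 23e - 100 with m = 23 mantissa bits and e = 8 exponent bits.
m, e = 23, 8
cycles = 39 * m + 23 * e - 100          # 981 cycles
seconds = cycles * 100e-9               # 100 nsec cycle time
mflops = 16384 / seconds / 1e6          # 16K words operate at once
print(cycles, round(mflops))            # 981 167
```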

VI. ASSOCIATIVE PROCESSOR APPLICATIONS

The applications in this section are drawn from a wide range of topics, demonstrating the applicability of FPAP to different problems. We believe that FPAP can be used as a general-purpose parallel processor as long as the work is balanced between parallel operations on the Associative Processor and sequential operations on the host.

Computer Graphics-Ray Tracing

Ray tracing is a technique to render images on a CRT screen. Using massive processing power, ray tracing creates the most realistic graphics images. The application of FPAP to ray tracing will be considered next. Ray tracing requires a sequence of operations to be performed for every pixel of the screen. First, for the pixel under consideration, a ray will be fired. The intersection of this ray with all the objects has to be calculated.


TABLE III
ALGORITHM TIMINGS

                                  Processing power (t = 100 nsec, n = 16,384)
Algorithm        Time (cycles)       m = 8        m = 16       m = 32
Addition         5m                  4096 MOPS    2048 MOPS    1024 MOPS
Subtraction      5m                  4096 MOPS    2048 MOPS    1024 MOPS
Multiplication   5m^2 + 3m + 18      453 MOPS     122 MOPS     31 MOPS
Division         7m^2 + 36m + 31     212 MOPS     68 MOPS      20 MOPS
Increment        4m                  5120 MOPS    2560 MOPS    1280 MOPS
Complement       3m                  6827 MOPS    3414 MOPS    1706 MOPS
Negation         5m                  4096 MOPS    2048 MOPS    1024 MOPS
Comparison       2m + 1              9638 MOPS    4819 MOPS    2410 MOPS
Floating point (m = 23, e = 8)
Addition         981                                           167 MFLOPS
Subtraction      981                                           167 MFLOPS
Multiplication   2750                                          58 MFLOPS
Division         4001                                          41 MFLOPS

The intersected objects will then be sorted and the object closest to the viewer will be selected. The incoming ray may be reflected (refracted) from the surface, which will result in a new ray to be traced as before. For the following discussion the objects are assumed to be composed of triangles. One associative word will contain the x, y, and z coordinates of the three vertices of a triangle and the plane equation coefficients A, B, C, and D, as well as the intersection point z value. The intersection calculation requires the generation of the plane equations of the triangles. The plane equations can be obtained by the use of the following:

A*x + B*y + C*z + D = 0
A = (yi - yj) * (zi + zj)
B = (zi - zj) * (xi + xj)
C = (xi - xj) * (yi + yj)
D = -(A*x + B*y + C*z).

For triangles, these plane equations can be generated with 26 additions and 9 multiplications per triangle. The intersection calculation also needs the value of the intersection point of the ray with the plane of the triangle, which can be obtained from the plane equation

z = -(A*x + B*y + D)/C.

This calculation needs 2 additions, 2 multiplications, and 1 division, again for each triangle.
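One common reading of the coefficient formulas is that the (i, j) terms are summed over the three triangle edges (Newell's method); the sketch below takes that reading. The vertex data and function name are illustrative, and the check simply verifies that every vertex satisfies the resulting plane equation.

```python
# Sketch: plane coefficients A, B, C, D for one triangle, summing the
# per-edge terms (yi - yj)(zi + zj), etc., over the three edges.
def plane_of_triangle(verts):
    A = B = C = 0.0
    for i in range(3):
        xi, yi, zi = verts[i]
        xj, yj, zj = verts[(i + 1) % 3]   # next vertex, wrapping around
        A += (yi - yj) * (zi + zj)
        B += (zi - zj) * (xi + xj)
        C += (xi - xj) * (yi + yj)
    x, y, z = verts[0]
    D = -(A * x + B * y + C * z)
    return A, B, C, D

A, B, C, D = plane_of_triangle([(0, 0, 1), (1, 0, 1), (0, 1, 1)])
for (x, y, z) in [(0, 0, 1), (1, 0, 1), (0, 1, 1)]:
    print(A * x + B * y + C * z + D)      # 0.0 for each vertex
```

For this triangle (lying in the plane z = 1) the routine yields A = 0, B = 0, C = 1, D = -1, as expected.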


Finally, we must apply a containment test to check if the intersection point lies within the triangle. This test requires the calculation of the sign of the following equation for every edge of each triangle:

s = (y - y1) * (x2 - x1) - (x - x1) * (y2 - y1).
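The sign of s tells which side of the edge from (x1, y1) to (x2, y2) the point (x, y) lies on; a point inside the triangle gets the same sign for all three edges. A minimal sketch of this test, with illustrative coordinates:

```python
# Sketch of the edge containment test on one triangle.
def edge_sign(x, y, x1, y1, x2, y2):
    return (y - y1) * (x2 - x1) - (x - x1) * (y2 - y1)

def inside(x, y, tri):
    # Same sign (or zero) against all three edges => inside the triangle.
    signs = [edge_sign(x, y, *tri[i], *tri[(i + 1) % 3]) for i in range(3)]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

tri = [(0, 0), (4, 0), (0, 4)]
print(inside(1, 1, tri), inside(5, 5, tri))   # True False
```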

The containment test requires 12 additions, 6 multiplications, and 3 divisions per triangle. Once all the intersections are calculated, the triangle with the maximum intersection value has to be selected using the parallel compare and responder count operations of the Associative Processor. As can be seen, ray tracing requires a considerable amount of processing for each triangle of the image. Assuming there are enough bits per word to store one triangle and its temporary data, FPAP needs to execute these triangle operations only once, in parallel for all the triangles. Using 16K words, we can process images with 16K triangles, which is reasonable for complex pictures.

Image Processing-Template Matching

This image processing operation searches for occurrences of a rectangular template in the image. A 128 × 128-pixel image and a 7 × 7-pixel template with 16 bits per pixel are common values for the template matching operation. A fast searching scheme using the Associative Processor is given next.

Assume the 128 × 128 image is loaded into FPAP in row major order, one pixel per word. Also assume that the template is 7 × 7 pixels. The search will begin with the (7, 7) element of the template and continue by decreasing column values until the first column is reached. Then the previous row will be processed, starting at (6, 1), with increasing column values. At the end of this snakelike search, the SR register of FPAP will contain ones at those locations which matched against the template. The algorithm is as follows:

(1) Set all bits of the SR register. This means that every word has the possibility of matching an element of the template. Set i and j to 7.
(2) Compare the (i, j) element of the template to the Associative Processor words with SR values of 1. This step changes the contents of the SR register, so that a word is cleared if there is a mismatch to the template value.
(3) If i is odd, shift the SR register contents up by 1. Else, shift the SR register contents down by 1. This step indexes all the possible words to be considered next.
(4) If i is odd and j is not 1, decrement j and go back to step 2. If i is even and j is not 7, increment j and go back to step 2. Otherwise, shift the SR register contents up by 128 using the shifter network. If i is not 1, decrement i and go back to step 2.


(5) This step is reached if i and j are both 1. This is the end of the search. The SR register will contain a 1 at a word if and only if the template is matched to the image starting from that point.

As can be seen, the template matching operation can be accomplished with simple compare and shift instructions. One row of the template matching problem requires eight Associative Processor cycles. With the setup of FPAP, total template matching will require 7 * 8 + 2 = 58 cycles. This gives a processing rate of 2824 million pixels per second.

List Processing

A growing number of artificial intelligence applications require fast list processing languages. This requirement is met by programming in Lisp, preferably on special Lisp machines. This section deals with the implementation of lists and primitive list operators on the Associative Processor.

An atom is an indivisible entity which has a specific value or meaning; A, FOO, and CC are examples of atoms. A list is a collection of atoms, separated into groups by the use of parentheses; (A FOO) and (A B (FOO CC)) are examples of lists. Lisp defines some primitives to operate on lists. CAR returns the first element of a list. CDR returns a list containing all except the first element. CONS constructs a new list by inserting a new first element into a list. Lists can be represented in many ways. One possible representation is a binary tree. A node can be a list pointer node (which has two pointers to other nodes) or an atom node (which is a leaf of the tree). Figure 5 shows the binary tree representation of the list (A (B C) (D E F) G).

FIG. 5. Tree representation of the list (A (B C) (D E F) G).
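One way to read the binary tree of Fig. 5 is as a chain of cons cells, with the left pointer naming an element and the right pointer the rest of the list. The sketch below models this with Python tuples; it illustrates the CAR/CDR/CONS semantics only, not the FPAP encoding.

```python
# Sketch: the list (A (B C) (D E F) G) as nested (left, right) cells.
tree = ('A', (('B', ('C', None)),
              (('D', ('E', ('F', None))),
               ('G', None))))

def car(cell):  return cell[0]     # first element
def cdr(cell):  return cell[1]     # rest of the list
def cons(x, l): return (x, l)      # prepend a new first element

print(car(tree))                   # A
print(car(car(cdr(tree))))         # B  (first atom of the sublist (B C))
print(car(cons('Z', tree)))        # Z
```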


In the Associative Processor, one node will be stored in one word. Each word will have fields to define the type of node (list or atom), the node name or number, the left and right child names, the value of an atom, and a status bit to help in garbage collection. Figure 6 shows the representation of the list (A (B C) (D E F) G) on FPAP. As can be seen, the pointers are replaced with node names, since traversing a tree is done using content searches, not linear addressing.

Implementation of the CAR, CDR, and CONS primitives is straightforward. CAR is obtained by first searching for node number 0, which is the root. The left child name of the root points to the first list node of the list. The left child name value of this list node is the required pointer. CDR is obtained by the same procedure, except that the right child name value is used at the last step. For CONS, the element to be added is first stored into the Associative Processor as an atom node. Then a list node is created to link this atom to the old list. The left child name of the root is transferred to the newly created list node, and the left child name of the root is then changed to point to the new list node. As can be seen, these operations can be accomplished by simple searches on FPAP. CAR and CDR each need two search and two read operations.

FIG. 6. FPAP representation of the list (A (B C) (D E F) G).


CONS needs two node creations, one search, and three write operations. So, the effective operation rate for CAR and CDR is 40,000 MOPS. CONS will need approximately 10 cycles, with an operation rate of 16,000 MOPS. Apart from the speed of operation, garbage collection is simplified. Since no ordering of nodes in the Associative Processor is necessary (location independence), no free storage linked list is necessary. The status bit is used to denote used or empty status of a node. Garbage collection can just reset the status bits of unused entries to empty.
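The node-table scheme above can be sketched with a dictionary standing in for the associative name search; the field names, node numbers, and helper functions below are illustrative, not FPAP's actual word layout.

```python
# Sketch: list nodes stored as content-searchable words. A lookup by
# name models one associative compare-and-read cycle; freeing a node
# is a single status-bit write, with no free list to maintain.
USED, EMPTY = 1, 0
words = {}   # name -> {'type', 'left', 'right', 'status'}

def store(name, kind, left=None, right=None):
    words[name] = {'type': kind, 'left': left, 'right': right, 'status': USED}

def search(name):                         # associative compare + read
    return words[name]

def fpap_car():
    first_cell = search(0)['left']        # root's left child: first cell
    return search(first_cell)['left']     # that cell's left child: CAR

def fpap_cons(atom, new_cell):
    store(atom, 'atom')                           # create the atom node
    store(new_cell, 'list', atom, search(0)['left'])  # link to old list
    words[0]['left'] = new_cell                   # root -> new cell

store(0, 'list', 1)          # root
store(1, 'list', 'A', 2)     # first cell of (A ...)
store('A', 'atom')
print(fpap_car())            # A
fpap_cons('Z', 99)
print(fpap_car())            # Z
words['A']['status'] = EMPTY # garbage collection: one status write
```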

REFERENCES

1. Batcher, K. E. STARAN parallel processor system hardware. Proc. 1974 National Computer Conference (AFIPS), pp. 405-410.
2. Batcher, K. E. Design of a massively parallel processor. IEEE Trans. Comput. C-29, 9 (1980), 836-840.
3. Batcher, K. E. Bit serial parallel processing systems. IEEE Trans. Comput. C-31, 5 (1982), 377-384.
4. Estrin, A., and Fuller, R. Algorithms for content-addressable memories. Proc. 1963 IEEE Pacific Computer Conference, pp. 241-253.
5. Falkoff, A. D. Algorithms for parallel search memories. J. Assoc. Comput. Mach. 9, 5 (1962), 488-511.
6. Foster, C. C. Content Addressable Parallel Processors. Van Nostrand-Reinhold, Princeton, NJ, 1976.
7. Ilgen, S. A reconfigurable fully parallel associative processor. Ph.D. thesis, Department of Electrical and Computer Engineering, University of California, Santa Barbara, 1987.
8. Ilgen, S., and Scherson, I. D. Parallel processing on VLSI associative memory. Proc. International Conference on Parallel Processing, Chicago, 1987.
9. Koo, J. T. Integrated circuit CAM. IEEE J. Solid State Circuits SC-5, 2 (1970), 208-215.
10. Mundy, J. L., and Joynson, R. E. A high density associative memory. Proc. 1971 IEEE Computer Society Conference, pp. 189-190.
11. Parhami, B. Associative memories and processors: An overview and selected bibliography. Proc. IEEE 61, 6 (1973), 722-730.
12. Ruhman, S., and Scherson, I. D. Associative memory cell and memory unit including same. U.S. Patent No. 4,404,653, Sept. 1983.
13. Scherson, I. D. Multi-operand associative processing and application to tomography and computer graphics. Ph.D. thesis, Department of Applied Mathematics, Weizmann Institute of Science, 1983.
14. Slade, A., and McMahon, H. O. A Cryotron catalog memory system. Proc. 1956 Eastern Joint Computer Conference, pp. 115-120.
15. Thurber, K. J., and Patton, P. C. Hardware floating point arithmetic on an associative processor. Proc. 1972 IEEE Compcon, pp. 275-278.
16. Yau, S. S., and Fung, H. S. Associative processor architecture-A survey. Comput. Surveys 9, 1 (1977), 3-27.
