The MorphoSys Parallel Reconfigurable System

June 7, 2017 | Author: Fadi Kurdahi | Category: System Simulation, High performance, Parallel Systems


Guangming Lu, Hartej Singh, Ming-hau Lee, Nader Bagherzadeh, Fadi Kurdahi
Department of Electrical and Computer Engineering, The University of California at Irvine, Irvine, CA 92715 USA
fglu, hsingh, mlee, nader, [email protected]

Eliseu M. C. Filho
Department of Systems and Computer Engineering, COPPE/Federal University of Rio de Janeiro, P.O. Box 68511, 21945-970 Rio de Janeiro, RJ, Brazil
[email protected]

Abstract

This paper introduces MorphoSys, a parallel system-on-chip which combines a RISC processor with an array of coarse-grain reconfigurable cells. MorphoSys integrates the flexibility of general-purpose systems with the high performance levels typical of application-specific systems. The first MorphoSys prototype is currently at an advanced stage of implementation and will operate at 100 MHz. Simulation results presented here show significant performance enhancement for different classes of applications, as compared to conventional architectures.

Submitted to the 1999 EuroPar Conference Conference Track #9: Parallel Computer Architecture


1 Introduction

Computing systems based on the traditional von Neumann model provide a single, generic computational substrate for applications with diverse characteristics. These systems have wide applicability but, due to their generality, they may not match the computational needs of many applications. At the other end of the spectrum, there are systems with architectures customized for particular applications. These systems are built around one or more Application-Specific Integrated Circuits, or ASICs. The architecture of an ASIC exploits intrinsic characteristics of an application's algorithm to achieve high performance. However, the direct architecture-algorithm mapping restricts the range of applicability of ASIC-based systems.

A reconfigurable computing system is a hybrid approach between the design paradigms of general-purpose systems and application-specific systems. Such systems combine a software-programmable processor and a reconfigurable hardware component which can be customized for different applications. This combination allows reconfigurable systems to achieve performance levels much higher than those obtained with general-purpose systems, with greater flexibility than that offered by application-specific systems.

This paper introduces the MorphoSys parallel reconfigurable system. MorphoSys (Morphoing System) features a novel architecture for reconfigurable computing systems. It has promising potential to satisfy the increasing demands for high-performance, low-cost stream/frame data processing. It is primarily targeted at applications with inherent parallelism, high regularity, word-level granularity and a computation-intensive nature. Some examples of such applications are video compression, image processing, multimedia and data security. However, MorphoSys is flexible enough to support bit-level and irregular applications.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of reconfigurable computing systems in general. Section 3 presents the MorphoSys architecture and emphasizes its unique features. Section 4 discusses the status of the MorphoSys prototype currently under development. Section 5 shows performance figures for important applications mapped to MorphoSys. Finally, Section 6 presents the main conclusions.

2 Background

The basic architecture of a parallel reconfigurable system [1] comprises a software-programmable core processor and a reconfigurable hardware component. The core processor executes sequential tasks of the application and controls data transfers between the programmable hardware and data memory. In general, the reconfigurable hardware is dedicated to exploiting the parallelism available in the application's algorithm. This hardware typically consists of a collection of interconnected reconfigurable elements. Both the functionality of the elements and their interconnection are determined through a special configuration program, called the context.

We introduce a set of criteria for classifying reconfigurable system designs. These are: granularity, depth of programmability, reconfigurability and interface coupling.

System granularity is defined by the internal structure of the reconfigurable elements. In fine-grain systems, these elements are composed of logic gates and flip-flops. Each element operates at the bit level, implementing a boolean function or a finite-state machine. Examples of such reconfigurable systems are Splash [2] and DECPeRLe-1 [3]. In coarse-grain systems, the configurable elements contain complete functional units, like ALUs and/or multipliers, that operate upon multiple-bit data words. Matrix [4] is an example of a coarse-grain reconfigurable system.

In terms of depth of programmability, a reconfigurable system may have a single context or multiple contexts. For single-context systems, only one configuration program (context) may be resident in the system. In this case, the system's functionality is limited to the context currently loaded. In multiple-context systems, by contrast, several contexts can be resident in the system at once. This allows execution of different tasks simply by changing the operating context.

Reconfigurability pertains to the ability of the system to overlap execution with the loading of a new context. In statically reconfigurable systems, reconfiguration of the programmable hardware can occur only if the current execution is interrupted or when it finishes. On the other hand, in dynamically reconfigurable systems, reconfiguration can be done concurrently with execution.

The interface coupling of a reconfigurable system refers to the level of integration of the core processor and the reconfigurable hardware. The system is tightly-coupled if the core processor and the programmable component reside on the same chip. The system is loosely-coupled if the core processor and the programmable logic are implemented as separate devices.
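The four classification criteria can be captured as a small record type. The following is only an illustrative sketch: the class and field names are hypothetical, and the values entered for MorphoSys are the ones the paper itself states (coarse-grain, 32 resident contexts, dynamically reconfigurable, tightly-coupled).

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    FINE = "fine-grain"      # bit-level elements (logic gates, flip-flops)
    COARSE = "coarse-grain"  # word-level elements (ALUs, multipliers)


class Reconfigurability(Enum):
    STATIC = "static"    # reconfigure only when execution is interrupted or finished
    DYNAMIC = "dynamic"  # reconfigure concurrently with execution


class Coupling(Enum):
    TIGHT = "tightly-coupled"  # processor and reconfigurable hardware on one chip
    LOOSE = "loosely-coupled"  # implemented as separate devices


@dataclass(frozen=True)
class SystemClass:
    """Hypothetical record holding one system's position on the four criteria."""
    name: str
    granularity: Granularity
    resident_contexts: int  # depth of programmability
    reconfigurability: Reconfigurability
    coupling: Coupling

    @property
    def multiple_context(self) -> bool:
        # More than one resident context means a multiple-context system.
        return self.resident_contexts > 1


morphosys = SystemClass(
    name="MorphoSys",
    granularity=Granularity.COARSE,
    resident_contexts=32,
    reconfigurability=Reconfigurability.DYNAMIC,
    coupling=Coupling.TIGHT,
)
```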

3 The MorphoSys System

Figure 1 shows the MorphoSys architecture. It comprises five components: the core processor, the Reconfigurable Cell Array (RC Array), the Context Memory, the Frame Buffer and a DMA Controller. Some features described here have evolved from experience with the current MorphoSys prototype, and will be implemented in the next version in order to enhance mapping flexibility.

[Figure 1: block diagram of the TinyRISC core processor, the RC Array (8 x 8), the Frame Buffer (2 x 128 x 64), the Context Memory (2 x 8 x 16), the DMA Controller and main memory.]

Fig. 1. Architecture of the MorphoSys reconfigurable system.

3.1 Core Processor

The core processor, also known as TinyRISC, is a MIPS-like processor with a 4-stage scalar pipeline. It has sixteen 32-bit registers and three functional units: a 32-bit ALU, a 32-bit shift unit and a memory unit. An on-chip data cache memory minimizes accesses to external main memory. In addition to typical RISC instructions, TinyRISC's ISA is augmented with specific instructions for controlling other MorphoSys components. These special instructions fall into two categories: DMA instructions and RC Array instructions. DMA instructions initiate data transfers between main memory and the Frame Buffer, and context loading from main memory into the Context Memory. RC Array instructions control the operation of the reconfigurable component (the RC Array), by specifying the context and the broadcast mode (cf. Subsection 3.3).

3.2 Reconfigurable Cell

The Reconfigurable Cell (RC) is the basic programmable element in MorphoSys. As Figure 2 shows, each RC comprises five components: the ALU-Multiplier, the shift unit, the input multiplexers, a register file with four 16-bit registers and the context register. There are 64 RCs, arranged as an 8 × 8 matrix called the RC Array (cf. Subsection 3.3).

The ALU-Multiplier has four data input ports. Two 16-bit ports receive data from the input multiplexers, one 32-bit port takes data from the output register and a 12-bit port takes an immediate value in the context word. In addition to standard arithmetic and logical operations, the ALU-Multiplier can perform a multiply-accumulate operation in a single cycle. The shift unit is also 32 bits wide.

[Figure 2: block diagram of the RC, showing the operand bus, the context register (fields wr_exp, rs_ls, reg, scnt, mux_a, mux_b, alu_op, imm), MUX A, MUX B, the register file (R0 to R3), the ALU-Multiplier, the shifter and the output register, whose value goes to the result bus, the express lanes and other RCs.]

Fig. 2. Architecture of the Reconfigurable Cell (RC).

In the current MorphoSys prototype, the ALU-Multiplier operates only on signed numbers. However, several important applications, such as data encryption/decryption (cf. Subsection 5.2), involve multiplication of unsigned numbers. Therefore, the ALU-Multiplier will be extended to operate on both signed and unsigned values in the next implementation of MorphoSys.

The input multiplexers select one of several inputs for the ALU-Multiplier. Multiplexer MUX A selects one input from: (1) the four nearest neighbors in the RC Array, (2) other RCs in the same row/column within the same RC Array quadrant (cf. Subsection 3.3), (3) the operand data bus (cf. Subsection 3.5), or (4) the internal register file. Multiplexer MUX B selects one input from: (1) three of the nearest neighbors, (2) the operand bus, or (3) the register file.

The context register provides control signals for the RC components through the context word. The bits of the context word directly control the input multiplexers, the ALU-Multiplier and the shift unit. The context word determines the destination of a result, which can be a register in the register file and/or the express lane buses (cf. Subsection 3.3). The context word also has a field for an immediate operand value.
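To make the datapath widths concrete, the following is a minimal behavioral sketch of one RC performing signed operations on 16-bit operands with a 32-bit output register. The operation names ("add", "mul", "mac") and the wrap-around overflow behavior are assumptions made for illustration; the paper does not specify the opcode encoding or the overflow policy.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class ReconfigurableCell:
    """Hypothetical behavioral model of one RC (signed mode only)."""

    def __init__(self) -> None:
        self.registers = [0, 0, 0, 0]  # register file: four 16-bit registers
        self.output = 0                # 32-bit output register

    def step(self, op: str, a: int, b: int) -> int:
        """Apply one operation to two 16-bit operands and latch the result."""
        a, b = to_signed(a, 16), to_signed(b, 16)
        if op == "add":
            result = a + b
        elif op == "mul":
            result = a * b
        elif op == "mac":
            # Multiply-accumulate: the output register feeds back as third input.
            result = self.output + a * b
        else:
            raise ValueError(f"unknown operation: {op}")
        self.output = to_signed(result, 32)  # assumed wrap-around to 32 bits
        return self.output


def dot_product(xs, ys):
    """Accumulate a dot product with one multiply-accumulate per element."""
    rc = ReconfigurableCell()
    for x, y in zip(xs, ys):
        rc.step("mac", x, y)
    return rc.output
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` accumulates 4 + 10 + 18 over three steps, one per element, which mirrors the single-cycle multiply-accumulate described above.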

3.3 RC Array

The RC Array consists of an 8 × 8 matrix of Reconfigurable Cells. An important feature of the RC Array is its three-layer interconnection network, which is depicted in Figure 3.

Figure 3(a) shows the nearest-neighbor layer that connects the RCs in a two-dimensional mesh. Thus, each RC can access data from any of its row/column neighbors. Figure 3(a) also depicts the second layer, which provides complete row and column connectivity within a quadrant. Therefore, each RC can access data from any other RC in its row/column in the same quadrant. Figure 3(b) shows the third layer, which supports inter-quadrant connectivity. It consists of buses called express lanes, which run along the entire length of rows and columns, crossing the quadrant borders. An express lane carries data from any one of the four RCs in a quadrant's row (column) to the RCs in the same row (column) of the adjacent quadrant.

[Figure 3: (a) the 8 × 8 RC Array divided into quadrants, with the nearest-neighbor mesh and the intra-quadrant row and column connections; (b) the express lanes linking adjacent quadrants.]

Fig. 3. Interconnection layers in the RC Array.
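The three interconnection layers can be illustrated by computing, for a given RC, the set of cells it could read from through each layer. This is a sketch under stated assumptions: cells are addressed as (row, column) pairs, the mesh is assumed not to wrap around at the array edges, and the function names are hypothetical. The 4 × 4 quadrant size follows from the text's "four RCs in a quadrant's row".

```python
N = 8  # the RC Array is 8 x 8
Q = 4  # each quadrant is 4 x 4


def nearest_neighbors(r: int, c: int) -> set:
    """Layer 1: mesh neighbors (assumed: no wrap-around at the edges)."""
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return {(i, j) for i, j in candidates if 0 <= i < N and 0 <= j < N}


def intra_quadrant(r: int, c: int) -> set:
    """Layer 2: every other RC in the same row or column of the same quadrant."""
    r0, c0 = (r // Q) * Q, (c // Q) * Q  # top-left corner of the quadrant
    row = {(r, j) for j in range(c0, c0 + Q)}
    col = {(i, c) for i in range(r0, r0 + Q)}
    return (row | col) - {(r, c)}


def express_lane_sources(r: int, c: int) -> set:
    """Layer 3: RCs in the same row/column of the adjacent quadrant.

    Only one of the four candidate RCs drives a given lane at a time.
    """
    r0, c0 = (r // Q) * Q, (c // Q) * Q
    other_r0, other_c0 = Q - r0, Q - c0  # 0 -> 4 and 4 -> 0
    row = {(r, j) for j in range(other_c0, other_c0 + Q)}
    col = {(i, c) for i in range(other_r0, other_r0 + Q)}
    return row | col
```

A corner cell such as (0, 0) has only two mesh neighbors under the no-wrap assumption, six intra-quadrant sources, and eight candidate express-lane sources.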

3.4 Context Memory

The Context Memory stores the configuration program (context) for the RC Array. The Context Memory is logically organized into two context blocks, each block containing eight context sets. Each context set has sixteen context words.

The major focus of the RC Array is on data-parallel applications, which exhibit a definite regularity. Following this principle of regularity and parallelism, the context is broadcast on a row/column basis. The context words from one context memory block are broadcast along the rows, while context words from the other block are broadcast along the columns. Each block has eight context sets and each context set is associated with a specific row (column) of the RC Array. The context word from a context set is broadcast to all eight RCs in the corresponding row (column). Thus, all RCs in a row (column) share a context word and perform the same operations. Recall that a context word is stored in the context register within each RC (cf. Subsection 3.2).

A context plane is formed by the corresponding context words within each context set across the Context Memory. As there are sixteen context words in a context set, up to sixteen context planes can be simultaneously resident in each of the two blocks of Context Memory.
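The organization of 2 blocks × 8 sets × 16 words and the row/column broadcast can be sketched as follows. The nested-list layout, the block indices `ROW_BLOCK` and `COL_BLOCK`, and the function names are hypothetical choices for illustration; only the dimensions and the broadcast rule come from the text.

```python
ROW_BLOCK, COL_BLOCK = 0, 1  # hypothetical indices for the two context blocks
SETS, WORDS, N = 8, 16, 8    # 8 context sets per block, 16 words per set, 8 x 8 array


def make_context_memory(fill=0):
    """Build an empty Context Memory indexed as memory[block][set][word]."""
    return [[[fill] * WORDS for _ in range(SETS)] for _ in range(2)]


def broadcast(memory, block, plane):
    """Return the 8 x 8 grid of context words delivered to the RCs.

    A context plane takes word number `plane` from each of the eight sets.
    Set s of the row block configures row s; set s of the column block
    configures column s.
    """
    words = [memory[block][s][plane] for s in range(SETS)]
    if block == ROW_BLOCK:
        return [[words[r] for _ in range(N)] for r in range(N)]
    return [[words[c] for c in range(N)] for _ in range(N)]


# Example: tag every word with its coordinates so the broadcast is visible.
memory = make_context_memory()
for b in range(2):
    for s in range(SETS):
        for w in range(WORDS):
            memory[b][s][w] = (b, s, w)

row_grid = broadcast(memory, ROW_BLOCK, plane=3)
col_grid = broadcast(memory, COL_BLOCK, plane=3)
```

In `row_grid` every cell of row 2 holds the same word, while in `col_grid` every cell of column 2 does, which is what lets a whole row or column perform the same operation.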

3.5 Frame Buffer and DMA Controller

The Frame Buffer is an internal data memory logically organized into two sets, called Set 0 and Set 1. Each set is further subdivided into two banks, Bank A and Bank B. Each bank has 64 rows of 8 bytes (therefore, the entire Frame Buffer has 128 × 16 bytes).

A 128-bit operand bus is used to transfer data operands from the Frame Buffer to the RC Array. This bus is connected to the column elements of the RC Array. The cells along an RC Array row share the same 16-bit segment of the operand bus. In this way, eight different operands can be loaded into all cells of an RC Array column in just a single cycle.

In the current MorphoSys prototype, the operand bus operates in interleaved mode. As Figure 4(a) shows, in this mode the operand bus carries one-byte data values from the two Frame Buffer banks in the order A0, B0, A1, B1, ..., A7, B7, where An and Bn denote the nth byte from banks A and B, respectively. Each RC receives two bytes of data, one from Bank A and the other from Bank B. This operation mode is appropriate for some common image processing applications involving template matching, which compare two 8-bit operands.

However, there are application classes that require data transfers in 16-bit units. To accommodate this requirement, the operand bus in the next implementation of MorphoSys will be reconfigurable. This will allow an additional data transfer mode called contiguous mode. As shown in Figure 4(b), in this mode each segment of the operand bus carries two bytes from the same Frame Buffer bank. In this mode, the order of the data values on the operand bus is A0, ..., A7, B0, ..., B7. Each RC receives two consecutive bytes of data from either Frame Buffer Bank A or B.

Results from the RC Array are written back to the Frame Buffer through a separate 128-bit bus, called the result bus. The physical connection of the result bus to the RC Array is similar to that of the operand bus, i.e., 16-bit bus segments running along the rows. Once again, application mapping experience has indicated the need for flexibility in the result bus. In the current MorphoSys prototype, the result bus operates in an 8-bit mode. As illustrated in Figure 4(c), in this mode each cell of an RC Array column provides an 8-bit result, forming a 64-bit word which is written into Frame Buffer Bank A or B. In some applications, however, it is necessary to write back 16-bit data from each RC. To satisfy this requirement, the result bus in the next implementation of MorphoSys will also be reconfigurable, to enable an additional 16-bit mode. As depicted in Figure 4(d), the 16-bit results from the first four cells of an RC Array column will be written into Frame Buffer Bank A, while the 16-bit results from the remaining four cells will be written into Bank B.

[Figure 4: operand bus configuration modes, (a) interleaved and (b) contiguous, showing which bytes of Frame Buffer Banks A and B go to RC Array rows 0 to 7; result bus configuration modes, (c) 8-bit and (d) 16-bit, showing how results from RC Array rows 0 to 7 are written to Frame Buffer Bank A or B.]

Fig. 4. Configuration modes of the operand and result buses.

The DMA controller performs data transfers between the Frame Buffer and the main memory. It is also responsible for loading contexts into the Context Memory. The TinyRISC core processor uses DMA instructions to specify the necessary data/context transfer parameters for the DMA controller.
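The difference between the two operand bus modes comes down to which pair of bytes each RC Array row receives on its 16-bit segment. The sketch below models that mapping only; the function names are hypothetical, and each bank row is represented as an 8-byte `bytes` object.

```python
def interleaved(bank_a: bytes, bank_b: bytes) -> list:
    """Bus order A0, B0, A1, B1, ...: row n receives the pair (An, Bn)."""
    return [(bank_a[n], bank_b[n]) for n in range(8)]


def contiguous(bank_a: bytes, bank_b: bytes) -> list:
    """Bus order A0..A7, B0..B7: row n receives two consecutive bytes of one bank."""
    data = bytes(bank_a) + bytes(bank_b)
    return [(data[2 * n], data[2 * n + 1]) for n in range(8)]


# Example data: Bank A holds 0..7 and Bank B holds 100..107.
bank_a = bytes(range(8))
bank_b = bytes(range(100, 108))

pairs_interleaved = interleaved(bank_a, bank_b)  # row 0 gets (A0, B0)
pairs_contiguous = contiguous(bank_a, bank_b)    # row 0 gets (A0, A1)
```

In contiguous mode rows 0 to 3 are fed entirely from Bank A and rows 4 to 7 entirely from Bank B, which is what allows a row to treat its two bytes as a single 16-bit operand.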

3.6 MorphoSys Execution Flow Model

The execution model of MorphoSys is based on partitioning applications into sequential and data-parallel tasks. The former are handled by the TinyRISC core processor, whereas the latter are mapped to the RC Array. TinyRISC initiates all data transfers involving application and configuration data. RC Array execution is enabled by TinyRISC through one of several special context broadcast instructions. While the RC Array performs computations on data in one Frame Buffer set, fresh data may be loaded into the other set, or the Context Memory may receive new contexts. TinyRISC controls the context broadcast mode and also provides various control/address signals for the Context Memory, the Frame Buffer and the DMA controller.
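The benefit of computing on one Frame Buffer set while the other is being loaded can be estimated with a simple cycle-count model. This is a rough sketch, not a description of the actual M1 timing: it assumes fixed per-block load and compute times, ignores result write-back and DMA setup, and the function names and the numbers in the example are hypothetical.

```python
def serial_cycles(blocks: int, load: int, compute: int) -> int:
    """Cycles when every block is loaded and then processed, with no overlap."""
    return blocks * (load + compute)


def overlapped_cycles(blocks: int, load: int, compute: int) -> int:
    """Cycles when the load of block i+1 overlaps the computation on block i.

    The first load and the last computation cannot be hidden; each step in
    between costs whichever of the two activities takes longer.
    """
    if blocks == 0:
        return 0
    return load + (blocks - 1) * max(load, compute) + compute


# Hypothetical workload: 10 blocks, 8 cycles to load, 20 cycles to process.
without_overlap = serial_cycles(10, load=8, compute=20)
with_overlap = overlapped_cycles(10, load=8, compute=20)
```

When computation dominates, as in this example, nearly all of the load time is hidden behind RC Array execution.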

3.7 Important Features of MorphoSys

MorphoSys is a coarse-grain, multiple-context reconfigurable system with considerable depth of programmability (32 contexts) and two different context broadcast modes. It provides a high degree of flexibility for application mapping by offering two levels of reconfigurability:

1. Operand-type configurability: this allows switching between classes of applications with different data types. It affects two aspects of the system: (1) configuration of the array multiplier in each RC as either a signed or unsigned multiplier and (2) configuration of the operand bus (as either interleaved or contiguous) and of the result bus (as either 8-bit or 16-bit).
2. Functional configurability: this is the short-term, run-time reconfigurability level. It controls RC functionality and RC Array connectivity on a cycle-to-cycle basis.

The hierarchical RC Array interconnection network also contributes to algorithm mapping flexibility. Structures like the express lanes enhance global connectivity. Even irregular communication patterns, which would otherwise require extensive interconnections, can be handled efficiently. For instance, an eight-point butterfly can be accomplished in only three cycles. Finally, bus configurability supports applications with different data sizes and data flow patterns.

MorphoSys is a highly parallel system. Such parallelism is evident not only in the existence of multiple functional elements (the RCs), but also in how information can be moved within the system. First, MorphoSys is dynamically reconfigurable. While the RC Array is executing one of the sixteen contexts in row broadcast mode, the other sixteen contexts for column broadcast can be reloaded in parallel into the Context Memory (or vice versa). Secondly, RC Array computations using data in one Frame Buffer set can proceed in parallel with data transfers from/to the other Frame Buffer set. The internal Frame Buffer and DMA controller, and the adoption of wide datapaths, allow high-bandwidth transfers for both data and configuration information.
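The three-cycle figure for an eight-point butterfly matches the three exchange stages of a radix-2 butterfly network (log2 of 8). The sketch below lists the communication partner of each point at every stage and flags the stages whose exchanges cross the boundary between positions 0 to 3 and 4 to 7, that is, between quadrants. It assumes one stage per cycle and a largest-stride-first ordering; the actual schedule used on MorphoSys is not given in the text, and the function names are hypothetical.

```python
def butterfly_partners(n: int = 8) -> list:
    """Return, for each stage, the partner index of every point (i XOR stride)."""
    stages = []
    stride = n // 2
    while stride >= 1:
        stages.append([i ^ stride for i in range(n)])
        stride //= 2
    return stages


def crosses_quadrant(partners: list, quadrant_size: int = 4) -> bool:
    """True if any exchange in this stage links cells in different quadrants."""
    return any(i // quadrant_size != p // quadrant_size
               for i, p in enumerate(partners))


stages = butterfly_partners(8)
needs_express_lane = [crosses_quadrant(stage) for stage in stages]
```

Under these assumptions only the stride-4 stage needs inter-quadrant transfers, which is where the express lanes come in; the stride-2 and stride-1 stages stay within a quadrant's row or column.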

4 Implementation of MorphoSys

MorphoSys is a tightly-coupled reconfigurable system. The TinyRISC core processor, the RC Array and the remaining components are to be integrated into a single chip. The first implementation of MorphoSys is called the M1 chip. M1 is being designed for operation at a 100 MHz clock frequency, using a 0.35 µm CMOS technology. The TinyRISC core processor and the DMA controller were modeled using VHDL, and CAD tools were used to perform RTL/layout synthesis from this structural description. The other components (RC Array, Context Memory and Frame Buffer) were completely custom designed. The final design will be obtained through the integration of both synthesized and custom parts.

The M1 chip is programmed through a software development environment. There is a SUIF-based C compiler for the TinyRISC core processor and a simple assembler-like parser for context generation. A GUI tool called mView supports interactive programming and simulation. Using mView, the programmer can specify the functions and interconnections corresponding to each context for the application. mView then automatically generates the appropriate context file. As a simulation tool, mView reads a context file and displays the RC outputs and interconnection patterns at each cycle of the application execution.

5 Algorithm Mapping and Performance Analysis

We now provide examples of algorithm mapping to MorphoSys. To demonstrate the flexibility of MorphoSys, the examples presented here are representative of two application classes with diverse characteristics, namely, image processing and data encryption/decryption.

5.1 Image Processing Application: DCT/IDCT

Image processing is a key component of a wide range of applications. We use the Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) as examples typifying this application area. Both transforms are part of the JPEG and MPEG standards. For example, MPEG performs image compression through a combination of DCT and quantization.

Generally, a two-dimensional DCT (2-D DCT) is performed on an 8 × 8 pixel matrix (the size used in most image and video compression standards). A 2-D DCT can be achieved by applying a 1-D DCT to the rows of the pixel matrix, followed by a 1-D DCT on the results along the columns. For high throughput, the eight row (column) 1-D DCTs can be computed in parallel. A commonly used fast one-dimensional DCT (1-D DCT) algorithm [5] operating on an 8-pixel block involves 26 additions and 16 multiplications.

When mapping the 2-D DCT to MorphoSys, the RC multipliers are configured for signed mode, while the operand and result buses are configured for interleaved and 8-bit modes, respectively. For an 8 × 8 pixel matrix, each pixel is mapped to an RC. Coefficients needed for the computation are provided as constants in the context words. To perform eight 1-D DCTs along the rows (columns) of the pixel matrix, context is broadcast along the columns (rows) of the RC Array. As MorphoSys can broadcast context along both rows and columns of the RC Array, the need to transpose the pixel matrix is eliminated, thus saving a considerable number of cycles.

The operand data bus is wide enough to load eight pixels at a time, therefore the entire pixel matrix can be loaded in eight cycles. Once data is in the RC Array, two butterfly operations are performed to compute intermediate variables. As mentioned (cf. Subsection 3.7), inter-quadrant connectivity provided by the express lanes enables one butterfly operation in three cycles. As the butterfly operations are also performed in parallel, only six cycles are necessary to accomplish the butterfly operations for the whole matrix. Row/column 1-D DCTs take 12 cycles. Two additional cycles are used for data re-arrangement. Finally, eight cycles are needed for result write-back. With 8-bit pixels, the throughput of the 2-D DCT algorithm on MorphoSys is given by:

$$\text{throughput} = \frac{(8 \times 8) \times 8}{8 + 6 + 12 + 2 + 8}\ \text{bits/cycle}$$

For a 10 ns clock cycle, the above expression gives a throughput of 1.44 Gbps. Figure 5 shows the relative performance figures for 2-D DCT on MorphoSys and other systems. sDCT is a software implementation written in optimized Pentium assembly code using 64-bit special MMX instructions [6]. REMARC [7] is another reconfigurable system, targeting multimedia applications. V830R/AV [8] is a superscalar multimedia processor. TMS320C80 [9] is a commercial digital signal processor. Execution of DCT/IDCT on MorphoSys results in a speedup of 6X as compared to a Pentium MMX-based system. For the DCT algorithm, MorphoSys yields a throughput much better than that of the considered hardware designs.

[Figure 5: bar chart of execution cycles (axis from 0 to 350) for MorphoSys, REMARC, V830, sDCT and TMS.]

Fig. 5. Performance comparison for DCT/IDCT application.
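The DCT throughput expression can be checked by plugging in the cycle counts listed in this subsection. The short sketch below does only that arithmetic; the variable names are illustrative, and the 10 ns cycle time is the one quoted in the text.

```python
# Cycle budget for one 8 x 8 block, as itemized in the text.
LOAD_CYCLES = 8        # eight pixels per cycle over the operand bus
BUTTERFLY_CYCLES = 6   # two butterfly operations of three cycles each
DCT_CYCLES = 12        # row/column 1-D DCTs
REARRANGE_CYCLES = 2   # data re-arrangement
WRITEBACK_CYCLES = 8   # result write-back

BITS_PER_BLOCK = 8 * 8 * 8  # 64 pixels of 8 bits each
CLOCK_PERIOD_NS = 10        # 100 MHz clock

total_cycles = (LOAD_CYCLES + BUTTERFLY_CYCLES + DCT_CYCLES
                + REARRANGE_CYCLES + WRITEBACK_CYCLES)
bits_per_cycle = BITS_PER_BLOCK / total_cycles
# bits per nanosecond is numerically equal to Gbit/s
throughput_gbps = bits_per_cycle / CLOCK_PERIOD_NS
```

With these inputs the block takes 36 cycles and the throughput works out to roughly 14.2 bits per cycle, or about 1.42 Gbps at 100 MHz, which is close to the 1.44 Gbps quoted above.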

5.2 Data Encryption/Decryption Application: IDEA

Today, data security is a key application domain. The International Data Encryption Algorithm (IDEA) [10] is a typical example of this application class. IDEA processes plaintext data (i.e., data to be encrypted) in 64-bit blocks with a 128-bit encryption/decryption key. The algorithm performs eight iterations of a core function. After the eighth iteration, a final transformation step produces a 64-bit ciphertext (i.e., encrypted data) block. The algorithm uses 52 16-bit sub-keys, generated from the 128-bit key. IDEA employs three operations: bitwise exclusive-or, addition modulo $2^{16}$ and multiplication modulo $2^{16} + 1$.

When mapping IDEA to MorphoSys, each RC multiplier is configured in the unsigned mode. The operand bus and the result bus are configured for the contiguous and 16-bit modes, respectively. As the encryption/decryption key does not change frequently, the sub-keys are generated externally and then loaded once into the Frame Buffer sets.

Some operations of IDEA's core function can be performed in parallel, while others must be performed sequentially due to data dependencies. The maximum number of operations that can be performed in parallel is four. In order to exploit this parallelism, clusters of four cells in the RC Array columns are allocated to operate on each plaintext block. Thus, the whole RC Array can operate on sixteen plaintext blocks in parallel.

Two 64-bit plaintext blocks can be transferred simultaneously using the 128-bit operand bus. Thus, it takes only eight clock cycles to load sixteen plaintext blocks into the entire RC Array. Each iteration of the core function takes seven clock cycles to execute in a cell cluster. The final transformation step needs one additional cycle. Once the ciphertext blocks have been produced, it is necessary to write them back to the Frame Buffer before loading the next plaintext blocks. This takes another eight cycles. Therefore, the performance of the IDEA algorithm as mapped to MorphoSys is given by:

$$\text{throughput} = \frac{16 \times 64}{8 + (8 \times 7) + 1 + 8}\ \text{bits/cycle}$$

For a 10 ns clock cycle time, this expression gives a throughput of 1.4 Gbps. Figure 6 shows the relative performance of MorphoSys and other implementations of the IDEA algorithm. sIDEA is a software implementation on a Pentium II processor. HiPCrypto [10] is an ASIC chip that implements IDEA in hardware. It exploits the parallelism available in the IDEA algorithm by using multiple functional units in a seven-stage pipeline.

[Figure 6: bar chart of cycles (axis from 0 to 12) for sIDEA, HiPCrypto and MorphoSys.]

Fig. 6. Performance comparison for the IDEA encryption/decryption algorithm.

In order to factor out technology-related aspects from the comparison, performance is measured as the number of cycles necessary to obtain a ciphertext block. The IDEA algorithm running on MorphoSys is more than two times faster than running on an advanced superscalar processor. The pipeline of the HiPCrypto chip finishes 7 ciphertext blocks in every 49 cycles (assuming that the pipeline is continuously full). In MorphoSys, the RC Array provides 16 ciphertext blocks in 73 cycles. Thus, MorphoSys is also faster than a single HiPCrypto chip.
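Two points in this subsection lend themselves to a short worked sketch: the modular operations that make unsigned 16-bit multiplication necessary, and the cycle arithmetic behind the throughput and the comparison with HiPCrypto. The function and variable names are illustrative. The treatment of the value 0 as $2^{16}$ in the multiplication is the standard IDEA convention and is not spelled out in the text above.

```python
MASK16 = 0xFFFF


def add_mod(a: int, b: int) -> int:
    """Addition modulo 2**16 on 16-bit unsigned operands."""
    return (a + b) & MASK16


def mul_mod(a: int, b: int) -> int:
    """Multiplication modulo 2**16 + 1 (standard IDEA convention: 0 means 2**16)."""
    a = a or 0x10000
    b = b or 0x10000
    # A product of 2**16 is mapped back to 0 by the 16-bit mask.
    return ((a * b) % 0x10001) & MASK16


# Cycle budget for sixteen 64-bit blocks, as itemized in the text.
load, iterations, per_iteration, final, writeback = 8, 8, 7, 1, 8
total_cycles = load + iterations * per_iteration + final + writeback
bits = 16 * 64

bits_per_cycle = bits / total_cycles
throughput_gbps = bits_per_cycle / 10        # 10 ns clock cycle
morphosys_cycles_per_block = total_cycles / 16
hipcrypto_cycles_per_block = 49 / 7          # full seven-stage pipeline
```

The totals reproduce the 73 cycles and roughly 1.4 Gbps quoted above, and give about 4.6 cycles per ciphertext block for MorphoSys against 7 for a single HiPCrypto chip.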

6 Concluding Remarks

In this paper, we described the MorphoSys parallel reconfigurable system. We also presented results assessing the performance of MorphoSys for important applications in the image processing and data security domains. For the DCT and IDEA algorithms, MorphoSys is significantly faster than software implementations running on high-performance, superscalar general-purpose microprocessors. Moreover, MorphoSys delivers performance close to or better than that provided by some ASIC implementations of those two algorithms. Overall, this paper demonstrates that the combination of general-purpose processors with reconfigurable hardware blocks represents a potential design paradigm to address the performance needs of future applications and of the microprocessors of the next decade.

7 References

[1] W. H. Mangione-Smith et al., Seeking Solutions in Configurable Computing, IEEE Computer, Dec. 1997, pp. 38-43.
[2] M. Gokhale et al., Building and Using a Highly Programmable Logic Array, IEEE Computer, Jan. 1991, pp. 81-89.
[3] P. Bertin, D. Roncin, J. Vuillemin, Introduction to Programmable Active Memories, in Systolic Array Processors, J. McCanny Ed., Prentice-Hall, Englewood Cliffs, NJ, 1989.
[4] E. Mirsky and A. DeHon, MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources, Proc. of IEEE Symposium on FPGAs for Custom Computing Machines, 1996, pp. 157-166.
[5] W-H Chen, C. H. Smith, S. C. Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, IEEE Trans. on Communications, Vol. 25, No. 9, Sept. 1977, pp. 1004-1009.
[6] Application Notes for Pentium MMX, http://developer.intel.com/drg/mmx/appnotes.
[7] T. Miyamori, K. Olukotun, A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications, Proc. of the IEEE Symposium on Field Programmable Custom Computing Machines, 1998.
[8] T. Arai et al., V830R/AV: Embedded Multimedia Superscalar RISC Processor, IEEE Micro, Mar./Apr. 1998, pp. 36-47.
[9] F. Bonimini et al., Implementing an MPEG2 Video Decoder Based on TMS320C80 MVP, SPRA 332, Texas Instruments, Sept. 1996.
[10] S. Salomao, V. Alves, E. C. Filho, HipCrypto: A High Performance VLSI Cryptographic Chip, Proc. of the 1998 IEEE Conference on Application-Specific Integrated Circuits, pp. 7-13.
