A design study of a 0.25-μm video signal processor


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 8, NO. 4, AUGUST 1998


A Design Study of a 0.25-μm Video Signal Processor

Santanu Dutta, Member, IEEE, Kevin J. O'Connor, Senior Member, IEEE, Wayne Wolf, Fellow, IEEE, and Andrew Wolfe, Member, IEEE

Abstract—This paper presents a detailed design study of a high-speed, single-chip architecture for video signal processing (VSP), developed as part of the Princeton VSP Project. In order to define the architectural parameters by examining the area and delay tradeoffs, we start by designing parameterizable versions of key modules, and we perform VLSI modeling experiments in a 0.25-μm process. Based on the properties of these modules, we propose a VLIW (very long instruction word) VSP architecture that features 32–64 operations per cycle at clock rates well in excess of 600 MHz, and that includes a significant amount of on-chip memory. VLIW architectures provide predictable, efficient, high performance, and benefit from mature compiler technology. As explained later, a VLIW video processor design requires flexible, high-bandwidth interconnect at fast cycle times, and presents some unique VLSI tradeoffs and challenges in maintaining high clock rates while providing high parallelism and utilization.

Index Terms—Circuit simulation, crossbar network, design tradeoffs, video signal processing, VLIW architecture.

I. INTRODUCTION

VIDEO SIGNAL processing (VSP) requires very high computation rates, and a growing number of applications (complex compression techniques, video games, etc.) also require sophisticated algorithms. While early video processing systems relied on hardwired modules to provide performance, advances in VLSI technology are making programmable video signal processors possible. To justify the effort involved in designing a programmable processor, a VSP architecture should provide very high performance over a wide range of video applications. It should be efficiently programmable in high-level languages by application-oriented developers, and it should take advantage of available parallelism within VSP applications with minimal intervention from programmers. The fundamental questions in VSP architecture are whether programmable architectures can deliver the required performance and, if so, what architectural features and parameters are necessary to achieve both performance and flexibility. The work described in this paper is part of a university research project—the Princeton VSP project—whose goal is to define, design, and evaluate a parallel architecture and compilation tools for high-performance, programmable video signal

Manuscript received February 26, 1997; revised October 27, 1997. This work was supported by the National Science Foundation under Contract MIP-9408462 and by Philips Semiconductors. The work of A. Wolfe was supported in part by ONR, DARPA, NSF, Motorola, AT&T, and the State of New Jersey. This paper was recommended by Associate Editor N. Demassieux. S. Dutta is with Philips Semiconductors, Sunnyvale, CA 94088 USA. K. J. O'Connor is with Lucent Technologies, Murray Hill, NJ 07974 USA. W. Wolf is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA. A. Wolfe is with S3 Inc., Santa Clara, CA 95052 USA. Publisher Item Identifier S 1051-8215(98)05765-6.

processors. We are using an application-driven, simulation-based methodology for comparing various architectural alternatives. In order to do so, we need to characterize the types of applications that will benefit from using a programmable video processor. Our first set of benchmark applications concentrates on commonly used compression methods, while we gather additional application benchmarks for later experiments. A critical factor in designing new architectures is developing an understanding of the feasible implementation tradeoffs in order to bound the design space for experiments. Since we are exploring microarchitectures that are considerably more parallel than anything currently available and will be implemented in an advanced, state-of-the-art IC technology, it is necessary to derive cost (area) and performance models for the components of the microarchitectures under consideration. In order to define the architectural parameters by examining the area and delay tradeoffs, we have designed parameterizable versions of key modules and have performed VLSI modeling experiments in a 0.25-μm process. This paper analyzes the results of our experiments and proposes the design of a VSP architecture that features 32–64 operations per cycle at clock rates well in excess of 600 MHz. The remainder of the paper is organized as follows. Section II points to some existing VSP architectures, discusses the motivations for using a VLIW-based architectural model for VSP, and enumerates the critical research issues. Section III presents various technology-based simulation results for critical components within a VLIW video processor, and derives VLSI performance models for these modules. Based on the simulation results, Section IV proposes a VLIW architectural framework and analyzes different design tradeoffs. Section V draws conclusions and proposes some directions for future research.

1051–8215/98$10.00 © 1998 IEEE

II. ALTERNATIVE ARCHITECTURAL MODELS FOR VSP's

We have decided to focus on the design of a programmable VSP architecture that can support a whole range of interesting applications. Programmable processors not only provide an economy of scale, where the development costs can be amortized over many different types of applications, but are also invaluable as a research and development tool—they motivate the development of new types of applications, and facilitate commercialization by providing a low-volume mechanism for implementation during the early phases of a product's life cycle. We believe that VLSI technology, by the end of this decade, will be able to fabricate high-performance, real-time, programmable VSP chips, where mainstream CMOS technology (0.25-μm) will provide both the density for interesting VSP architectures and the speed to make programmable solutions competitive enough to coexist with dedicated VSP chips. A number of companies—both major semiconductor houses and start-ups—have already been designing such programmable processors for the last several years. Notable among them are the multimedia video processor (MVP) [1] chip from TI and the VSP-1 [2] chip from Philips. Note that, while most of the earlier VSP architectures were multiprocessor designs, the more recent ones attempt to exploit the fine-grained parallelism inherent in the application and, hence, resort to long instruction words. The Philips Trimedia chip (TM-1) [3], for example, is a 100-MHz multimedia engine with a very long instruction word; it is a system on a chip with a powerful VLIW core surrounded by an array of autonomous DMA units. The Mpact processor [4] is another recent media-processor architecture, featuring a VLIW instruction format with multiple arithmetic units. The Mpact processor, however, has a proprietary instruction set, and so some details of the architecture are not publicly available. The architectural design described in this paper differs from the TM-1 and other available VLIW-based VSP chips in several ways: our design targets much higher clock speeds, we concentrate on the core processor (without any peripherals), and we consider architectures with much more on-chip memory. We have set the following development goals for our design:

• high performance on a single chip (well above a 600-MHz clock rate with about 64 operations per cycle);
• a 0.25-μm CMOS technology;
• efficient programmability using high-level languages;
• the ability to meet hard real-time deadlines;
• applicability to a very broad range of VSP applications;
• cost effectiveness;
• scalability over a range of performance levels within the same architectural paradigm;
• usability in multiprocessing systems.

Our initial study of the above issues indicates that VLIW architectures, indeed, are excellent candidates for meeting these design goals.
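As a quick sanity check of these targets, the arithmetic below (ours, not the paper's) computes the peak throughput implied by the stated goals of a 600-MHz clock and about 64 operations per cycle, and the single-issue clock rate that would be needed to match it.

```python
# Back-of-envelope check of the stated design goals (our arithmetic,
# using the paper's 600-MHz / 64-ops-per-cycle targets as inputs).

clock_hz = 600e6      # target clock rate ("well above 600 MHz")
ops_per_cycle = 64    # target operations per cycle

peak_ops_per_s = clock_hz * ops_per_cycle

print(f"peak throughput: {peak_ops_per_s / 1e9:.1f} GOPS")
# A single-issue processor would need this clock rate for the same throughput:
print(f"equivalent single-issue clock: {peak_ops_per_s / 1e9:.1f} GHz")
```

The result, roughly 38 GOPS, makes the case quantitative: no single-issue 0.25-μm design can come close, so wide instruction-level parallelism is the only route to the target.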
It is highly unlikely that any 0.25-μm CMOS system can come anywhere near a 10-GHz clock rate; therefore, it is critical to execute multiple operations per cycle to reach the required performance level. In fact, it will require tens of operations per cycle to achieve the desired performance—far more than any current superscalar microprocessor provides. Our models, however, indicate that it will be feasible to reach the required level of parallelism and performance using a VLIW paradigm. To attain the required performance, the architecture must take advantage of the parallelism inherent in the applications. While many of the traditional kernels in VSP applications contain much loop-level parallelism, some of the more aggressive modern algorithms do not. VLIW compilers and architectures can exploit both the data parallelism in regular loops and the instruction-level parallelism in less structured code. More radical VLIW variants such as XIMD [5] can also exploit task-level parallelism to further increase utilization. We believe that VLIW architectures can provide higher performance on a wider variety of applications compared to

some other approaches, particularly when based on compiled code. This is because:

• VLIW compilation methods are relatively mature;
• VLIW systems provide a flexible framework for developing new domain-specific parallelization algorithms;
• the flexibility of VLIW architectures encourages application developers to think about advanced algorithms and data structures that are not effective on more restrictive parallel architectures.

Many of the issues that have inhibited widespread adoption of VLIW architectures for general-purpose computing are not relevant to VSP. Binary compatibility, with existing architectures and among architectural generations, has not traditionally been important for programmable signal-processing chips. Long compile times are unimportant, since most users will not develop their own VSP code. Adequate parallelism to keep many functional units (FU's) busy has been a problem in some applications, but VSP applications appear to have an abundance of parallelism at several levels. Implementation complexity has slowed the development of VLIW architectures in the past, but IC density has increased to the level that single-chip, high-performance VLIW processors are practical today. One issue, however, that will continue to be a problem for VLIW processors in the VSP domain is poor code density. We have not yet addressed this issue, but it may be critical for practical systems. It is also unclear whether a VLIW architecture provides the best cost/performance tradeoffs for any particular application. Less general approaches may be more effective, but this will be impossible to evaluate until complete chips and systems are designed.

III. ANALYSIS OF VLSI CONSTRAINTS

While we are focused on the specific requirements for VLIW video signal processors, much of the basic research is applicable to other advanced VLIW architectures as well. One of the main characteristics of our architecture is the need for a very large number of function units. Most of the recent work in VLIW architectures has focused on general-purpose applications, and thus has concentrated on achieving maximum utilization from a moderate number of function units. One needs to refer back to some of the earliest VLIW work [6] to find in-depth evaluations of the problems of very wide VLIW architectures. These early machines were primarily designed to execute scientific code that had abundant parallelism, much like the VSP code today. In fact, we are likely to reexplore many of the same issues as the designers of the ELI and the TRACE 28/300 [7], but with radically different technological constraints. For this reason, we have focused most of our initial experiments on an attempt to quantify these technological constraints in the context of digital video processing.

A. Architectural Framework and Modeling Methodology

Our overall goal for the design of a programmable VSP chip is quite general. Given a fixed amount of silicon and a fixed implementation technology, we want to get the highest possible performance (in terms of both cycle time and operations per cycle) for a wide range of VSP applications based on


Fig. 1. Idealized VLIW model.

compiled code. It is difficult to evaluate such a broad criterion during the development process, but we have attempted to abstract it into some simpler design principles. Since most VSP applications have quite a bit of available parallelism, we want to fit as many functional units as possible on the die, while attempting to maintain very high clock rates and as much generality and homogeneity in the architecture as possible. Unfortunately, these are often incompatible desires. Increasing the number of FU's generally slows down the clock cycle. Furthermore, global structures, which enhance the schedulability of the architecture for the compiler, often increase cycle time. The goal of our technology-based modeling experiments is to obtain some quantitative information on which to base these design tradeoffs. We need to give up global structures where they have a serious impact on cycle time or where they require so much die area that they preclude including more FU's. Fortunately, VSP applications show significant locality if properly scheduled, and low-latency interconnect does not seem to be as important as it is in general-purpose code. On the other hand, we need to maintain a simple model for compilation. Much of the published research on VLIW architecture and compilation has been based on a global-register-file architecture, as shown in Fig. 1. Multiple FU's access operands directly from a multiported register file. The register file serves two functions in the architecture—temporary storage for operand data and an interconnect mechanism between FU's. Each FU consumes three register-file ports in order to facilitate two operand reads and one result write-back in the same cycle. Such an architecture is ideal for the compiler. Random access with uniform delay is provided to all operands in the processor from all of the operators. This gives the compiler the freedom to move any operation from one FU to another without retiming the schedule.
Latency can be minimized for many complex calculations with dense data-dependency graphs. Unfortunately, it is very difficult to build large global register files. We are aware of two attempts, from CMU [8] and IBM [9], to design and fabricate large, multiported register-file chips; both describe register files with eight read ports and eight write ports. The intent is to double these chips to create 24-ported register files for VLIW architectures. However, even in a 0.25-μm technology, it is very unlikely that any such


register-file implementation will reach the less-than-2-ns cycle time that is of interest in our current project. Furthermore, large register files tend to slow down as a function of the square of the number of ports, and a VLIW VSP chip with 16–64 FU's is likely to require 48–192 register-file ports. At the same time, the register file should have enough registers so that each FU has a reasonable number of active registers to work with. Our conclusion is that such large global register files are infeasible for VLIW architectures at the speed and level of parallelism of interest in video signal processing. There have been other credible efforts to build single-chip VLIW architectures, but with far fewer FU's. The VIPER chip [10] from UC Irvine includes four pipelined FU's that share an eight-ported, global register file. The data cache is accessible from two of the FU's. This chip provides a great deal of global connectivity, but at a moderate clock rate (25 MHz in a 1.2-μm technology) and with a small number of FU's. The other VLIW chip, LIFE1 from Philips [11], uses a crossbar network with local, two-ported register files to provide connectivity among seven FU's. This poses some scheduling hazards to the compiler, but the resulting chip is noticeably faster (50 MHz) in a similar (1.2-μm) technology. Given the high-speed requirements of VSP, we have eliminated the possibility of using a global register file, and have examined schemes based on local registers and global interconnect. Fig. 2 shows the floorplan and the general concept of the proposed VLIW VSP data path. Clusters of FU's are connected by a global interconnect network. Each functional-unit cluster (node) provides some number of operators and local registers, as well as local data memory. It seems most logical to organize each node as one or more FU's connected to a multiported register file, where each FU comprises one or more arithmetic units connected to multiple banks of local data memory.
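The port arithmetic behind this conclusion is easy to reproduce. The sketch below uses the three-ports-per-FU rule and the quadratic delay trend stated above; the normalization to a 16-port file is our own illustrative choice, not a measured baseline.

```python
# Why a single global register file does not scale (our sketch; the
# 3-ports-per-FU rule and the "delay grows as the square of the port
# count" trend are from the text; the 16-port baseline is a placeholder).

PORTS_PER_FU = 3  # two operand reads + one result write-back per cycle

def register_file_ports(num_fus):
    """Total ports a shared global register file would need."""
    return PORTS_PER_FU * num_fus

def relative_delay(ports, baseline_ports=16):
    # Delay scaling ~ (number of ports)^2, normalized to a 16-port file.
    return (ports / baseline_ports) ** 2

for fus in (16, 32, 64):
    p = register_file_ports(fus)
    print(f"{fus} FUs -> {p} ports, ~{relative_delay(p):.0f}x the delay of a 16-port file")
```

With 64 FU's the file needs 192 ports and, under the quadratic trend, would be two orders of magnitude slower than a modest 16-port design, which is why clustering with local register files is the only workable option here.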
The global interconnect network provides statically scheduled communications between the clusters. The local memory communicates with the external world (memory) via a DMA-controlled memory bus. Control flow is typical of any VLIW architecture, with a distributed instruction register directly controlling all operations and resource allocation. Instructions are supplied to the machine through a distributed instruction cache. Sixteen-bit integers are the only supported native data type. With the basic architecture defined, we now need to consider some of the key architectural design tradeoffs. What is the best topology for the interconnect network, and how many nodes can we afford to support? How much local storage can be provided at each node, both in the form of registers and of memory? How many, and what type of, FU's should be placed at each node? How many ports can each local register file support? What other local interconnect is required within the node in addition to the register file? What is the pipeline structure of the network and of each local node? Designing the best detailed architecture for an actual system requires simulating real applications with a real compiler; however, the design space for this type of VLIW chip is huge, and the tradeoffs are difficult. Heavily interconnected global structures increase the effective use of parallelism and reduce cycle count, but they can increase cycle time. High-speed


Fig. 2. VLIW VSP general data-path model.

circuits reduce cycle time, but increase area, reducing the number of parallel-processing (functional) units that can fit on a chip. In order to obtain quantitative information about the essential elements of a VLIW VSP architecture, we have set up a number of technology-based experiments, and have designed our own modules for key storage and interconnect components. These designs have been used to compute their area requirements and simulate their timing behavior in order to understand various design tradeoffs for performance optimizations. The process begins with the design of a typical layout for each module and extraction of the key physical parameters. These parameters are then used to construct C-based simulation-model generators. The generators are used to create circuit-simulation decks (for various components) that are simulated to obtain the timing data. The area data are obtained from empirical equations that we derived; these equations are functions of layout and circuit parameters, and are used by the simulation-model generators to predict how the


designs scale as their requirements are modified—for example, how the area changes when the size of a memory is doubled. We have used a 0.25-μm experimental process for our logic design, circuit design, and layout. All simulations have been performed at 3.0 V and nominal temperatures using AT&T's ADVICE circuit simulator. We have been fairly conservative in our designs, and have used only two levels of metal for all of our module designs, reserving the upper two layers for intermodule connections and for running power rails. The next several sections outline our designs, and analyze the circuit-simulation results for some of the key VSP modules. The analysis provides area and cycle-time information that defines the feasible architectural design space. This information can be used by a prototype simulator to experiment with different architectural tradeoffs.

B. Modeling, Evaluation, and Design of the Global Interconnect

Even though a common register file, shared by different functional-unit clusters, can be used as the primary network¹ interconnecting the clusters, we believe that a single global register file is not a feasible solution for both operand storage and global interconnect, and it is probably better to use the register file as a secondary network providing local interconnection among the FU's in each cluster. For global interconnection purposes, therefore, a different network structure must be implemented. Since we intend to implement fine-grain parallelism on VLIW architectures, low latency through the network is of particular importance. Also, since the entire machine will stall when there are resource conflicts in the network, there is strong motivation to avoid such conflicts. The choice of the network is, therefore, a critical step in the design process. Network topology is a relatively well-understood problem in the world of parallel computing using multiprocessors.
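Before weighing the candidate topologies, it helps to see the basic cost asymmetry. The sketch below uses generic textbook switch-count and stage-count figures for a crossbar and for a multistage omega network of 2×2 switches; these are our illustrative numbers, not the paper's measured data.

```python
# Generic topology cost comparison (our illustration, not the paper's
# measured results): a single-stage N x N crossbar vs. a multistage
# omega network built from 2x2 switches.
import math

def crossbar_cost(n):
    # One crosspoint per (input, output) pair; a single switching stage
    # end to end, so latency does not accumulate across stages.
    return {"switch_points": n * n, "stages": 1}

def omega_cost(n):
    # log2(N) stages, each with N/2 two-by-two switches; cheaper in
    # switches, but every stage adds switch and wire delay.
    stages = int(math.log2(n))
    return {"switch_points": (n // 2) * stages, "stages": stages}

for n in (8, 16, 32, 64):
    print(n, crossbar_cost(n), omega_cost(n))
```

The tension this exposes—fewer switches but more serial stages for the multistage network—is exactly the latency-versus-area tradeoff examined next.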
We, however, are interested in identifying or designing a network that is suitable for a VLIW architecture. The difference between a network-based multiprocessor and an instruction-level parallel VLIW processor is somewhat subtle. A multiprocessor consists of numerous independent processors connected by either a synchronous or an asynchronous network. When a similar network is used to connect synchronously clocked processing elements, and all operations and network traffic are statically scheduled, the resulting processor has all the characteristics of independence architectures [12] such as VLIW. It is, however, important to reevaluate the choice of network topology for VSP applications. In a different paper [13], we evaluated some interconnection networks, including crossbar and several multistage interconnect topologies, and discovered that, in most cases, multistage interconnection networks have longer latency than single-stage (crossbar) networks at the level of complexity that we modeled. Furthermore, it is much more difficult to lay out the multistage networks than the crossbar, and thus they consume more area. On the other hand, it is quite

¹Such a hardware model, consisting of four FU's and a branch unit connected to a common register file, has been suggested by Fisher [12].


easy to pipeline the multistage networks, making them useful when fine-grain parallelism is not critical. From a system-level perspective, it is preferable to use the crossbar network, since it reduces the complexity of scheduling operations at compile time. Unfortunately, the delay and the area requirements of the crossbar network tend to scale nonlinearly with the number of ports, whereas the multistage networks tend to scale at a lesser rate. We, however, believe that crossbar networks map well to single-chip CMOS systems, and thus will be efficient up to a relatively large size. The key issue is whether the cost of a crossbar network is acceptable at the level of integration that we are considering. In order to understand the design implications, we have simulated different approaches to designing crossbar networks, and have finally decided on a multiplexer-based network design with a folded layout. The details of the design are discussed next.

1) Multiplexer-Based Crossbar Design: This design is based on binary-tree crosspoint selection [14]–[16], resulting in improved switching speeds. Of the four different designs of the selector (crossbar cell) given in Choi's paper [14], we have selected the one based on 2-to-1 multiplexers. Our crossbar design, therefore, features a [(log2 N) − 1]-stage multiplexer at each output line to provide switching, decoding, and buffering. A typical multiplexer-based network cell and a single-bit version of an eight-port network are shown in Fig. 3. Different rows are connected to each column with a tree structure, realized with a vertical arrangement of multiplexer-based cells. The main feature of this arrangement is that each cell's output is connected to only one input (of the next cell in the tree), thus minimizing the capacitive loading associated with each input. Due to the light loading, each of the selected inputs ripples through a cascade of selectors/multiplexers with minimal delay.
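The selection logic of one output column can be sketched functionally. This is our behavioral model of the tree-of-2-to-1-multiplexers idea described above, not the paper's circuit netlist; cell and signal names are ours.

```python
# Behavioral sketch (ours) of one output column of a multiplexer-based
# crossbar: a binary tree of 2-to-1 mux cells reduces N row inputs to
# one column output, steered by log2(N) select bits.

def mux2(a, b, sel):
    """2-to-1 multiplexer cell: returns a when sel is 0, b when sel is 1."""
    return b if sel else a

def crossbar_column(inputs, select_bits):
    """Route one of N row inputs to the column output.

    len(inputs) must be a power of two; select_bits holds log2(N) bits,
    ordered from the leaf stage to the root stage (LSB of the port index first).
    """
    level = list(inputs)
    for sel in select_bits:
        # Each mux stage halves the number of surviving signals.
        level = [mux2(level[i], level[i + 1], sel) for i in range(0, len(level), 2)]
    return level[0]

# Eight row inputs; select bits 1,0,1 (LSB first) pick port 0b101 = 5.
rows = [f"in{i}" for i in range(8)]
print(crossbar_column(rows, [1, 0, 1]))  # -> in5
```

Because each cell drives exactly one downstream mux input, the model mirrors the light-loading property the text emphasizes: delay grows with tree depth, not with N itself.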
Note that, in order to realize a b-bit network, the single-bit design (in Fig. 3), with the exception of the decoders, is replicated b times, one below the other; in such a situation, however, adjacent columns must be spaced further apart in order to allow the b output lines to pass through and come out at the bottom. Note that each network cell in this new design requires a constant number of gates with constant fan-in. For an N-port 1-bit network, the design requires at least (2 log2 N + 1) vertical lines per column, because there are log2 N select lines, log2 N vertical channels needed to connect the multiplexers in a column, and one extra vertical line per port for each output bit. For our implementation, however, we have used 2 log2 N control lines per column.² An N-port b-bit network, therefore, features N columns with (3 log2 N + b) vertical lines per column.

2) Crossbar Folding: All of our previous discussions pertain to Fig. 3, and assume that the network inputs and outputs are available at the periphery—the inputs come from the left, and the outputs exit at the bottom. Note, however, that this may not be the design of choice—we want each functional-unit cluster in our design to write its output on only one data

²Both (true and complement) rails are needed for routing the control signals because the multiplexers in each cell are implemented using CMOS transmission-gate switches. An alternate implementation can have log2 N control lines and local inverters to obtain the complemented signal at each cell, but we believe that running one extra vertical channel per control bit is more area efficient than local inversion in all of the cells.
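The wiring-cost arithmetic above can be sketched directly. The formula is the text's; the code and the example sizes are ours.

```python
# Sketch of the column-wiring arithmetic quoted in the text: for an
# N-port, b-bit crossbar, each of the N columns carries 3*log2(N) + b
# vertical tracks (2*log2(N) true/complement control rails, log2(N)
# mux-linking channels, and b output lines).
import math

def vertical_lines_per_column(n_ports, bits):
    lg = int(math.log2(n_ports))
    return 3 * lg + bits

def total_vertical_lines(n_ports, bits):
    # One column per output port.
    return n_ports * vertical_lines_per_column(n_ports, bits)

for n in (8, 16, 32, 64):
    print(f"{n} ports, 16-bit: {vertical_lines_per_column(n, 16)} tracks/column, "
          f"{total_vertical_lines(n, 16)} tracks total")
```

The near-linear growth in tracks per column (the log term) combined with the linear growth in columns is one concrete source of the nonlinear area scaling noted earlier.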


Fig. 3. A single-bit design of an eight-port crossbar network. (a) A cell based on 2-to-1 multiplexers. (b) A crossbar circuit based on 8-to-1 multiplexers (as columns) built from 2-to-1 multiplexers.

line (using a fixed connection), but to be able to selectively read (input) from any one of the data lines³ (using the crossbar switches). In order to facilitate such reading and writing, the most convenient floorplan of the processor architecture is one that has clusters of FU's placed on both sides of the crossbar, as shown in Fig. 4, driving signals from the center rather than routing them to the periphery.⁴ If we try to incorporate the above layout idea in the design of the crossbar, we find that the width of a 16-bit crossbar becomes prohibitive because of the need to accommodate both the input and the output data lines (to and from the crossbar, respectively) as vertical routings. Such an area-inefficient design will also make the delay performance of the network unacceptable and so, while searching for a method to alleviate the problem, we have come up with a novel idea of folding the crossbar so that its area and delay remain quite reasonable, even when the functional-unit clusters are distributed around a central network structure. The basic idea behind the folding technique, as illustrated in Fig. 5, is to reflect every alternate multiplexer tree (column) about its vertical axis so that the vertical routing channels for the input and the output lines can be shared (as highlighted by the shaded regions in Fig. 5) between adjacent columns (for example, between the top and bottom clusters). The delay and area characteristics of the folded-crossbar design for different network sizes, network bit widths, and driver sizes are presented in Tables I–VI. For our simulations, we have used 1.7-μm transistors for both the NMOS and

³For a b-bit network, the data lines are b bits wide.
⁴The example cited in Fig. 4 is for illustration only and, therefore, comprises a 2-bit-wide data path.


Fig. 4. Relative positioning of crossbar and cluster layouts.

the PMOS transistors constituting each multiplexer. The input drivers and the inverter buffer at the output of each multiplexer comprise strong transistors Wns and Wps, where Wps = 2Wns. The widths of these strong transistors are varied in order to understand the impact of large drivers on the crossbar delay. The simulation is performed as shown in Fig. 6. From Tables I–VI, we make certain key observations. Cycle times under 1.0 ns can be supported with up to 16 ports, but the delay increases quickly to about 1.5–2.0 ns at 32 ports and about 2.5–5.0 ns at 64 ports. Once the crossbar becomes the critical path, additional ports to support increased parallelism barely compensate for the loss in cycle time. The area requirements for the crossbar are relatively insensitive to the transistor sizes within the range of interest. Crossbars under 32 ports require very little area for a key central architectural structure; in fact, they may be so small that additional routing, as shown in Fig. 7, is probably required to connect to the functional-unit clusters. If such additional routing is used, its delay would need to be added into the crossbar delay, a functional-unit output stage, or an additional pipeline stage. Even though we have neglected this routing delay for our initial design considerations, we have performed a number of simulation experiments in order to get an idea

of the wiring delay; the results of this experiment, however, are not included due to page-space limitations. Another point worth noting is that the network delay decreases only up to a certain point as the driver sizes are increased. Beyond this point, the network delay often increases with driver size, due to the increased capacitive loading contributed by the drivers themselves.

3) Verification of Crossbar Simulation Results: To demonstrate the accuracy of our simulation model and methodology, we show our layout design of a multiplexer-based eight-port crossbar switch in Fig. 8⁵ and compare, in Table VII, the actual layout simulation results with the numbers obtained from our simulation model.

C. Modeling and Design of Local Register File

Although the crossbar network fulfills the need for global interconnect, there is still the need for operand storage. According to our design, local multiported register files are needed in each functional-unit cluster to supply operands to the arithmetic units, and the overall performance of the VLIW architecture is directly related to the total operand

⁵Note that this particular layout corresponds to a crossbar design that is not folded.


Fig. 5. Folded version of a multiplexer-based 2-bit eight-port crossbar network.

TABLE I
DELAY (ns) FOR 8-BIT CROSSBAR SWITCHES

bandwidth from these register files. If operand bandwidth were the only concern, then many small register files would be optimal; however, there are other considerations. Since there is a practical limit on the number of nodes (clusters) on the interconnect network, there is some motivation to increase the power of each node by allowing multiple operations per cycle.
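The bandwidth at stake can be illustrated with a small calculation. This is our arithmetic, not the paper's: the 16-bit native data type and three-ports-per-FU convention are taken from the text, and the 600-MHz clock is the paper's stated target.

```python
# Aggregate operand bandwidth of a cluster's local register file (our
# sketch; 16-bit words and 3 ports per FU are from the text, 600 MHz is
# the paper's target clock).

WORD_BITS = 16        # native data type
PORTS_PER_FU = 3      # two reads + one write per cycle
CLOCK_HZ = 600e6

def cluster_bandwidth_bits_per_s(fus_per_cluster):
    ports = PORTS_PER_FU * fus_per_cluster   # 3-, 6-, 9-, or 12-ported file
    return ports * WORD_BITS * CLOCK_HZ

for fus in (1, 2, 3, 4):
    bw = cluster_bandwidth_bits_per_s(fus)
    print(f"{fus} FU(s): {PORTS_PER_FU * fus}-ported file, "
          f"{bw / 8 / 1e9:.1f} GB/s operand bandwidth")
```

Even a single-FU cluster sustains several gigabytes per second of operand traffic, which is why keeping this bandwidth local, rather than pushing it through the global network, dominates the cluster design.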

TABLE II. AREA (mm²) FOR 8-BIT CROSSBAR SWITCHES

A multiported register file in each cluster can act as fast local interconnect, reducing data-transfer latency between adjacent FU's in the cluster. VLIW architectures are usually designed using simple RISC-style FU's that read two operands from, and write one result back into, the register file in each cycle. This requires three ports per FU. Additional operations per cycle


TABLE III. DELAY (ns) FOR 16-BIT CROSSBAR SWITCHES

TABLE IV. AREA (mm²) FOR 16-BIT CROSSBAR SWITCHES

TABLE V. DELAY (ns) FOR 32-BIT CROSSBAR SWITCHES

TABLE VI. AREA (mm²) FOR 32-BIT CROSSBAR SWITCHES

Fig. 6. Simulation experiment for modified crossbar design.

Fig. 7. Additional routing for cluster-to-network connections.

can be supported by increasing the number of ports on the register file in multiples of three. We have, therefore, simulated register files with 3, 6, 9, and 12 ports so that one, two, three, or four FU's can be supported in each cluster. Before elaborating on our register-file simulation methodology, we would like to touch upon some important aspects of memory design, and explain the symbols that we use later in our discussions. Our basic design of a one-ported register file is illustrated in Fig. 9. An M × N-bit memory comprises M horizontal rows and N vertical columns of identical memory cells. Only one memory cell is shown in the figure. The memory cell comprises two cross-coupled inverters. Wnc and Wpc denote, respectively, the NMOS and the PMOS transistor sizes in the memory cell. For a single-ported memory, each memory cell is connected to one bit line on each side through an NMOS access transistor. The size of this access transistor is denoted by Wnm. Note that for a p-ported memory, there are p access transistors connecting each memory cell to p bit lines on each side of the memory cell. All bit lines are precharged using weak NMOS transistors;

these precharge transistors are always conducting. Each pair of bit lines—the bit and the complementary bit lines, corresponding to a specific port, on the two sides of each memory cell—has its own write circuitry and sense amplifier. The write circuitry and the sense amplifiers consist of strong transistors whose widths are denoted by Wns and Wps (= 2Wns). Similarly, the row-select, column-select, and write-signal lines are driven by strong buffers, and the sizes of the transistors constituting these buffers are also denoted by Wns and Wps (= 2Wns). For a large, multiported memory, the conventional design wisdom is to use minimum-sized access transistors (i.e., the minimum Wnm allowed by the technology), and to design the memory cell with the cell transistors (Wpc and Wnc) having the minimum sizes that ensure correct operation by providing the required sourcing and sinking capabilities for the bit-line current. Once we have a correctly functioning memory, there are several ways to improve its delay performance: 1) a better sensing scheme (i.e., a better design of the sense amplifier) and/or wider transistors for the sense amplifier; 2) stronger line drivers for the row-select and the column-select signals; and 3) larger memory-cell transistors. The first two are easy to understand—since the


Fig. 8. Layout of a single-bit eight-port crossbar switch.

memory-access delay for a memory read is measured as the time from the memory address appearing at the input of the row-column decoder to the correct cell content(s) appearing on the bit line(s) at the output of the sense amplifier(s), larger transistors in the address decoders, line drivers, and sense amplifiers help achieve better delay characteristics. Increasing the sizes of the memory-cell transistors (i.e., the sizes of the transistors in the cross-coupled inverters in the memory cell) also helps because this allows the transistors to sink (source) more current, so the bit line discharges (charges) faster through the cell transistor and the access transistor (which are connected in series). One therefore sees an improvement in the performance of the memory (faster discharging/charging of the bit line) as one increases the sizes of the cell transistors, starting with the minimum size. At some point, however, this improvement saturates, and increasing the sizes of the cell transistors no longer helps. For conventional applications that require large memories, the first two methods are more popular because increasing the cell-transistor sizes has a severe impact on area, even though it may improve performance. For video applications, however, performance is extremely important, and so one is probably justified in experimenting with cell-transistor sizes, even for multiported register files and large frame memories. For the register file, therefore, we have considered two different types of simulations, as described next. Case 1—Varying Cell Transistors: In this scheme, the widths of the NMOS access transistors (Wnm) and the memory cells' PMOS transistors (Wpc) are kept constant at 0.8 μm, and the widths of the memory cells' NMOS transistors (Wnc) are varied.
The widths of the NMOS and the PMOS transistors, denoted by Wns and Wps, respectively, in the word-line decoders (drivers), the sense amplifier, and the write circuitry are calculated as Wns = Wnc, Wps = 2Wns. Case 2—Constant Ratio of Cell and Access Transistors: Here, both Wnc and Wnm are varied with their ratio Wnc/Wnm fixed. The widths of the NMOS and the PMOS transistors in the word-line decoders (drivers), the sense amplifier, and the write circuitry are calculated as before. Strictly speaking, the above-mentioned ratio cannot be kept

TABLE VII. COMPARISON OF CROSSBAR SIMULATION MODEL WITH ACTUAL LAYOUT

fixed because of the stability problems discussed earlier, and so we have kept the ratio constant over a range. Through a number of simulations, the piecewise-constant ratio is calculated as

ratio = 3 for 1 ≤ ports ≤ 5
ratio = 8 for 6 ≤ ports ≤ 9
ratio = 13 for 10 ≤ ports ≤ 12.

Our simulations have revealed that the first method (Case 1) offers a wider range of area and delay values, and hence better scope for analyzing the video-processor design tradeoffs; owing to limited page space, we have therefore presented in this paper only the results of this first experiment. Tables VIII and IX present the simulation results for the area and delay of a 16-bit register file for different numbers of ports, different numbers of registers, and different transistor sizes. Some entries in Table VIII are blank; the transistor sizes in these cases are not suitable for driving a memory with that many ports. The simulation is performed as shown in Fig. 10. 1) Interpretation of Register-File Simulation Results: From the delay table, we observe that the register-file delay is essentially insensitive to the number of ports (particularly when the number of registers is not too large), whereas the area requirements increase appreciably as a function of both the number of ports and the number of registers. The optimal design point is thus a complex issue. As more ports are added to each register file, the number of FU's that can be supported in each cluster increases, reducing the number of clusters required. Furthermore, the efficiency of each FU is typically increased. This may compensate for the lower register density of the multiported register files. Unfortunately, when we increase the number of operations per cycle by supporting multiple FU's with a multiported register file, the number of live register variables also increases, thereby requiring a larger number


Fig. 9. Basic design of a memory cell.

of registers in each register file. If enough registers are not available, temporary data must be stored in memory. Frequent accesses to local or external memory may have a significant impact on the cycle time. D. Local-Memory Modeling and Design Like the register file, local memory must be available in each cluster to store video data since global memory cannot provide

adequate bandwidth. Multiporting this local memory may also be helpful because multiported memories, in essence, provide additional conflict-free interconnect between sets of functional units as well as increased bandwidth into each individual memory bank. In a different analysis [13], we evaluated several alternate local memory configurations, including single-ported and multiported memories, in the context of a dedicated VSP module, and discovered that multiported memories with two,


TABLE VIII. DELAY (ns) FOR 16-BIT MULTIPORTED LOCAL REGISTER FILES (CASE 1)

TABLE IX. AREA (mm²) FOR 16-BIT MULTIPORTED LOCAL REGISTER FILES (CASE 1)

four, and eight ports sometimes increased system performance significantly, despite increasing the cycle time. There are two important reasons to believe that multiported memories are not as valuable for the design proposed here. First, the earlier design [13] used a multistage network which accounted for some of the resource conflicts that occurred when using single-ported memories. The crossbar network used in our present design will have fewer conflicts. Also, the earlier processor used very small memories, ranging from 16 to 144 bytes. While this was adequate for the application studied, we would like to use much larger memories on the VLIW VSP chip. The area penalty from multiporting these large memories is very severe; so, in order to maximize the amount of local memory available, we have restricted ourselves to five ports

Fig. 10. Simulation experiment for register file (Case 1).

Fig. 11. Simulation experiment for SRAM.

only in our simulation of the local SRAM's. Furthermore, for density reasons, we have kept the sizes of the NMOS access transistors (Wnm) at each port and the PMOS transistors (Wpc) in each memory cell at a minimum fixed size of 0.8 μm. The widths of the NMOS transistors (Wnc) in each memory cell are varied in order to understand the impact of large cell transistors on the performance of the memory. The widths of the NMOS and the PMOS transistors, denoted by Wns and Wps, respectively, in the word-line decoders (drivers), the sense amplifier, and the write circuitry are kept at the following fixed values: Wns = 1.6 μm, Wps = 3.2 μm. The simulation of the local SRAM is performed as shown in Fig. 11. The delay and area simulation results, for varying numbers of memory ports and for different memory and transistor sizes, are presented in Tables X and XI. Because the tables are very detailed, Figs. 12 and 13 help in understanding the effects of parameter variations by plotting the delay and area results; the surfaces represent, from bottom to top, 2-, 8-, 32-, 128-, 512-, 2K-, and 32K-byte memories. 1) Interpretation of Memory Simulation Results: From the results presented in Tables X and XI, it can be seen that even though larger transistors in the cross-coupled inverters in the memory cells improve performance, they also increase the area considerably; although the delay performance is excellent, the area penalty may be too severe for a VLIW design. Even the minimum-size design allows only about 400 bytes/mm².6 The common practice of double-buffering the data memory [13] further reduces the available storage.
Therefore, in order to provide a high-density alternative, we have applied several circuit-design and layout techniques to improve the density of the single-ported and the two-ported memories—at access times equivalent to those of a more general design, a minimum-size version of our high-density memory design allows about 2650

6 This is calculated from the fact that, with Wnc = 2.0 μm, the area consumed by a four-ported 32K-byte SRAM is about 74.88 mm². Given that there are already two multiported interconnect structures on chip, any additional performance of multiported memory will not make up for the extra off-chip traffic caused by the fact that there is not enough on-chip memory.
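The density arithmetic behind this footnote can be checked with a short sketch (the 74.88 mm² figure for the four-ported 32K-byte SRAM is taken from the footnote above):

```python
def density_bytes_per_mm2(capacity_bytes, area_mm2):
    """Storage density of a memory macro in bytes per mm^2."""
    return capacity_bytes / area_mm2

# Four-ported 32K-byte SRAM with Wnc = 2.0 um (figure from footnote 6):
general_density = density_bytes_per_mm2(32 * 1024, 74.88)  # roughly 440 bytes/mm^2
```

Comparing this with the roughly 2650 bytes/mm² of the single-ported high-density design quoted in the text shows about a 6× density gap, which motivates restricting the number of ports on the local SRAM's.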


TABLE X. DELAY (ns) FOR MULTIPORTED SRAM'S

bytes/mm² of single-ported memory and over 2200 bytes/mm² of double-ported memory. An interesting observation for one-ported and two-ported memories is that an increase in the sizes of the memory-cell transistors beyond a certain point (when the sizes are 8.0 μm or more) degrades the delay performance. The reason, we believe, is that for a small number (one or two) of ports, the overall memory size is such that the loading capacitances contributed by the large cell transistors outweigh the improvement due to faster discharging (or charging) of the bit line. Another interesting point worth noting is the effect of the number of ports on the delay performance of the memory—increasing the number of ports does not increase the delay of a register file significantly, even though it does so for a 32K SRAM. Since the delay of a register file is so insensitive to the number of ports, one may ask what prevents us from implementing a register file with many ports so that a large number of FU's can be interconnected. The reason such a solution is not feasible is that the area of the register file (or of any memory, for that matter) increases nonlinearly with the number of ports, and soon becomes


TABLE XI. AREA (mm²) FOR MULTIPORTED SRAM'S

TABLE XII. COMPARISON OF MEMORY SIMULATION MODEL WITH ACTUAL LAYOUT

impractical for any system implementation. That is one of the main reasons that a register file with a large number of ports is not a feasible solution to the memory bottleneck and memory-access conflicts, even for a small number of registers. 2) Verification of Memory Simulation Results: To demonstrate the accuracy of our simulation model and methodology, we constructed a complete layout of the two-port 8 × 8-bit SRAM, and compare in Table XII the actual layout simulation results with numbers obtained from our simulation model. E. Design of Arithmetic Units The arithmetic units for the VLIW VSP are essentially the same as those for other types of signal-processing chips, with the provision that only small integer operations are


Fig. 12. Plot of delay (ns) for multiported SRAM's of sizes 2, 8, 32, 128, 512, 2K, and 32K.

Fig. 13. Plot of area (mm²) for multiported SRAM's of sizes 2, 8, 32, 128, 512, 2K, and 32K.

required. We have not performed detailed custom design of the computational components of the FU's because we do not believe that they are likely to be the critical pipeline stages. This is supported by measurements from other published designs. A 32-bit ALU design in 0.25-μm CMOS, with a delay of 1.5 ns, has been published by Suzuki [17]. This design requires only 0.6 mm². A full 54 × 54-bit multiplier requiring 12.8 mm² in 0.25-μm CMOS, with a delay of 4.4 ns, is described by Ohkubo [18]. This is far more complex than what is justified in a video signal processor, but the timing of each level of logic has been measured for this design and, based on that data, we believe that an 8-bit multiplier can be built in under 2 mm² and should perform much faster. A two-cycle pipelined 16 × 16-bit multiplier is also a viable option. VSP functional units are slightly different from standard RISC functional units; specialized operations such as saturation arithmetic and magnitude-difference operations are important, but have minimal impact on cycle time.

IV. EVALUATION OF ARCHITECTURAL TRADEOFFS

High-performance processor chips in a 0.25-μm technology will likely be in the 200–400 mm² range. This is supported by the fact that some recent microprocessor designs have die areas in that range. For example, Ikumi [19] has


reported the design of a superscalar microprocessor with a die area of 17.34 × 17.3 mm² in a 0.5-μm CMOS technology. Another design, for a 64-bit microprocessor with a die area of 17.7 × 17.8 mm², has been reported by Charnas [20]. This latter design is also implemented in a 0.5-μm technology. Since we are targeting a more advanced process technology (0.25-μm), it is reasonable to assume an even greater die size because chip area tends to increase from one generation of IC technology to the next [21]. This technology and die-size target make for an architecture substantially more complex than that of current-generation video signal processors. The number of transistors available also gives us a significant range of microarchitectural alternatives that is interesting to explore. In a VLIW architecture, about 40%–60% of the area would probably be devoted to the data path, the communication network, and the local memory, leaving the remaining area for modules such as the instruction memory, the instruction register, the branch unit, the program counter, the bypass circuitry, the on-chip control logic, and interconnections. This provides a wide range of choices for a video processor design. This section evaluates tradeoffs among the data path, communication, and local memory based on the architectural data presented here. Let us assume a die size of 20.0 × 20.0 = 400 mm² for our target chip, where 50% of the area will be devoted to the FU's and the interconnection network.7 Interesting alternatives now exist at several points in the design space. • Peak performance is desired: If peak performance is the most critical issue, then high parallelism and fast cycle times can be emphasized. Let us suppose that we are interested in a 1-GHz, 16-bit processor design. Since we have decided in favor of the folded crossbar for providing the global interconnection, we see from Table III that a 16-port network is the best choice.
The area of such a network is about 2.6 mm², leaving (200 − 2.6) = 197.4 mm² for the FU clusters. A 16-port network can support a maximum of 16 clusters, which imposes an area constraint of (197.4/16) = 12.34 mm² per cluster. From a literature survey of published data-path-unit designs, as outlined in Section III-E, we make some reasonable guesses, and assume that a 16-bit ALU can be built in about 0.4 mm², an 8 × 8-bit multiplier will consume about 2.0 mm², and a 16-bit barrel shifter can be built in about 1.0 mm². All of these modules, we believe, can be made to operate with a 1-ns cycle time with suitable pipelining. If we assume homogeneous8 FU's, with all three arithmetic units described above in each, then each FU consumes (0.4 + 2.0 + 1.0) = 3.4 mm². Let us assume that we will employ two FU's in each cluster (so that we can have 32 FU's in total working in parallel in the 16 clusters, performing 32 operations per cycle). The presence of two FU's demands, according to our design, the presence of

7 Note that devoting more or less area to the FU's opens up further opportunities to consider additional design tradeoffs.
8 Note that one can also consider the design tradeoffs prompted by the presence of heterogeneous FU's, where different combinations of arithmetic units feature in different FU's. We, however, restrict ourselves to homogeneous FU's only.


a six-ported local register file in each cluster. We observe from Tables VIII and IX that we can implement a six-ported, 64-register register file (minimum sized), with an access time of less than 1 ns, in about 0.61 mm². Note that increasing the number of register ports beyond six and meeting the timing requirements by increasing the transistor sizes may not make sense because, with only 64 registers and more than two FU's, we may very easily run out of active registers. Two FU's and a six-ported register file consume (3.4 × 2 + 0.61) = 7.41 mm², leaving (12.34 − 7.41) = 4.93 mm² for the local SRAM. Let us suppose that, out of the 4.93 mm² available, we use about 4.5 mm² for implementing the local memory (leaving the rest for interconnections). From Table X, we see that we can have about 128 bytes of a three-ported SRAM at less than 1 ns, and such a memory consumes about 0.27 mm² (Table XI). This indicates a design with ⌊4.5/0.27⌋ = 16 memory banks. If double buffering is necessary, the local memory can be organized as eight double-buffered banks, where the size of each bank is 128 bytes. Based on the above design, we can conceive of a VLIW architecture that features a 16-bit crossbar network providing global interconnection among 16 functional-unit clusters; each cluster features two FU's connected by a 64-register, six-ported register file and eight banks of double-buffered local SRAM's (for storing video data). Fig. 14 shows some of the details of the data-path design for such an architecture. As may be obvious, the architecture is based on a four-stage pipeline: 1) instruction fetch; 2) register access/decode; 3) instruction execute; 4) memory write-back (into the register file). Note that, in the figure, we have clarified all seven connections to the register file using a separate arrow for each connection; in reality, however, the register file is six-ported, and the input from the crossbar network shares one of the six ports with an FU.
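The area-budget arithmetic above can be sketched as a short check. All of the module areas below are the assumed estimates stated in the text (not measured values), and the function itself is illustrative only:

```python
import math

def cluster_budget(total_mm2=200.0, xbar_mm2=2.6, ports=16,
                   fus=2, fu_mm2=3.4, rf_mm2=0.61,
                   sram_budget_mm2=4.5, bank_mm2=0.27):
    """Area budget per cluster for the peak-performance design point.

    Returns the area available per cluster, the area left for local
    SRAM after FU's and the register file, and the number of SRAM
    banks that fit in the SRAM budget.
    """
    per_cluster = (total_mm2 - xbar_mm2) / ports          # ~12.34 mm^2
    left_for_sram = per_cluster - (fus * fu_mm2 + rf_mm2)  # ~4.93 mm^2
    banks = math.floor(sram_budget_mm2 / bank_mm2)         # 16 banks
    return per_cluster, left_for_sram, banks
```

Calling `cluster_budget()` with these defaults reproduces the 12.34 mm² per-cluster constraint, the 4.93 mm² left for local SRAM, and the 16 memory banks derived above.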
Following the steps outlined above for an example design of a VLIW VSP data path, one can come up with alternative designs with different numbers of functional units (in each cluster), different register-file sizes (in each FU), and/or different numbers and sizes of local SRAM's (in each FU). Fig. 15 shows a number of these alternative data-path designs. • Richer connectivity is desired: If flexibility and improved schedulability during compilation have a higher priority, then a design with richer connectivity is more suitable, even if it means that we have to sacrifice a little in terms of performance. Let us assume that we are prepared to pay a small penalty in performance, and settle for a clock-cycle time of 1.5 ns. From Table VIII, we see that it is possible to build a 12-ported, 256-register register file with an access time just below 1.5 ns; the area of such a large register file is about 6.0 mm² (Table IX). A 12-ported register file can support four FU's, each of which, as before, is about 3.4 mm² in area. Since it is helpful to have large local SRAM's, we find from Table X that single-ported, 2K-byte SRAM's can be implemented with access times of less than 1.5 ns. Let us assume that we need 8K bytes of memory in each cluster, indicating

Fig. 14. Functional-unit data path.

that the memory is organized as four banks. From Table XI, we find that four 2K-byte memory banks will consume about (4 × 0.81) = 3.24 mm² of chip area. This makes the area of each cluster equal to (6.0 + 4 × 3.4 + 3.24) = 22.84 mm². Now, we note from Table III that it is possible to implement a 32-port crossbar with a cycle time of around 1.5 ns. This 32-port network has an area of 11.68 mm², indicating that we can have ⌊(200 − 11.68)/22.84⌋ = 8 functional-unit clusters. The design, therefore, features a crossbar network with 32 ports, and there are eight functional-unit clusters, each consuming four network ports, as opposed to a single cluster on each port (as discussed before). The example design is shown in Fig. 16. Once again, a number of tradeoffs are possible, depending on

the clock-cycle time, the number of ports on the register file, the size of the local memory, and the number of ports on the crossbar network.

V. CONCLUSIONS AND FUTURE DIRECTIONS

Digital video applications are clearly among the most computationally demanding tasks. As it becomes feasible to execute real-time VSP tasks on programmable architectures, more applications will be developed. Single-chip, programmable video processors are possible in the near future, but numerous architectural questions need to be answered, and programming tools need to be developed. VLIW has enormous potential in this domain to provide high performance and a great deal of

Fig. 15. Illustration of data-path design tradeoffs.

flexibility in application design; however, the technological challenges are difficult. The architecture will have more functional units and operate at much higher clock rates than any other VLIW design. In this paper, we outlined experiments to evaluate the VLSI design tradeoffs in building a wide, fast VLIW VSP architecture. After analyzing the tradeoffs, we have come up with a number of novel designs (logic, circuit, and layout) for high-speed crossbar networks and multiported memory structures, and have developed highly parameterized C-based simulation

model generators that allow us to explore the VLIW video-processor design space and study the area and delay tradeoffs defined by the characteristics of an experimental 0.25-μm process. In evaluating the results of our experiments, we can draw several conclusions.
• Crossbar switches are fast, small, and practical up to 32 ports.
• Local register files are unlikely to be a performance or area bottleneck. They can, however, be area inefficient for a very large number of ports.

Fig. 16. Data-path design with richer connectivity.

• Multiported data memories are large and slow. Banked global memory schemes are very complex, and can cause stalls. Local memory is fast and simple; it is limited more by density than performance as long as each memory bank is small. • ALU's and other arithmetic units are not a major factor in the implementation. • Compiled-code simulations are required to determine utilization and effective performance. Note that, even though we have identified different microarchitectures, all of which offer high performance and conform to the same cycle-time specifications, we need to perform compiled-code simulations in order to understand which of these designs perform well for real applications. To this end, efforts to develop an architectural simulator and a compiler, with specific scheduling mechanisms to match the architecture and VSP applications, are underway. The compiler efforts are based on the Stanford SUIF framework. We intend to continue refining our design of the VLIW VSP architecture, and future research in this direction includes experimenting with the following. • Alternative circuit designs, e.g., designing the multiplexer cells in the crossbar network using AOI (and–or–invert) logic gates, different implementations (simple gates versus a complex gate) of an AOI gate, etc. • More metal layers to increase layout efficiency. • Innovative clock-distribution schemes that minimize propagation delay and skew.

• Four-transistor pseudo-DRAM’s (as opposed to SRAM’s) as storage elements to improve both the speed and density of the register file and the local memory. • More accurate simulation models that take into account interconnect delays and crosstalk capacitances. • Actual simulations of data-path elements for better estimation of their area and delay. • Functional-unit clusters incorporating heterogeneous functional units. • Design of the control unit, based on area and delay curves. • Good power-estimation schemes and ways to minimize power dissipation. • An automated software framework to evaluate the architectural-design tradeoffs (as opposed to the present manual exploration of the design space). • Core-based system design where a VSP core is surrounded by mask-programmable gate arrays. REFERENCES [1] C. P. Feigel, “TI introduces four-processor DSP chip,” Microprocessor Rep., pp. 22–25, Mar. 1994. [2] C. M. Huizer et al., “A programmable 1400 MOPS video signal processor,” in Proc. IEEE Custom Integrated Circuits Conf., May 1989, pp. 24.3.1–24.3.4. [3] B. Case, “First trimedia chip boards PCI bus,” Microprocessor Rep., Nov. 1995. [4] Chromatics, “Mpact/3000 data sheet.” [Online]. Available WWW: http://www.mpact.com/tech/index.html [5] A. Wolfe and J. P. Shen, “A variable instruction stream extension to the VLIW architecture,” in Proc. 4th Int. Conf. Architectural Support for Programming Languages and Operating Syst., Apr. 1991, pp. 2–14.


[6] J. A. Fisher, “Very long instruction word architectures and the ELI-512,” in Proc. 10th Annu. Int. Symp. Comput. Architecture, 1983, pp. 140–150.
[7] R. P. Colwell et al., “A VLIW architecture for a trace scheduling compiler,” in Proc. 2nd Int. Conf. Architectural Support for Programming Languages and Operating Syst., 1987, pp. 180–192.
[8] W. Maly et al., “Memory chip for 24-port global register file,” presented at the IEEE Custom Integrated Circuits Conf., May 1991.
[9] K. Ebcioglu, “Some design ideas for a VLIW architecture for sequential-natured software,” IBM Res. Rep., Apr. 1988.
[10] J. Gray et al., “VIPER: A 25-MHz 100-MIPS peak VLIW microprocessor,” in Proc. IEEE Custom Integrated Circuits Conf., 1993, pp. 4.4.1–4.4.5.
[11] J. Labrousse and G. Slavenburg, “A 50 MHz microprocessor with a VLIW architecture,” presented at the IEEE Int. Solid-State Circuits Conf., 1990.
[12] J. A. Fisher and B. R. Rau, “Instruction-level parallel processing,” Science, vol. 253, pp. 1233–1241, Sept. 1991.
[13] S. Dutta, W. Wolf, and A. Wolfe, “VLSI issues in memory-system design for video signal processors,” in Proc. IEEE Int. Conf. Comput. Design (ICCD), Oct. 1995, pp. 498–503.
[14] K. Choi and W. S. Adams, “VLSI implementation of a 256 × 256 crossbar interconnection network,” in Proc. Int. Parallel Processing Symp., Mar. 1992, pp. 289–293.
[15] F. E. Barber et al., “A 64 × 17 nonblocking crosspoint switch,” in Proc. IEEE Int. Solid-State Circuits Conf., 1988, pp. 116–117 and 322.
[16] M. Cooperman, A. Paige, and R. Sieber, “A single 64 × 16 broadband switch,” in Proc. IEEE Int. Symp. Circuits Syst., 1989, pp. 230–233.
[17] M. Suzuki et al., “A 1.5 ns, 32b CMOS ALU in double pass-transistor logic,” in Proc. IEEE Int. Solid-State Circuits Conf., 1993, pp. 90–91.
[18] N. Ohkubo et al., “A 4.4 ns CMOS 54 × 54-b multiplier using pass-transistor multiplexer,” in Proc. IEEE Custom Integrated Circuits Conf., 1994, pp. 26.4.1–26.4.4.
[19] N. Ikumi et al., “A 300 MIPS, 300 MFLOPS four-issue CMOS superscalar microprocessor,” in Proc. IEEE Int. Solid-State Circuits Conf., 1994, pp. 204–205.
[20] A. Charnas et al., “A 64 b microprocessor with multimedia support,” in Proc. IEEE Int. Solid-State Circuits Conf., 1995, pp. 178–179.
[21] S. Dutta and W. Wolf, “Asymptotic limits of video signal processing architectures,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, pp. 545–561, Dec. 1995.


Santanu Dutta (S’86–M’87) received the B.Tech. degree with honors in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, in 1987, the M.S. degree in electrical (computer) engineering from the University of Texas (UT), Austin, in 1990, the M.A. degree in engineering from Princeton University, Princeton, NJ, in 1994, and the Ph.D. degree in electrical (computer) engineering from Princeton University in 1996. His main research interests include the design of high-performance video signal processing architectures, circuit simulation and analysis, and design and synthesis of low-power digital systems. From 1987 to 1989, he was a full-time Research Staff member at the VLSI Design Laboratory of Texas Instruments (TI) Incorporated, Dallas, where he was involved in the research and development of CAD tools. As a student at UT Austin, on a leave of absence from TI from 1989 to 1991, his main focus was path-tracing algorithms for interconnect analysis. From 1991 to 1992, he worked part-time at Ross Technology Inc., primarily as a Circuit Designer and Layout Engineer. From September 1992 to August 1996, he was a doctoral student at Princeton University, working on the design and analysis of video signal processing systems. During this time, he spent a year at AT&T Bell Laboratories, investigating the impact of deep-submicron VLSI techniques on architectural/system design. Since August 1996, he has been working as a VLSI (circuit and logic) designer at Philips Semiconductors in the TriMedia Group, where he is currently a Senior Design Engineer.

Kevin J. O’Connor (S’72–M’75–SM’91) was born on September 26, 1951. He received the B.E.E. degree from General Motors Institute, Flint, MI, in 1974, and the M.S.E.E. degree from Stanford University, Stanford, CA, in 1975. From 1975 to 1978, he was on the Technical Staff at Hughes Aircraft Company, Culver City, CA, where he was responsible for designing ECL gate arrays and investigating applications of GaAs technology. He has been with Bell Laboratories as a Member of the Technical Staff since 1978, when he joined the Memory Design Department, Allentown, PA. His early work for AT&T involved high-speed bipolar Schottky/ECL oxide-isolated SRAM circuit and device design in the 4-kbit era. He subsequently designed one of the first submicron-device SRAM’s, using a 1-μm NMOS technology that achieved a 5-ns access time. He was one of the principal developers of CRISP, the first RISC microprocessor developed within AT&T, and was responsible for the design of the processor’s embedded cache memories, which were reported at the 1987 ISSCC. In 1988, he joined the ASIC Design Laboratory, supporting cell libraries, macrocell generators, and hardware simulators. In 1992, he joined the Research Staff of AT&T (now Lucent Technologies), Murray Hill, NJ, in the Silicon Electronics Laboratory. His research interests are in the area of deep-submicron CMOS devices and circuits, and include multichip module technology and its potential for ground-breaking applications. Mr. O’Connor is a past General Chairman of the Symposium on VLSI Circuits. He has played an active role with the IEEE through talks and papers at various conferences and workshops, and has been an active member of the ISSCC Program Committee for many years. He has been a Guest Editor for several issues of the IEEE JOURNAL OF SOLID-STATE CIRCUITS.
He was the IEDM Short Course Vice-Chairman and Chairman in 1989 and 1990, and is currently an AdCom member of both the IEEE Electron Devices Society and the IEEE Solid-State Circuits Society.

Wayne Wolf (S’78–M’83–SM’91–F’98) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1980, 1981, and 1984, respectively. He was with AT&T Bell Laboratories, Murray Hill, NJ. He is presently an Associate Professor of Electrical Engineering at Princeton University, Princeton, NJ. His research interests include computer-aided design for VLSI and embedded systems, and application-driven architecture design, especially video and multimedia computing systems. Dr. Wolf is a member of Phi Beta Kappa and Tau Beta Pi, and also a member of the Association for Computing Machinery.

Andrew Wolfe (S’86–M’90) received the B.S.E.E. degree from The Johns Hopkins University in 1985, and the M.S. and Ph.D. degrees in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA, in 1987 and 1992, respectively. He was an Assistant Professor at Princeton University from 1991 to 1997. He is now Director of Technology and S3 Fellow at S3, Inc., Santa Clara, CA. At Princeton, his interests included computer architecture, video signal processing, and embedded computing. Dr. Wolfe received the Burroughs Fellowship in 1985, and was an SRC Fellow from 1986 to 1991. He received the Walter C. Johnson Award for Teaching Excellence in 1995 and 1997, and the Engineering Council Teaching Award in 1996 at Princeton.
