A 64-b microprocessor with multimedia support

August 27, 2017 | Author: Ken Shin | Category: Image Processing, Video Compression, Design Methodology, Solid State Devices and Circuits, System performance, Electrical And Electronic Engineering, Power Dissipation, Texture Mapping, Register File


IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 11, NOVEMBER 1995


A 64-b Microprocessor with Multimedia Support

Lavi A. Lev, Member, IEEE, Andy Charnas, Marc Tremblay, Member, IEEE, Alexander R. Dalal, Member, IEEE, Bruce A. Frederick, Member, IEEE, Chakra R. Srivatsa, Member, IEEE, David Greenhill, Dennis L. Wendell, Member, IEEE, Duy Dinh Pham, Member, IEEE, Eric Anderson, Hemraj K. Hingarh, Member, IEEE, Inayat Razzack, James M. Kaku, Ken Shin, Marc E. Levitt, Member, IEEE, Michael Allen, Member, IEEE, Philip A. Ferolito, Richard L. Bartolotti, Robert K. Yu, Member, IEEE, Ronald J. Melanson, Member, IEEE, Shailesh I. Shah, Member, IEEE, Sophie Nguyen, Sundari S. Mitra, Member, IEEE, Vinita Reddy, Member, IEEE, Vidyasagar Ganesan, Member, IEEE, and Willem J. de Lange

Abstract: A 167 MHz 64 b VLSI CPU chip is described. The chip delivers 333 MFLOPS (peak), with an estimated system performance of 270 SPECint92/380 SPECfp92 (@167 MHz, 2 MB E-cache). The 17.7 x 17.8 mm die is fabricated in a 0.5 micron CMOS technology with four metal layers and contains 5.2 M transistors. The superscalar processor is capable of sustaining an execution rate of four instructions per cycle even in the presence of conditional branches and cache misses. Four fully pipelined 8 x 16 b multipliers and four single-cycle latency 16 b adders combine to speed up image processing, 2-D and 3-D graphics, and video compression/decompression by up to an order of magnitude. High clock speed was obtained by the use of delayed reset logic, a new register file design, and novel comparators. Strict design methodology allowed fully functional first silicon which met all speed targets. The power dissipation of the chip is 28 W.

Architecture: 64-b SPARC-V9 with multimedia instruction extensions
Pipeline: 4-issue superscalar, 9-stage pipeline
Clock frequency: 167 MHz
Performance @167 MHz, 4 MB E$: 270 SPECint92, 380 SPECfp92
Power supply: 3.3 V
Power @3.3 V, 167 MHz: 28 W; 20 mW sleep mode
Die size: 17.7 mm x 17.8 mm = 315 mm^2
Transistors: 5.2 million (3.4 million logic)
Die pads: 323 signal; 291 power/ground = 614

I. INTRODUCTION

A QUAD-issue microprocessor chip implements a 64 b architecture extension (SPARC V9) to a popular 32 b RISC instruction set (SPARC V8). Additional instructions and dedicated hardware provide up to 10x speedup of image processing and rendering algorithms, including video compression/decompression and texture-mapped 3-D triangles. The chip contains 5.2 million drawn transistors on a 17.7 x 17.8 mm die fabricated with a 0.5 µm CMOS technology with four metal layers (Tables I-III). The package is a 520 pin plastic BGA with 187 power and ground pins. Operating at 167 MHz, it dissipates 28 W from a 3.3 V supply.

An instruction prefetch unit contains a two-way 16 kB instruction cache, a 64-entry fully associative instruction TLB, and a dual-ported next field RAM. The next field RAM contains two branch predictors, a next cache line index, and a set prediction bit for every four instructions in the instruction cache. Prefetched instructions are placed in a 12-entry instruction buffer, and 44 predecode bits are added to each instruction to simplify instruction grouping. Grouping logic performs in-order dispatch of 0, 1, 2, 3, or 4 instructions per clock, using a custom dynamic block containing 250 comparators and a standard cell block containing 68 k transistors. Separate fully pipelined 3-cycle latency floating-point multiplier and adder units and a nonpipelined divider/square root unit are complemented by four 8 x 16 b multipliers and four 16-b adders to form the floating-point and graphics unit (FGU). When operating on fixed-point data, the processor is capable of exploiting its four fixed-point adders and four fixed-point multipliers, in addition to two other instructions, to achieve ten operations per cycle, or 1.67 billion operations per second (BOPS) at 167 MHz.

In order to put the circuits in the proper context, we first present the microarchitecture of UltraSPARC. This includes the Prefetch/Dispatch Unit (Section III-B), the Integer Execution Unit (Section III-C), the Floating-Point/Graphics Unit (Section III-D), the Load/Store Unit (Section III-E), the Memory Management Unit (Section III-F), the External Cache (Section III-G), and the System Interface (Section III-H). Section IV covers general circuit techniques such as Delayed Reset Logic (Section IV-B) but also some specific implementations (comparators in Section IV-C and the register file in Section IV-D). Techniques used for the power network implementation and verification are described in Section V.

Manuscript received May 5, 1995; revised September 5, 1995. The authors are with Sun Microsystems, Mountain View, CA 94043 USA. IEEE Log Number 9415590.
0018-9200/95$04.00 © 1995 IEEE

Package: 520 pin plastic Ball Grid Array (BGA)
Average gates/path: 13
Average % of path delay taken by gates: 72%
Average % of path delay taken by gates (worst 10 paths on ...): 84%

TABLE II
PROCESS CORNERS AND SPEEDUP EXAMPLE
Columns: Process Corner; Speedup Required; PMOS; NMOS; Supply Voltage; Temp

TABLE III
PROCESS CHARACTERISTICS
Process: n-well CMOS
Tox: 8.5 nm
Metal 1, 2: 1.6 µm pitch; 55 mΩ/sq
Metal 3: 1.8 µm pitch; 55 mΩ/sq
Metal 4: 4.0 µm pitch; 24 mΩ/sq

Fig. 1. Overall block diagram of UltraSPARC.

Fig. 2. Pipeline of UltraSPARC (integer pipe and floating-point/graphics pipe).

II. MICROARCHITECTURE

A. Goals and Directions

The overall architecture of UltraSPARC was designed with the following goal in mind: "the processor should be capable of sustaining an execution rate of four instructions per cycle even in the presence of conditional branches and cache misses at a high clock rate." A simplified block diagram of UltraSPARC is shown in Fig. 1. The front-end of UltraSPARC is responsible for keeping the nine functional units busy by prefetching instructions based on a dynamic branch prediction mechanism and on a next field address that allows single-cycle branch following. As long as branches are predicted correctly (typically the case for nine out of ten branches), the front-end can supply four instructions per cycle to the core execution block.

In order to reach the target cycle time on UltraSPARC-I (the first member of the family), a simple in-order-execution, out-of-order-completion model was chosen. Instructions are executed as they appear in the code, but long-latency instructions and nondeterministic operations are allowed to finish out of order. For example, floating-point divide and square-root instructions, as well as loads missing in the data cache, are allowed to complete out of order with respect to other integer and/or floating-point instructions. A more complex out-of-order execution engine was investigated at length and rejected because of its impact on both the cycle time and the schedule [1].

The back-end of UltraSPARC addresses the main bottleneck encountered by modern processors, namely processor stalls due to data starvation. The nonblocking data cache, the load buffer, and the store buffer are organized in a way that makes the second-level cache appear as if it were on-chip. Full throughput of one load or store per cycle is provided to the external cache (Section III-E). Providing access to a large cache (up to 4 megabytes) in every cycle can eliminate a significant portion of stalls due to load misses.

A pipeline diagram is shown in Fig. 2, where each stage is briefly described. The conceptual pipeline model is simple: once dispatched, a group of instructions percolates down the pipeline and updates the register file (integer or floating-point) synchronously. Contrary to some existing SPARC processors (e.g., SuperSPARC, microSPARC), there is no floating-point queue holding floating-point instructions and decoupling the integer unit from the floating-point unit (a common source of functional bugs), and no complicated reorder buffer for guaranteeing precise exceptions. The integer pipe is simply stretched so that it matches the length of the floating-point pipe.
This is accomplished through a novel structure called the completion unit, which does not increase the number of feedthroughs in the integer datapaths and thus preserves the cycle time while keeping area impact to a minimum.

B. Prefetch/Dispatch Unit (PDU)

Instructions are prefetched from a pseudo two-way 16 kilobyte instruction cache. Each line in the I-cache contains eight instructions (32 bytes). Every pair of instructions has a 2-b branch prediction field which maintains the history of a possible branch in the pair. The four prediction states are the conventional strongly taken, likely taken, likely not-taken, and strongly not-taken [3]. The advantage of the in-cache prediction scheme is that it avoids the alias problems [2] encountered in branch history buffers and other similar structures. Every single branch in the I-cache has its dedicated prediction bits (ignoring the rare case of branch couples), which translates into a successful prediction rate of 88% for integer code, 94% for floating-point code (SPEC92), and 90% for typical database applications.

Every group of four instructions in the cache has a "next field," which is simply a pointer to where the prefetcher should access instructions for the very next cycle. In the case of sequential code, or for code with a branch predicted not-taken, the next field points to the next four instructions in the cache. If a branch is predicted taken, the next field contains the I-cache index (including the set) of the branch target. The advantage of this scheme is that the next field can always be fed back to the I-cache without qualifying a possible branch. In order to provide a one-cycle loop back to the I-cache, a fast dual-ported structure was used to implement the next field and the branch prediction bits. Only one set of the cache is accessed during a fetch, saving power and reducing the cache cycle time. Both tags are read so that an incorrect set prediction can be corrected. A two-cycle penalty occurs for a set misprediction. The next field mechanism allows UltraSPARC to speculate five branches deep, representing up to 18 instructions.

Instructions prefetched by the PDU are expanded to 76 b in order to facilitate the decoding done by the grouping logic. These decoded instructions are forwarded to a 12-deep instruction buffer, which allows the prefetcher to get ahead of the execution units.
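
The 2-b prediction field described above behaves like a conventional saturating counter. A minimal Python sketch follows. The state encoding (0 to 3), the initial state, and the names `TwoBitPredictor` and `accuracy` are illustrative assumptions, not taken from the paper, and the sketch ignores the sharing of one field between a pair of instructions.

```python
class TwoBitPredictor:
    """Conventional 2-bit saturating counter (encoding is an assumption)."""

    STATES = ("strongly not-taken", "likely not-taken",
              "likely taken", "strongly taken")

    def __init__(self, state=1):
        self.state = state  # integer in 0..3, index into STATES

    def predict(self):
        # States 2 and 3 predict taken; states 0 and 1 predict not-taken.
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at the ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes, predictor=None):
    """Fraction of branch outcomes predicted correctly, in program order."""
    predictor = predictor or TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits / len(outcomes)


# Hypothetical loop-closing branch: taken nine times, then falls through.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(loop_branch))
```

A single loop exit costs only one misprediction with this scheme, because one not-taken outcome moves the counter from strongly taken to likely taken without flipping the prediction.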
As long as the instruction queue is kept almost full, cache miss, set miss, and micro-TLB (µTLB) miss penalties can be hidden from the execution units. A single-entry µTLB provides the prefetcher with a local copy of the last virtual-to-physical address translation. In the rare case of a µTLB miss, a 1-cycle fetch penalty is incurred in order to get the address from the 64-entry fully associative instruction TLB (ITLB).

The grouping logic always looks at the next four candidates in the instruction buffer and, based on resource availability and dependencies, issues up to four instructions. Maintaining more than one program counter (PC) per group allows UltraSPARC to dispatch, in the same group, instructions from two adjacent basic blocks. The logic and circuitry related to the grouping stage are covered in Section IV-C.

C. Integer Execution Unit (IEU)

The main responsibility of the integer execution unit (IEU) is to perform the computation for all integer arithmetic/logical operations. Dual 64-b adders implemented in dynamic circuitry, an inverter, and very little extra logic (muxes for immediate bypasses) form the basic cycle time of the machine (together with the data cache access). A separate 64-b adder is also provided for virtual address additions for memory instructions.


A novel 3-D register file supporting seven read ports and three write ports is described in detail in Section IV-D. A simple 64-b integer multiplier and divider complement the IEU. The multiplication unit implements a 2-b Booth encoding algorithm with an "early-out" mechanism, with a typical latency of eight clock cycles. A 1-bit nonrestoring subtraction algorithm is used in the divide unit, which yields a latency of 67 clock cycles for a 64-b by 64-b division.
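
To make the divide latency concrete, the following Python sketch shows a textbook unsigned nonrestoring division that retires one quotient bit per iteration. It is an illustration of the general technique, not the chip's datapath; the function name and the unsigned-only treatment are assumptions. With `width=64` the loop runs 64 times, which is consistent with the quoted 67-cycle latency if a few cycles go to setup and the final correction (that breakdown is a guess, not stated in the text).

```python
def nonrestoring_divide(dividend, divisor, width=64):
    """Unsigned nonrestoring division, one quotient bit per iteration.

    Returns (quotient, remainder). The partial remainder `a` is allowed
    to go negative; it is corrected once after the loop, not every step.
    """
    mask = (1 << width) - 1
    q = dividend & mask
    m = divisor & mask
    if m == 0:
        raise ZeroDivisionError("division by zero")

    a = 0  # signed partial remainder
    for _ in range(width):
        # Shift the (a, q) pair left by one bit.
        msb = (q >> (width - 1)) & 1
        a = 2 * a + msb
        q = (q << 1) & mask
        # Subtract when the remainder is non-negative, add back otherwise.
        a = a - m if a >= 0 else a + m
        # The new quotient bit is 1 when the result is non-negative.
        if a >= 0:
            q |= 1

    if a < 0:  # single final correction step
        a += m
    return q, a
```

For example, `nonrestoring_divide(7, 2, width=4)` walks through four iterations and ends with quotient 3 and remainder 1.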

D. Floating-Point/Graphics Unit (FGU)

The floating-point and graphics unit (FGU) integrates five functional units and a 32-register by 64-b register file (five read ports and three write ports). The floating-point adder [5], multiplier, and divider [6] perform all FP operations, while the graphics adder and multiplier perform the graphics operations of the visual instruction set [4]. A maximum of two floating-point/graphics operations (FGops) and one FP load/store operation are executed in every cycle (plus another integer or branch instruction). All operations, except for divide and square-root, are fully pipelined. Divide and square-root operations complete out of order without inhibiting the concurrent execution of other FGops.

The two graphics units are both fully pipelined and perform operations on 8- or 16-b pixel components with 16- or 32-b intermediate results. The Graphics Adder performs single-cycle partitioned add and subtract, data alignment, merge, expand, and logical operations. Four 16-b adders are utilized, and a custom shifter is implemented for byte concatenation and variable byte-length shifting. The Graphics Multiplier performs three-cycle partitioned multiplication, compare, pack, and pixel distance operations. Four 8 x 16 multipliers are utilized, and a custom shifter is implemented. Eight 8-b pixel subtractions, absolute values, additions, and a final accumulation are performed for each pixel distance operation.

E. Load/Store Unit

The load/store unit (LSU) executes all instructions that transfer data between the memory hierarchy and the two register files. The LSU includes the data cache, load buffer, and store buffer, and is very closely coupled to the external cache (Fig. 3).

1) Data Cache: The D-cache is a 16 kB, direct-mapped cache. It has a 32 B (256 b) line size, with 16 B (128 b) sub-blocks. It is virtually indexed and physically tagged. The D-cache is nonblocking and operates using a write-through, no-write-allocate policy.
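
The D-cache geometry given above (16 kB, direct-mapped, 32 B lines, 16 B sub-blocks) fixes how a virtual address selects a location in the cache. The Python sketch below derives the fields from those sizes alone; the function name and the exact field layout are inferences from the stated sizes, not a layout taken from the paper.

```python
CACHE_BYTES = 16 * 1024      # 16 kB, direct-mapped
LINE_BYTES = 32              # line size
SUBBLOCK_BYTES = 16          # sub-block size
NUM_LINES = CACHE_BYTES // LINE_BYTES            # 512 lines
SUBBLOCKS_PER_LINE = LINE_BYTES // SUBBLOCK_BYTES  # 2 sub-blocks


def dcache_fields(vaddr):
    """Split a virtual address into (line index, sub-block, byte offset)."""
    offset = vaddr % SUBBLOCK_BYTES
    subblock = (vaddr // SUBBLOCK_BYTES) % SUBBLOCKS_PER_LINE
    index = (vaddr // LINE_BYTES) % NUM_LINES
    return index, subblock, offset


print(dcache_fields(0x1234))
```

Because the cache is direct-mapped, any two addresses that differ by a multiple of 16 kB select the same line, and the physical tag decides which of them is actually resident.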
Strict inclusion with respect to the E-cache is maintained, facilitating cache coherency. The D-cache data SRAM is single-ported and can support a 64-b load or a 64-b store every cycle. In the event of a D-cache miss, an entire sub-block (16 B) can be written in one clock. The D-cache tag SRAM has two ports: one read port (Port 1) and one read/write port (Port 2). Providing two ports allows a load or store to perform a tag look-up in parallel with the allocation for an older D-cache miss.

2) Load Buffer: The load buffer can eliminate stalls caused by D-cache misses, load-after-store hazards, and other conflicts. Nine entries were implemented to cover the additional 6-cycle latency of a D-cache miss/E-cache hit. A rate of one load E-cache hit per cycle can be sustained. Early compiler results indicate that more than 50% (statically) of the loops in SPECfp92 are amenable to software pipelining based on the E-cache latency. These loops represent an even larger component of the dynamic execution time. The load buffer is organized as a circular queue. Each load is enqueued with an indication of whether it hits or misses the D-cache, and this information is tracked for the lifetime of the operation, even in the presence of snoops. An age-based, associative comparison is performed in order to "adjust" the raw D-cache hit/miss indicator of the incoming load to account for allocations or victimizations that may be performed by pending loads to that D-cache line. Thus, the D-cache tags are only checked once.

Fig. 3. Load/store unit.

Fig. 4. Overlap between tag access and data write for coherent writes.

3) Store Buffer: The 8-entry Store Buffer (each entry accounts for a 64-b datum and its corresponding address) provides a temporary holding place for store operations until they can be "committed" and the D-cache and/or the E-cache is available. The E-cache update is a two-step process. First, the E-cache tags are checked for hit/miss. Then, the E-cache write occurs at some later time. The E-cache tag and data RAM accesses are decoupled so that a tag check can proceed in parallel with the E-cache data write of an older store, allowing us to maintain a throughput of one store per clock (Fig. 4). Additionally, consecutive stores to the same E-cache line (64 B) typically require only a single tag check, thus minimizing tag check transactions. Store compression combines the last two entries in the store buffer when they both write to the same 16 B block. Any number of stores can be combined into one transaction.
Hence, the number of data write transactions is minimized, an important concern since all stores must update the E-cache given that the D-cache is a write-through design.
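
Store compression as described above can be modeled in a few lines. The Python sketch below is a behavioral illustration only: the class name `StoreBuffer`, the dictionary-based entry format, and the choice to merge an incoming store into the youngest entry (rather than merging two entries already in the buffer) are simplifying assumptions, and tag checks and commit conditions are not modeled.

```python
from collections import deque

BLOCK_BYTES = 16  # compression granularity stated in the text


class StoreBuffer:
    """Toy model of an 8-entry store buffer with store compression."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        # Each entry: {"block": block number, "data": {address: value}}
        self.entries = deque()

    def push(self, addr, value):
        """Enqueue a store; return False if the buffer is full."""
        block = addr // BLOCK_BYTES
        if self.entries and self.entries[-1]["block"] == block:
            # Same 16 B block as the youngest entry: compress into it.
            self.entries[-1]["data"][addr] = value
            return True
        if len(self.entries) == self.capacity:
            return False
        self.entries.append({"block": block, "data": {addr: value}})
        return True

    def drain(self):
        """Retire the oldest entry as a single E-cache write transaction."""
        return self.entries.popleft() if self.entries else None


sb = StoreBuffer()
for addr in (0x100, 0x108, 0x110):
    sb.push(addr, addr)
print(len(sb.entries))
```

In this model the first two stores fall in the same 16 B block and leave the buffer as one write, which is the effect the text relies on to limit E-cache write traffic under a write-through D-cache.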

F. Memory Management Unit

The data MMU contains a fully associative, 64-entry translation lookaside buffer (TLB) which provides one virtual-to-physical address translation per cycle. Any combination of the four supported page sizes, 8 kB, 16 kB, 512 kB, and 4 MB, is allowed. A TLB miss is handled by software for simplicity and flexibility. On the other hand, simple hardware assist is provided for speed. Two read-only registers contain pointers to translation table entries from the translation storage buffer (TSB), defined as a simple, direct-mapped software cache. A separate set of eight global registers is also accessible as temporary storage. The TLB, as well as the RAM structures, uses a novel delayed-reset logic described in Section IV-B.
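
As a rough picture of what a pointer into a direct-mapped software cache looks like, the Python sketch below computes a TSB entry address from a virtual address and performs the lookup a software miss handler might do. The TSB size, the entry size, the 8 kB page assumption, and both function names are hypothetical; the text only states that the TSB is direct-mapped and that hardware supplies the pointers.

```python
def tsb_entry_pointer(tsb_base, vaddr, page_bytes=8 * 1024,
                      n_entries=512, entry_bytes=16):
    """Address of the direct-mapped TSB slot for vaddr (sizes are assumed)."""
    vpn = vaddr // page_bytes          # virtual page number
    index = vpn % n_entries            # direct-mapped: low VPN bits pick the slot
    return tsb_base + index * entry_bytes


def tlb_miss_handler(tsb, vaddr, page_bytes=8 * 1024):
    """Software handler sketch: return the physical page number or None.

    `tsb` is a list of (vpn_tag, physical_page) tuples or None for empty
    slots. None means a TSB miss, where a real handler would fall back to
    a full page-table walk.
    """
    vpn = vaddr // page_bytes
    entry = tsb[vpn % len(tsb)]
    if entry is not None and entry[0] == vpn:
        return entry[1]
    return None


tsb = [None] * 512
tsb[1] = (1, 0x7A)                     # map virtual page 1 to physical page 0x7A
print(tlb_miss_handler(tsb, 0x2000))
```

Precomputing the slot address in hardware saves the handler the shift-and-mask work at the start of every miss, which is presumably the speed benefit the text refers to.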

G. External Cache

The external cache (E-cache) is used to service misses from the I-cache and D-cache. It is physically addressed and physically tagged. The line size is 64 bytes. E-cache sizes from 512 kB to 4 MB are supported. A 512 kB cache can be built with five 32 k x 36, 6 ns synchronous SRAMs. E-cache data is protected by byte parity. The SRAM has an internal delayed write buffer to minimize the write-after-read (WAR) penalty. Writes to the SRAM core are delayed until the next write arrives. This buffer is fully bypassed inside the SRAM. The additional latency for an internal cache miss and E-cache hit is six cycles (three internal and three external cycles).

Reads can be completed every cycle, with data driven the second cycle after address and control signals. Coherent reads that hit the E-cache are represented in Fig. 5. UltraSPARC does not differentiate between burst reads and two consecutive reads; signals used for a single read are simply replicated for each subsequent read. The timing diagram shows three consecutive reads that hit the E-cache. The control signal (TOEL), the address for the tag read (ECAT), the control signal (DOEL), and the address for the data (ECAD) are shown to transition shortly after the rising edge of the clock. Two cycles later, the data for both the tag read and the data read is back at the pins of the CPU shortly before the next rising edge (meeting setup time and clock skew). Notice that the reads are fully pipelined and thus full throughput is achieved.

Writes can also be completed every cycle, with data driven the cycle after address and control. In Fig. 4, we show how the data for three previous writes (W0, W1, and W2) is written while three tag accesses (reads) are made for three younger stores (R3, R4, and R5). A dead cycle is created when switching direction on the data bus, to avoid overlapping drivers. The total write-after-read (WAR) penalty is two cycles. There is no read-after-write (RAW) penalty.
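
The SRAM count quoted above can be checked with a little arithmetic, and byte parity is easy to illustrate. The Python sketch below assumes that each 36-b SRAM word holds 32 data bits plus 4 byte-parity bits, that the fifth device holds tags, and that parity is even; none of these three points is stated in the text, and the function names are illustrative.

```python
SRAM_WORDS = 32 * 1024        # 32 k words per device
SRAM_WIDTH_BITS = 36          # word width per device
DATA_BITS_PER_WORD = 32       # assumption: remaining 4 bits are byte parity


def data_srams_needed(cache_bytes):
    """Number of 32 k x 36 devices needed to hold the cache data."""
    data_bytes_per_sram = SRAM_WORDS * DATA_BITS_PER_WORD // 8  # 128 kB
    return -(-cache_bytes // data_bytes_per_sram)               # ceiling division


def byte_parity(data):
    """Even-parity bit for each byte of `data` (polarity is an assumption)."""
    return [bin(b).count("1") % 2 for b in data]


n = data_srams_needed(512 * 1024)
print(n, n * SRAM_WIDTH_BITS)
print(byte_parity(bytes([0x00, 0x01, 0x03, 0xFF])))
```

Under these assumptions, four data devices side by side give a 144-b wide array (128 data bits plus 16 parity bits), leaving the fifth device for tags.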

Fig. 5. Coherent reads hitting the external cache.

H. System Interface

The 144-b E-cache data bus is also connected to the two identical UltraSPARC data buffer (UDB) chips, which connect the processor to the system (Fig. 6). The UDBs serve to electrically isolate the interaction between the CPU and E-cache from the system bus and operate at the system clock frequency, which can be either 1/2 or 1/3 of the processor clock. Collectively, the UDBs have FIFOs for eight 16-byte noncacheable stores, one 64-byte read buffer, two 64-byte write buffers, and a 64-byte copy-back buffer. The large number of outstanding 16-byte stores is useful for maintaining peak store bandwidth to a frame buffer.

System transactions are packet based, in that address and data transfers are disjoint, noninterfering events. A 36-b address bus is used to deliver two-cycle request packets that begin a transaction. This bus can be shared by up to three other masters, in addition to a centralized system controller. Arbitration is distributed. Each master on the address bus has the same logic and sees all requests for the bus. There are five potential requests: four potential masters plus one from a high-priority system controller. Arbitration is round-robin with a hysteresis effect to reduce latency for the last master. This helps reduce latency for bursts of transactions from the same master. There is also a special parking mode for uniprocessors that typically reduces arbitration latency to zero by keeping UltraSPARC enabled onto the address bus between transactions.

III. CIRCUIT IMPLEMENTATION

A. Simulation Corners and Speedup Ratios

Large CPU projects are expected to be taped out many times, using process technologies that might differ from the one the chip was originally designed for. This happens either during the productization of the chip or during future compaction and shrink projects that leverage the work done on the original device. Therefore, it is of paramount importance to ensure that the whole chip will respond in a uniform way to future process variations. Circuits that use an exotic design style and function only within a very narrow design window must be identified and eliminated. In order to ensure this, the UltraSPARC circuit methodology introduced the concept of speedup ratio. Speedup ratio is defined as the ratio between the speed of a circuit at a certain process and environment condition and its speed at a reference nominal corner.

At first, twelve process and environment conditions, in which the original design should function to spec, were defined. Then, an extensive set of circuits, implemented in various design styles that are agreed upon as robust, were simulated, and their performance across all process and environment conditions was recorded and compared to a reference nominal corner. The average ratio by which all these circuits sped up or slowed down in comparison with the nominal corner was defined as the speedup ratio for the corresponding corner and was set as a requirement for any circuit in the chip. Every circuit on the UltraSPARC project was required to change its performance across all process and environmental conditions in the same way. Circuits that deviated by more than 10% from the required ratio were rejected. As an example, NMOS-only pass gates were excluded from the design (with the exception of memory arrays, which were extensively simulated) due to their poor relative performance at low supply voltages. Fig. 7 shows the deviations of UltraSPARC's circuits from the required speedup ratios across all simulation corners. The graph represents the entire design.

Fig. 6. UltraSPARC interfaces.

Fig. 7. Speedup ratio distribution.

B. Delayed Reset Logic

Temporal pipelines have long been used to increase the throughput of combinational logic by breaking it into stages [13]. Self-timed pipelines [14], [15] allow these stages to be controlled locally instead of using a common clock, which can be advantageous because they can achieve a higher effective clock rate. The local control precharges each stage again as soon as the following stage has taken the data, so self-timed pipelines have sometimes been called postcharge logic [7] or self-resetting logic [8] to emphasize this aspect. While the logic function blocks used in self-timed pipelines often resemble precharged domino logic [16] blocks, the self-timed pipeline control allows the blocks to be used with the advantages of a true temporal pipeline, rather than waiting for the logic to propagate through all of the stages before precharging them all at the same time, as for ordinary domino logic. Moreover, the local control of self-timed pipelines does not present a large loading on a single clock as with ordinary domino logic, and can sometimes save power because of selective discharging/precharging.

We used a self-timed pipeline for the RAM structures to obtain these advantages, and chose a particular style we call "delayed-reset logic" [12]. The reset is controlled by the previous stage and is propagated down the logic path. The recovery period is locally timed in each stage. The active logic state is static for an indefinite period, waiting for the reset from the previous stage. The practical benefits include the potential to stop the logic in a state for debug. In addition, the reset-controlling timing element (pulse generator) at the head of the pipeline can be more complex, for example compensating for temperature. Adjustment of circuit timing requirements is assisted by using independent controls at the first stage to control overlaps at the convergence points of critical races.

Alternating n stages and p stages are used. Each stage has a reset input and output (rst-in, rst-out) used to propagate the reset at dynamic logic speeds from stage to stage. The n stage quiescent condition begins with input nodes in-a and in-b in the low state and keeper transistors (k in Fig. 8) holding nodes out and rst-out in the high state. Node r is in the high state, keeping transistor p1 turned off.
Input rst-in will be in the low state if originating from the previous stage, or will be high if reset is generated from the input signals (see inverter in block diagram of Fig. 8). The forward propagation begins with the gates of transistors n1 and n2 transitioning high (NAND function), driving node out low and activating the next p stage. The inverter (inv) feedback from node out sets node f high and turns on transistor n4, which enables the gate to be reset. The inverter (inv) detects use of the gate in a forward propagation, allowing selective precharge of the gate, but does not control reset timing. The gate is static and holds the logic state until the reset period begins.

The reset period begins in the pulse generator when the negative edge (in-b) activates rst-in through an inverter, initiating reset propagation from stage to stage. When the reset input, rst-in, goes high, node r will be pulled low, since transistor n4 is enabling the reset stage. This stage, consisting of n3 and n4 with p2 pullup, is dynamic, resulting in propagation at the same rate as the forward propagation. The recovery period, to get ready for a new input, occurs when the out node is sufficiently high to trip inverter (inv) to the low state, returning node r to the high state through transistor p2 at the trip point of p2 and n4, which form a ratioed inverter in this state. With nodes r and f back in the quiescent state, the gate is ready to receive new inputs. After the recovery period ends, a new piece of data may be entered into a string of stages acting as a temporal pipeline before the previous data has propagated to the last stage.

A critical design constraint is to time the local recovery, which is controlled by the fanout (8 < fanout < 12), the trip point of inverter (inv), and the time that node out takes returning to the high state. In the case of local reset generation (used periodically to allow single-wire connections to remote gates), a requirement is to turn off n3 before n4 turns on in the forward propagation period. This is easily controlled by fanout in the opposite transition of inverter (inv). The pulse generator can be a domino stage with a series n gate driven by a delayed clock. In this way, the logical active state can be controlled in time duration, and the time in quiescent precharge is determined by the reset time of the intrinsic stages. This delayed-reset style is not strictly pulse driven, in that the logical active period is static, allowing the delay to be arbitrarily large; i.e., interrupting the delayed clock would hold the logical state.

Fig. 8. Delayed reset logic.

C. Gcomp

The grouping stage in a superscalar processor is typically critical, given that the number of instructions to be dispatched is dependent on many dynamic events. For instance, data dependencies have to be checked against all previous instructions still in the multiple pipes, and resource allocation within the group and across groups for nonpipelined units has to be done. To make matters even more difficult, aggressive grouping, which merges nondispatched instructions from the previous group with the next group, involves a single-cycle feedback signal indicating how many instructions were just dispatched. For UltraSPARC, predecoding from previous stages and custom logic were used to fit the grouping logic within the cycle time.
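
As a rough illustration of the dependency part of grouping, the Python sketch below decides how many of up to four in-order candidates can dispatch together when no candidate may read or overwrite a register written by an older instruction in the same group. The instruction representation, the function name, and the rule set are simplifying assumptions: the logic described above also checks resources and instructions still in the pipes, and it is built from hardware comparators, not software.

```python
from typing import NamedTuple, Optional, Sequence


class Instr(NamedTuple):
    dest: Optional[int]        # destination register, or None
    srcs: Sequence[int]        # source registers


def group_size(candidates, max_issue=4):
    """Count how many leading candidates can dispatch in one group.

    Dispatch is in order, so the group ends at the first candidate that
    depends on an older instruction in the same group.
    """
    written = set()
    count = 0
    for ins in candidates[:max_issue]:
        # Each membership test stands in for a bank of equality comparators.
        if any(s in written for s in ins.srcs):
            break              # read-after-write inside the group
        if ins.dest is not None and ins.dest in written:
            break              # write-after-write inside the group
        if ins.dest is not None:
            written.add(ins.dest)
        count += 1
    return count


window = [Instr(1, (2, 3)), Instr(4, (5,)), Instr(6, (1,)), Instr(7, ())]
print(group_size(window))
```

The third candidate reads register 1, which the first candidate writes, so the group closes after two instructions and the remaining candidates are merged with the next group.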

Fig. 9. Cascode NOR gate.

A main building block in the Gcomp custom circuitry was the wide cascode NOR structure [10], which was used in the static comparators' circuitry (Fig. 9). In normal operation mode, node "P" is at the zero-volt level. If any one of the XOR outputs IN1-INn is set HIGH, node "I3" is pulled toward the supply rail through the "N" devices M1-Mn, while the cascode inverter "Inv2" pulls down node "I7," which shuts off transistor "Mp" and disconnects node "I5" from "I3" (transistor "Mp" shuts off whenever the voltage difference between nodes "I3" and "I7" is smaller than its threshold voltage). Transistor "Mp1" then charges node "I5" all the way to the supply voltage, while transistors T1 to Tn pull node "Out" to a low level. When all the XOR outputs are LOW, transistor "N1," which is a weak device, pulls node "I3" down. When the voltage level on node "I3" is reduced, inverter "Inv2" starts charging node "I7." The charging of node "I7," coupled with the discharge of node "I3," turns on transistor "Mp." Whenever this transistor is conducting, node "I5" charge-shares with node "I3" and rapidly flips inverter "Inv1" to produce a HIGH logic level at the output. Inverter "Inv3" provides a shut-down mode for these comparators. Whenever the comparators are not needed, node "P" is pulled HIGH and the whole structure does not consume any current.

D. Integer Register File

The integer register file implements eight overlapped register windows and contains 160 64-b registers. Thirty-two of the 160 are four sets of eight GLOBAL registers, which are not part of any register window. Sixty-four are eight sets of eight LOCAL registers, which are unique to each register window. Sixty-four are eight sets of eight IN/OUT registers, whose OUT registers overlap with the IN registers of the adjacent window. That is, the OUT registers of the current window (CWP) are addressable as the IN registers of the adjacent window (CWP + 1). The register file supports seven read ports and three write ports for three instructions; one of the read ports is dedicated to store instructions. Single-ended read bit lines are used instead of differential bit lines to conserve area. Differential write bit lines are used for robustness.

In any given cycle, only two register windows need to be accessed: one active window for all reads (occurring in the G-stage of the integer pipeline in Fig. 2), and a separate active window for all writes (occurring in the W-stage of the integer pipeline in Fig. 2). The new register file cell design allows sharing of one set of read and write transistors across eight storage bits (one for each of the eight windows), thus saving transistor count and area.

The local register cell implementing eight nonoverlapping windows is shown in Fig. 10. At the top of the figure, the seven signals [rps0, rps1, ..., rps6] represent the seven read port select lines. These lines select register N and travel across the full width of the register (64 b). The decoded read current window pointers are represented by eight signals [rcwp0, rcwp1, ..., rcwp7]. Only one of these is active at any time. They select which window should provide register N. Cross-coupled inverters are used to store each bit. A local inverter isolates each bit cell to avoid charge sharing. The data read from the individual bit cell is then buffered through INV2, which contains a large pulldown transistor. The seven single-ended bit lines are precharged, and therefore read access time is determined by read bit line pulldown speed. The three write port select lines are [wps0, wps1, wps2]. Differential write is implemented, so three pairs of write bit lines are needed [wdt0/wdc0, wdt1/wdc1, wdt2/wdc2]. As mentioned earlier, a separate window pointer is provided for the writes. Only one of the eight signals [wcwp0, wcwp1, ..., wcwp7] is active at any time.

Fig. 10. Local register cell.

The sense amplifier used in the single-ended read bit line sensing is shown in Fig. 11. At the beginning of the read cycle, the read bit lines are precharged and then conditionally discharged by the newly selected register bit cell. Since the bit line is highly capacitive and the read NMOS pass transistor is relatively small, the bit line falls slowly. To maximize performance without compromising too much noise margin,

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 11, NOVEMBER 1995

1234

precharge

metal1

met&

metal3

hl1

b D

cs ate

memory

bt&

ROWS

Fig. 11. Single ended sense amplifier.

vss

V ........... +=* ................................................... I

Fig. 13. Power distribution network for control blocks.

Minimizing chip area implies using as little metal area as possible * Minimizing IR drop requires increased usage of the metal system. .................... As a result of these constraints, consideration must be .......................... given to power distribution at all phases of the design. On .................... UltraSPARC an automated design methodology is employed to implement a complete full-chip power distribution network [Ill. As a result, rapid identification of exact layout locations, with potential electromigration or excessive voltage drop Fig. 12. Sense amplifier waveforms. problems, was obtained. In order to minimize routing overhead, sections of the power the bit line is precharged to 600 mV above the trip point of distribution network are incorporated into the standard cell and the sense amplifier. data-path libraries. In the standard cell library Vdd and lines Fig. 12 shows the waveforms associated with the sense are distributed in Metal1 horizontally across each cell, so that amplifier. When the bit line drops below the sense amplifier the power rails of adjacent cells abut. In control blocks the trip point, the cascode transistor, N1, turns on and its drain, synthesis tool places Metal3 Vdd and V,,lines in parallel with bitout, drops very rapidly from VDD to the level of its source Metall, in the channels formed between adjacent rows. Metal2 due to charge sharing. This node drives a two inverter chain, connects the Metal1 and Metal3 through vertical connections INV3 and INV4, whose thresholds are skewed to favor the at the block level, see Fig. 13. falling output transition. Likewise, for the datapath library, Metalf and Metal2 Vdd This circuit will operate with maximum performance and and V,, lines are laid out inside the library cells; the latter no noise margin if the bitline is precharged to a level just horizontally and the former in both directions. 
Adjacent cells approaching the threshold of INV1, where the output of INVI, in a datapath row are abutting, and form one continuous power Csgate, equals the bitline voltage plus a threshold voltage, Vi, rail in Metal2. Metal3 power lines are routed vertically by the of N1. A slight reduction of the bitline voltage will turn on datapath compiler. Metal4 is not used for routing either inside N1 causing its drain to fall rapidly to its source level. In order the cells or inside the blocks composed of cells. Instead, this to introduce some noise margin, the bitline must be precharged interconnect layer is reserved for routing of global signals, high enough to cause the output of INV1, csgate, to drop below such as power, clock, and critical signals. the level of the bit line. To ensure this happens, N 2 and INV2 Global routing of power from the pads to individual funcalso provide a precharge path which will continue to precharge tional units is accomplished through the use of the 4th metal the bit line even after INVl has shut off N1 because INV2 layer. Metal4 power lines are routed at regular intervals across has a higher threshold than INV 1. The relative sizes of the two the entire length of the chip. A mesh is formed between inverters and associated transistors can be adjusted to establish these lines and the Metal3 lines that are connected in the desired noise margin. The clamp on the bitline is a leaker to blocks. This approach retains floorplanning flexibility without prevent the read bit-line from “drifting” up when reading a compromising the strict IR and electromigration requirements. logic one. IR drop and immunity to electromigration are simultaneously optimized at the cell, block and chip levels. This is done Iv. POWER NETWORKIMPLEMENTATION AND VERIFICATION through HSPICE simulations and PGRID analysis. 
The efficient distribution of the Vdd and networks on PGRID is based on a 3-D Poisson equation solver which large microprocessors is a complex problem, usually con- calculates current densities and potentials in a power network. strained by the following opposing requirements: It has the ability to graphically highlight possible electromigra................................

0

v,,

v,,

LEV er al.: A 64-b MICROPROCESSOR WITH MULTIMEDIA SUPPORT

1235

REFERENCES

Fig. 14. Chip microphotograph.

tion and IR violations for those geometries in a layout which do not meet the user-specified criteria. Performing analysis on the layout directly enables the user to detect violations down to the contact level. This level of resolution could not be attained through a schematic based approach. Power grid analysis is used at several levels in the design process: essentially whenever layout is available, including data path and control block cell development. The general flow is as follows: Determine current densities and voltage drops along each segment of the layout of the cell, block or the whole chip. Compare current and voltage values against user specified limits. Produce graphical error maps of all locations which violate electromigration or IR constraints. These error maps, which also contain the value of the violation, may then be overlaid onto the block layout. In this way, the exact locations of any violations can be immediately identified.
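The three-step flow above can be illustrated with a deliberately simplified sketch. It is not PGRID and does not solve a 3-D Poisson equation; it assumes a single power rail fed from one end and modeled as series resistors, and every name, unit, and number in it (`Segment`, `analyze_rail`, the limits, the sample rail) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of a power rail (hypothetical model, not PGRID's)."""
    name: str
    resistance_ohm: float   # series resistance of this stretch of metal
    width_um: float         # drawn width, used for a crude density estimate
    tap_current_ma: float   # load current drawn at the far end of the segment


def analyze_rail(segments, max_drop_mv, max_density_ma_per_um):
    """Return (segment, kind, value) tuples for every limit violation.

    Segments are listed from the pad outward. Step 1 computes per-segment
    current and cumulative IR drop, step 2 compares them with the limits,
    and step 3 collects the violations with their values.
    """
    # A segment carries its own tap current plus all taps farther from the pad.
    running = 0.0
    currents = []
    for seg in reversed(segments):
        running += seg.tap_current_ma
        currents.append(running)
    currents.reverse()

    violations = []
    drop_mv = 0.0
    for seg, current_ma in zip(segments, currents):
        drop_mv += current_ma * seg.resistance_ohm   # mA * ohm = mV
        density = current_ma / seg.width_um          # mA per um of width
        if density > max_density_ma_per_um:
            violations.append((seg.name, "EM", density))
        if drop_mv > max_drop_mv:
            violations.append((seg.name, "IR", drop_mv))
    return violations


# Hypothetical three-segment rail, narrowing away from the pad.
rail = [
    Segment("s0", 0.5, 4.0, 10.0),
    Segment("s1", 0.5, 2.0, 10.0),
    Segment("s2", 1.0, 1.0, 10.0),
]
violations = analyze_rail(rail, max_drop_mv=30.0, max_density_ma_per_um=8.0)
for name, kind, value in violations:
    print(f"{name}: {kind} violation, value {value:.1f}")
```

In this toy model the list of tuples plays the role of the error map: each entry names the offending segment and carries the value of the violation, which a layout tool could then overlay on the block.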

V. SUMMARY

A 167 MHz superscalar processor capable of sustaining an execution rate of four instructions per cycle, even when accessing data from a large external cache, has been described. A strict design style combined with thorough circuit simulation and analysis at all process corners resulted in working first silicon at the predicted speed of 6 ns cycle time. Delayed reset logic was used for key components of the microarchitecture where extra latency would have hurt performance. Special custom comparators were used to speed up a typically critical stage in a superscalar processor, namely the grouping stage. In this way, generalized intragroup and intergroup instruction dispatch was possible in a single cycle.

REFERENCES

[1] M. Tremblay, D. Greenley, and K. Normoyle, “The design of the micro-architecture of UltraSPARC,” to appear in Proc. IEEE, Nov. 1995.
[2] B. Calder and D. Grunwald, “Next cache line and set prediction,” in Proc. 22nd Ann. Int. Symp. Computer Architecture, June 1995, vol. ISCA-22, pp. 287-297.
[3] J. E. Smith, “A study of branch prediction strategies,” in Proc. 8th Ann. Int. Symp. Computer Architecture, 1981, pp. 135-148.
[4] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, and G. B. Zyner, “The visual instruction set (VIS) in UltraSPARC,” in Proc. 1995 Compcon Conf., Mar. 1995.
[5] R. K. Yu and G. B. Zyner, “167 MHz radix-4 floating point multiplier,” in Proc. 12th Symp. Computer Arithmetic, to be published July 1995.
[6] J. A. Prabhu and G. B. Zyner, “167 MHz radix-8 divide and square root using overlapped radix-2 stages,” in Proc. 12th Symp. Computer Arithmetic, to be published July 1995.
[7] R. Proebsting, “Speed enhancement technique for CMOS circuits,” U.S. Patent 4,985,643, Jan. 15, 1991.
[8] T. I. Chappel et al., “A 2 ns cycling, 4 ns access 512 kb CMOS ECL SRAM,” in ISSCC Dig. Tech. Papers, Feb. 1991, pp. 50-51.
[9] A. Charnas et al., “A 64 b microprocessor with multimedia support,” in ISSCC Dig. Tech. Papers, Feb. 1995, pp. 178-179.
[10] L. A. Lev, “Fast static cascode logic gate,” U.S. Patent 5,438,283.
[11] A. Dalal et al., “Design of an efficient power distribution network for the UltraSPARC-I™ microprocessor,” to be published in ICCD ’95, Conf. Proc.
[12] D. Wendell, “Reset logic circuit and method,” U.S. Patent 5,430,399.
[13] S. Anderson, J. Earle, R. Goldschmidt, and D. Powers, “The IBM/360 model 91 floating point execution unit,” IBM J. Res. Dev., pp. 34-53, Jan. 1967.
[14] I. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, pp. 720-738, 1989.
[15] T. Williams, “Performance of iterative computation in self-timed rings,” Kluwer J. VLSI Signal Processing, vol. 7, pp. 17-31, Feb. 1994.
[16] R. H. Krambeck, C. M. Lee, and H.-F. S. Law, “High-speed compact circuits with CMOS,” IEEE J. Solid-State Circuits, vol. 17, pp. 614-619, June 1982.

Lavi A. Lev (M’89) received the B.S. degree in electrical engineering from the “Technion” Israel Institute of Technology, Haifa, Israel, in 1984. He joined Sun Microsystems Inc., Mountain View, CA, in 1992, where he has been the lead circuit designer of the UltraSPARC processor. Prior to working at Sun, he was a lead circuit designer of the first Pentium™ microprocessor at Intel Corporation, Santa Clara, CA, SparcLite™ at Fujitsu Corporation, San Jose, CA, and NSCG16, NS32532 processors at National Semiconductor Corporation, Santa Clara, CA. He is the author of numerous papers and patents related to VLSI circuit design and is an advisory board member of Cadence Design Systems, Inc., San Jose, CA.

Andy Charnas received the B.A. degree in history and the B.S. degree in electrical engineering from the University of Pennsylvania, Philadelphia, in 1976 and 1981, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, in 1982. From 1981 to 1984, he worked at AT&T Bell Laboratories, Murray Hill, NJ, in the microelectronics group, where he was a designer on the Bellmac series of 32 b CMOS microprocessors. In 1984, he joined Intel Corporation, Santa Clara, CA, where he was a lead circuit designer on the 80386 and 80486 projects. In 1987, he joined Silicon Design Laboratories, San Jose, CA, working on custom IC designs and design tools. In 1988, he joined Sun Microsystems, Mountain View, CA, where he has been doing circuit design, methodology, and design management on SPARC CPU’s. He was Manager of the custom circuit design group on the UltraSPARC processor and is currently the Project Manager of the UltraSPARC II processor.

Marc Tremblay (S’90-M’91) received the B.S. degree in physics from Laval University, Quebec, PQ, Canada, and the M.S. and Ph.D. degrees in computer science from the University of California, Los Angeles. He is a Computer Architect involved in the research and development of high-performance processors at Sun Microsystems, Mountain View, CA. He has been with the processor group at Sun since 1991. During that time, his main contributions have focused on the microarchitecture definition and performance evaluation of the 64-b UltraSPARC processor, for which he has filed five patents. His current work relates to improving the synergy between the processor and the compiler and to including extensive multimedia capabilities directly onto the processor.

Alexander R. Dalal (M’90) received the B.S. degree in electrical engineering from the University of Notre Dame, South Bend, IN, and the M.S. degree in computer engineering from North Carolina State University, Raleigh, in 1988 and 1990, respectively. From 1990 to 1992, he was with Intel Corporation, where he worked on the circuit design of the Pentium™ microprocessor. Since 1992, he has been with Sun Microsystems, Mountain View, CA, where he is a circuit designer on the UltraSPARC microprocessor. On UltraSPARC, he was responsible for the design and verification of the global power distribution network. His research interests include low-power CMOS logic styles for microprocessor design and very low skew distribution schemes for clock networks. He has authored and coauthored several papers on microprocessor power distribution networks, yield prediction CAD software, and clock network extraction methodologies.

Bruce A. Frederick (M’93) received the B.S. degree in electrical engineering from the University of California at Berkeley in 1972. Since 1972, he has been working in the semiconductor industry, specializing in high speed memory design and process development. He has worked on DRAM, SRAM, nonvolatile memory devices, and various process technologies. Since 1992, he has been with the Sparc Technology Division of Sun Microsystems Inc., Mountain View, CA, where he has been responsible for the development of the embedded cache memories for UltraSPARC processors. He is currently working on the next generation UltraSPARC processors.

Chakra R. Srivatsa (S’86-M’87) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Madras, India, and the M.S. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1985 and 1987, respectively. From 1987 to 1991, he worked at GigaBit Logic, Thousand Oaks, CA, on GaAs ASIC’s and custom chips. In 1991, he joined the SPARC Technology Business of Sun Microsystems, Mountain View, CA, and has worked on the UltraSPARC microprocessor. He now manages a design team engaged in the UltraSPARC II processor.

David Greenhill received the B.Sc. degree in physics from Imperial College, London, U.K. From 1986 to 1992, he worked at Inmos Ltd., U.K., on Transputer and graphics chip designs. In 1992, he joined Sun Microsystems, Mountain View, CA, to work on UltraSPARC. Currently, he is Megacell Manager. He has also worked on datapath methodology and the memory management unit design.

Dennis L. Wendell (S’80-M’82) received the B.S.E.E. degree from the University of Nebraska, Lincoln, in 1979. He worked on the fully associative TLB used on the UltraSPARC as well as some of the logic styles used, having previously specialized in memory design of stand-alone SRAM’s.

Duy Dinh Pham (M’94) received the B.S. degree in electrical engineering and computer science from the University of California at Berkeley in 1986. He then joined BellCore in New Jersey to work on the design of the crosspoint switching chip used in telephone switching networks. He later worked at LSI Logic Corporation, where he focused on the design of memory megacells and methodology development of a memory compiler for cell-based ASIC’s. Prior to that, he worked at Integrated Information Technology Company, designing an X86-compatible microprocessor. Since 1992, he has been a member of the design team of the UltraSPARC processor at Sparc Technology Group.

Eric Anderson received the B.S. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge, and the M.S. degree in electrical engineering from the University of California at Berkeley in 1989 and 1991, respectively. He joined Sun Microsystems, Mountain View, CA, in 1993, where he has worked as a cache memory circuit designer.

Hemraj K. Hingarh (S’66-M’67) received the M.S.E.E. degree from the University of California at Berkeley in 1968. He joined Sun Microsystems Inc., Mountain View, CA, in 1992. Since that time he has been Project Manager for development of the UltraSPARC microprocessor family. Prior to working at Sun Microsystems, he was Vice President of Engineering at Oasic Technology and worked for 18 years at National/Fairchild Semiconductor Corporation on various projects as product line director. He has published 27 papers and has been granted 15 patents.

Inayat Razzack was born in Dacca, East Pakistan (Bangladesh) in 1961. He received the B.S.E.E. and M.S.E.E. degrees from Louisiana State University, Baton Rouge, in 1983 and 1985, respectively. In 1988, he joined National Semiconductor Corporation, Santa Clara, CA, as a circuit design engineer involved in floorplanning, symbolic layout, library development, and physical design verification. In 1993, he joined Sun Microsystems, Mountain View, CA, as a physical design verification and technology files engineer for the UltraSPARC. He is involved in technology files development and physical design verification for a number of projects at Sun.

James M. Kaku received the B.S.E.E. degree from San Jose State University, San Jose, CA, in 1980. Since 1980, he has been involved in the design of various NMOS and CMOS logic and memory chips. He is currently working at Sun Microsystems, Mountain View, CA, in the development and design of SRAM’s for microprocessors.

Ken Shin received the B.S.E.E. degree from Rensselaer Polytechnic Institute, Troy, NY, in 1986. Since 1987, he has been engaged mostly in memory circuit design. He is currently a Member of the Technical Staff at Sun Microsystems, Mountain View, CA, where he was the designer of the Integer Register File for the UltraSPARC. Presently, he is working on the instruction buffer for UltraSPARC II.

Richard L. Bartolotti was born in July 1949. He received the B.S.E.E. degree from the University of California, San Francisco, in 1972. In 1973, he joined Advanced Micro Devices as a bipolar ECL LSI circuit design engineer. In 1985, he joined Vitesse Electronic Inc., working on GaAs LSI design. In 1987, he joined Integrated Device Technology, where he worked on CMOS custom design. In 1992, he joined Sun Microsystems, Mountain View, CA, to work on library design and methodology for the UltraSPARC processor.

Robert K. Yu (S’89-M’92) received the B.S. and M.S. degrees in electrical engineering from the University of California at Berkeley in 1985 and 1990, respectively. He joined Advanced Micro Devices in 1985, where he worked in ECL circuit design. In 1990, he joined Sun Microsystems, Mountain View, CA, where he designed floating point multipliers in GaAs and CMOS technologies. His interests include VLSI design, computer architecture, computer arithmetic, signal processing, and VLSI CAD.

Marc E. Levitt (S’85-M’90) received the B.S. degree in computer engineering from Lehigh University, Bethlehem, PA, and the M.S. and Ph.D. degrees in electrical engineering from the University of Illinois, Urbana-Champaign, in 1986, 1989, and 1990, respectively. Presently, he is a Staff Engineer and manages the SPARC-wide testability group in the SPARC Technology Business of Sun Microsystems Inc., Mountain View, CA. He is involved in the designing, testing, and debugging of ASIC’s, SPARC processors, and SPARC-based computer systems and was the testability architect for UltraSPARC. Prior to this, he worked as a Research Assistant in the Coordinated Science Laboratory, University of Illinois, and also taught in the Department of Electrical Engineering. He is a coauthor of the “BiCMOS Testing” chapter in the book BiCMOS Technology and Applications (Norwell, MA: Kluwer, 1993). He is also the author of the chapter “Machine Learning for Foreign Exchange Trading” in the book Neural Networks in the Capital Markets (New York: Wiley, 1994). He has over 20 conference and journal publications and numerous patents in the area of integrated circuit testing. Dr. Levitt is a member of Tau Beta Pi and Eta Kappa Nu.

Michael Allen (M’92) received the B.S.E.E. degree from the University of California at Berkeley in 1975. He worked at Fairchild Semiconductor, Monolithic Memories, Advanced Micro Devices, and Synergy Semiconductor doing bipolar ECL LSI circuit design. He joined Sun Microsystems, Mountain View, CA, in 1991, working in the CMOS custom circuit design and development group. He has been awarded six patents, with two more currently applied for.

Philip A. Ferolito received the B.S. degree in electrical engineering from the University of Rochester, Rochester, NY, in 1990. He worked for two years at Intel Corporation, Santa Clara, CA, as a Circuit Designer on low power, high integration chips. In 1992, he joined Sun Microsystems, Mountain View, CA, where he has worked on RTL and physical design for the prefetch and dispatch unit of UltraSPARC.

Ronald J. Melanson (M’88) received the B.S.E.E. and M.S.E.E. degrees from Northeastern University, Boston, MA, in 1970 and 1972, respectively. From 1972 to 1991, he worked at Digital Equipment Corporation on the development of several VAX and PDP-10 systems. He joined Sun Microsystems Inc., Mountain View, CA, in 1991, where he has been active in the development of several SPARC microprocessors, with direct responsibility for the definition and management of semiconductor technology development.

Shailesh I. Shah (M’91) received the B.E. degree in electronics engineering from Baroda, India, and the M.S. degree from the Ohio State University, Columbus, in 1989 and 1992, respectively. He is a Member of the Technical Staff with Sun Microsystems, Mountain View, CA, where he has been engaged in the design of CMOS circuits and logic.

Sophie Nguyen received the B.S. degree in electrical and computer engineering from the University of California, Davis, and the M.S.E.E. degree from the University of California at Berkeley. From 1985 to 1987, she was a Circuit Design Engineer in the Advanced Product Development Group at LSI Logic Corporation, involved in CMOS megacell design. From 1987 to 1989, she was a Senior MOS Memory Design Engineer in the Research and Development Group at Performance Semiconductor, designing high speed cache static RAMs. Since 1989, she has been a Member of the Technical Staff at Sun Microsystems, Mountain View, CA, working as a Senior Circuit Designer involved in the design of BiCMOS cache controller datapaths and cell libraries. Recently, she has been working on the 64-b microprocessor in the area of datapath and cell library development.

Sundari S. Mitra (M’88) received the B.S.E.E. degree from Baroda, India, and the M.S.E.E. degree from the University of Illinois, Urbana-Champaign, in 1986 and 1988, respectively. She is currently a circuit design manager at Sun Microsystems, Mountain View, CA, responsible for circuit work on UltraSPARC. Previously, she worked at Intel Corporation on a number of 86 chips. She has several patents and has published papers on clock design and distribution, power design and distribution, and PLA designs.

Vinita Reddy (M’85) was born in Delhi, India. She received the M.S. degree in physics from the University of Delhi, Delhi, India, and the M.S.E.E. degree from the University of Wisconsin, Madison, in 1984. In 1984, she joined Philips Semiconductor, where she worked on the design of memories and programmable logic devices. She joined Sun Microsystems, Mountain View, CA, in June 1992, where she has been engaged in the design and development of high speed circuits for high performance microprocessors.

Vidyasagar Ganesan (M’91) received the B.E. degree from the University of Bombay, Bombay, India, and the M.S. degree in electrical engineering from the University of Hawaii, Honolulu, in 1988 and 1992, respectively. In June 1993, he joined Sun Microsystems, Mountain View, CA, and was involved in the design of adders and custom datapaths.

Willem J. de Lange received the B.S.E.E. and M.S.E.E. degrees from Twente University of Technology, Enschede, The Netherlands, in 1968 and 1971, respectively. In 1970, he joined Philips Research Laboratories, Waalre, The Netherlands, where he was engaged in research on holography. In 1971, he joined the Royal Dutch Army as a Lieutenant of the Technical Staff, where he was in charge of coordinating research projects at the Dutch National Research Laboratories. In 1973, he joined Siemens A.G. as a Design Engineer and later as Project Leader involved in custom integrated circuits for telecommunication and consumer electronics. In 1983, he was assigned to Intel Corporation, Hillsboro, OR, and was involved in the design of microprocessors, including the 80960. In 1987, he joined Intergraph Corporation, Palo Alto, CA, and participated in the design of several generations of the Clipper microprocessor. Since 1993, he has been with Sun Microsystems Inc., Mountain View, CA, designing high-speed custom circuits and managing the physical verification of the UltraSPARC microprocessor.
