A 32b 66 MHz 1.8 W microprocessor

May 28, 2017 | Author: Stephen Mahin | Category: Low Power, High performance, Solid State Devices and Circuits, Test Coverage

ISSCC94 / SESSION 12 / MICROPROCESSORS / PAPER TP 12.4

TP 12.4: A 32b 66MHz 1.8W Microprocessor

Roland Bechade, Roy Flaker, Bruce Kauffmann, Steve Kenyon, Charles London, Steve Mahin, Kim Nguyen, Dac Pham, Alan Roberts, Sebastian Ventrone, Tim VonReyn

IBM Microelectronics, Essex Junction, VT

A high-performance 32b microprocessor with an on-chip cache and low power dissipation in functional and standby modes delivers 26MIPS and dissipates only 1.77W in functional mode. The features include a 16b data interface and 24b address bus, a 16kB four-way set associative cache, and a clock doubler for operation at 33/66MHz. The chip is 9x7.7mm² (Figure 1). It uses single-latch LSSD design and has 99.5% stuck-at fault test coverage.

To limit power dissipation, the supply voltage was lowered by using a 0.8µm CMOS process tailored for 3.3V operation. This process has single-level polysilicon, with silicided diffusion and polysilicon for low resistance. The nominal effective channel length is 0.45µm and the gate oxide is 12nm thick. Tungsten is used for local interconnect between n and p diffusions and the polysilicon. The design uses three metal levels and has a 2.4µm contact pitch.

To reduce power further, switching factors in functional and standby modes are reduced. For the functional mode, a static bus is selected because it offers a lower switching factor, dissipating power only when data changes state. A static bus also has lower capacitance because it is segmented by multiplexers, but this approach requires more multiplexers: their number almost doubles compared with a dynamic bus design, with four-way multiplexers being the maximum size allowed in the chip design.

The four-way multiplexers (Figure 2) in the data flow are self-decoding, which simplifies testing because the output of the multiplexer is never floating. The data are also driven rail-to-rail through a parallel path of n and p devices. Four-way is the maximum multiplexer size used, to reduce wiring congestion and limit stacking of devices.
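The two power levers described above, a lower supply voltage and a lower switching factor, both appear in the standard first-order estimate of CMOS switching power, P = a·C·V²·f. The Python sketch below only illustrates that relationship. The 10pF bus capacitance, the 5V comparison point, and the two sample switching factors are hypothetical inputs chosen for illustration, not figures from the paper.

```python
def dynamic_power(switching_factor, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS switching power in watts: P = a * C * V^2 * f."""
    return switching_factor * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical bus: 10 pF of switched capacitance, clocked at 66 MHz.
C_BUS = 10e-12
F_CLK = 66e6

# Effect of the supply voltage alone (5 V is an assumed comparison point).
p_5v = dynamic_power(0.5, C_BUS, 5.0, F_CLK)
p_3v3 = dynamic_power(0.5, C_BUS, 3.3, F_CLK)
supply_ratio = p_3v3 / p_5v  # equals (3.3 / 5.0) ** 2

# Effect of a lower switching factor at the same 3.3 V supply.
p_low_activity = dynamic_power(0.15, C_BUS, 3.3, F_CLK)

print(f"3.3 V vs 5 V power ratio: {supply_ratio:.3f}")
print(f"Power at a=0.50: {p_3v3 * 1e3:.2f} mW, at a=0.15: {p_low_activity * 1e3:.2f} mW")
```

Because voltage enters quadratically, the supply reduction alone removes more than half of the switching power in this model, and the switching factor then scales the remainder linearly.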
A dynamic Manchester-carry chain adder could not be used because of the static logic, so a compact double-carry select adder with a 3.2ns delay (Figure 3) was used. In this adder, a signal generator creates (at each bit position) the true and complement of the carry and the complement of the sum, assuming a carry-in of 0 and of 1. The adder is segmented into several groups of bits. Within each group, two selector chains select the appropriate sum and carries, for an initial carry of 0 and 1. The selectors are complementary pass gates to provide full signal swing. At the end of each group, a second selector (controlled by the carry into the group) selects the proper carry for the next group. The selectors for the sum are decoupled from the carry selector to reduce the load on the carry chain.

When full performance is not needed, several power-saving modes reduce power dissipation. The modes can be enabled by special machine-specific registers. The first of these is the low-power halt mode, which allows the processor to stop its internal clocks whenever a "halt" instruction is executed. The processor remains in this mode until an interrupt or reset is detected. Because the design is static, the contents of the processor latches are preserved while the clocks are stopped.

The system can also change the input clock frequency. When it first asserts a dynamic frequency request (DFR), the processor switches to a special clock generator and sends a "dynamic frequency shift ready" signal back to the system. The input frequency can then be shifted to the different clock rate. When returning to the faster clock rate, the DFR signal can be dropped and the processor automatically resets the clock shaper, thereby returning the processor to the clock doubler rate.

If the system needs to disconnect the processor from VDD, the power-management architecture allows all the software registers to be saved in a special memory area. The power-management software can then set a bit in an I/O port to indicate that the processor is in the shutdown state, and then power down the processor. At power up, the system re-enters the power-management mode, restores the register contents and, by executing a power-management return instruction, returns the processor to the program state it was executing prior to shutdown. This allows users to turn off a computer for an extended time and return to the pre-shutdown state. The power-saving modes are independent of user code, but are dependent upon system implementation.

For a fast transition between frequencies, a digital clock shaper is used (Figure 4). A reference pulse is generated on the rising edge of the incoming clock signal and sets a set/reset latch. The input clock also propagates through a delay line. At each stage of the delay line, a pulse is generated and compared to the reference pulse. At the delay stage where the two pulses are coincident, a latch (n) is set, activating a pull-down device in a reset network. With the next incoming clock, the pulse generated at stage n/2 discharges the reset network and resets the output latch. Only two cycles are necessary for the clock shaper output to be valid.

The integrated cache unit (Figure 5) includes a 144kb, four-way set associative unified cache, 20kb TAG array, 2kb LRU array, 1kb state array, comparators, and control logic. The synchronous self-timed design achieves a 6.9ns clock-to-data access, a 5ns clock-to-hit access, and a 12ns cycle. All arrays perform a read-modify-write operation in one cycle.
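The array sizes and timing quoted for the cache unit can be cross-checked with a little arithmetic. The Python sketch below is only a consistency check on the published numbers. It assumes that "kb" means 1024 bits and "kB" means 1024 bytes. The resulting 9 stored bits per data byte would be consistent with, for example, one check bit per byte, but the paper does not say what the extra bit holds.

```python
KBIT = 1024  # assumption: array sizes in "kb" are multiples of 1024 bits

cache_bytes = 16 * 1024        # 16kB cache capacity, from the paper
data_array_bits = 144 * KBIT   # 144kb data array, from the paper
tag_bits = 20 * KBIT           # 20kb TAG array
lru_bits = 2 * KBIT            # 2kb LRU array
state_bits = 1 * KBIT          # 1kb state array

# Stored bits per byte of cache capacity.
bits_per_byte_stored = data_array_bits / cache_bytes

# Total storage across all four arrays.
total_array_bits = data_array_bits + tag_bits + lru_bits + state_bits

# The 12ns array cycle must fit inside one 66MHz clock period.
clock_period_ns = 1e3 / 66.0
cycle_margin_ns = clock_period_ns - 12.0

print(f"Stored bits per cache byte: {bits_per_byte_stored}")
print(f"Total array storage: {total_array_bits // KBIT}kb")
print(f"66MHz period: {clock_period_ns:.2f}ns, margin over 12ns cycle: {cycle_margin_ns:.2f}ns")
```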
The integrated cache unit occupies 34mm² and uses a six-transistor cell. The cache architecture is driven by low-power considerations. Common word-decode and redundancy circuits are used for all arrays. The number of cells selected is minimized by using segmented word lines and byte controls in the data array.

Cache circuit approaches are driven by both low power and high performance. Static address buffers and decoders reduce power, and a no-operation (NOOP) control signal depowers the cache unit when idle. Sense amplifiers are latched only when selected by a "hit." Self-timed circuits minimize the time for array-cell selection. Data outputs are latched and switch only if data changes in the load cycles. Output switching does not occur during stores. All cache-unit logic is custom-designed to improve power, area and performance. Cache power is 400mW at 3.3V, 66MHz and 50% utilization. There is no standby power other than that from leakage.

Table 1 shows power use for 3.3V and 33/66MHz. Switching factors for the different elements are also provided.

Acknowledgments

The authors acknowledge contributions of R. Battaline, R. Benson, J. Bergkvist, M. Bolliger, G. Braceras, A. Davis, D. Dougherty, J. Geissler, S. Gould, P. Kartschoke, D. Lavalette, P. Perry, J. Raymond, K. Shaw and N. Suzuki.

Reference

[1] Nguyen, K., et al., "A High Performance/Low Power 16kB 4-Way Associative Integrated Cache Macro," Proceedings of the IEEE Custom Integrated Circuits Conference, May 1993.

ISSCC94 / THURSDAY, FEBRUARY 17, 1994 / BUENA VISTA / 3:00 PM

Figure 1: See page 340.

Figure 2: Self-decoding multiplexer (inputs Data 1 to Data 4, selects Sel 1 and Sel 2, output Out).
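The "never floating" property of the self-decoding multiplexer can be illustrated with a small behavioral model. This Python sketch is not the circuit in Figure 2. It only shows that when two encoded select signals are decoded inside the multiplexer, every select combination enables exactly one data path. The mapping of select codes to inputs is an assumption made for illustration.

```python
from itertools import product

def enabled_paths(sel1, sel2):
    """Decode two select bits into enables for Data 1..Data 4 (assumed mapping)."""
    sel1, sel2 = bool(sel1), bool(sel2)
    return [
        (not sel1) and (not sel2),  # Data 1
        sel1 and (not sel2),        # Data 2
        (not sel1) and sel2,        # Data 3
        sel1 and sel2,              # Data 4
    ]

def mux4(data, sel1, sel2):
    """Return the single driven input; the output is never undriven."""
    driven = [d for d, en in zip(data, enabled_paths(sel1, sel2)) if en]
    assert len(driven) == 1, "exactly one path must drive the output"
    return driven[0]

# Every select combination enables exactly one of the four paths.
for s1, s2 in product([False, True], repeat=2):
    assert sum(enabled_paths(s1, s2)) == 1

print(mux4(["d1", "d2", "d3", "d4"], True, False))
```

A multiplexer driven by four independent, externally decoded enables would not have this guarantee, since a test pattern could leave all four enables off.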

Figure 3: Double-carry select adder.
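The selection scheme of the adder can be sketched behaviorally in Python. This is a generic carry-select model, not the pass-gate circuit of Figure 3. Each group computes its sum and carry-out for a carry-in of both 0 and 1, and the carry arriving from the previous group picks between them. The group sizes are hypothetical, since the paper says only that the adder is segmented into several groups.

```python
def add_group(a_bits, b_bits, carry_in):
    """Add one group of bits (LSB first); return (sum_bits, carry_out)."""
    sums, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry

def carry_select_add(a, b, width=32, group_sizes=(4, 4, 8, 8, 8)):
    """Carry-select addition; group_sizes is an illustrative assumption."""
    assert sum(group_sizes) == width
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result_bits, carry, pos = [], 0, 0
    for size in group_sizes:
        ga, gb = a_bits[pos:pos + size], b_bits[pos:pos + size]
        # Both candidate results are prepared before the real carry is known.
        sum0, cout0 = add_group(ga, gb, 0)
        sum1, cout1 = add_group(ga, gb, 1)
        # The carry into the group selects the sum and the next carry.
        result_bits += sum1 if carry else sum0
        carry = cout1 if carry else cout0
        pos += size
    total = sum(bit << i for i, bit in enumerate(result_bits))
    return total, carry

print(carry_select_add(0xFFFFFFFF, 1))
```

In hardware the two candidate results for all groups are produced in parallel, so the critical path is roughly one group delay plus one selector per group rather than a full 32-bit ripple.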

Figure 4: Digital clock shaper.

Unit            Power (W)   Switching factor
Cache           0.400       0.5
Clock           0.330       2.0
TLB             0.293       1.0
Control logic   0.290       0.14
Data flow       0.194       0.15
I/O             0.126       0.06
PLA             0.096       -
ROM             0.037       0.28
Total           1.770

Table 1: Power dissipation analysis.
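The per-unit figures in Table 1 can be checked against the reported total. The Python sketch below assumes that the power values pair with the unit names in the order both appear in the table. Under that assumption the eight entries sum to 1.766W, within 4mW of the reported 1.770W; the small gap is presumably rounding in the published figures, which is also an assumption.

```python
# Power per unit in watts, paired with unit names in table order (assumption).
power_w = {
    "Cache": 0.400,
    "Clock": 0.330,
    "TLB": 0.293,
    "Control logic": 0.290,
    "Data flow": 0.194,
    "I/O": 0.126,
    "PLA": 0.096,
    "ROM": 0.037,
}
reported_total_w = 1.770

summed_w = sum(power_w.values())
gap_w = reported_total_w - summed_w

# Share of the reported total taken by each unit.
shares = {unit: p / reported_total_w for unit, p in power_w.items()}

print(f"Sum of units: {summed_w:.3f} W, reported total: {reported_total_w:.3f} W")
for unit, share in shares.items():
    print(f"{unit:14s} {share:6.1%}")
```

On these numbers the cache and the clock together account for about 41% of the functional-mode power.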

Figure 5: Integrated cache block diagram.
