A FIFO Data Switch Design Experiment

William S. Coates, Jon K. Lexau, Ian W. Jones, Scott M. Fairbanks, and Ivan E. Sutherland Sun Microsystems Laboratories, 901 San Antonio Road, Palo Alto, CA 94303-4900

Abstract

A core problem in many pipelined circuit designs is data-dependent data flow. We describe a methodology and a set of circuit modules to address this problem in the asynchronous domain. We call our methodology P**3, or “P cubed.” Items flowing through a set of FIFO datapaths can be conditionally steered under the control of data carried by other FIFOs. We have used the P**3 methodology to design and implement a FIFO test chip that uses a data-dependent switch to delete marked data items conditionally. The circuit uses two on-chip FIFO rings as high-speed data sources. It was fabricated through MOSIS using their 0.6µ CMOS design rules. The peak data switch throughput was measured to be at least 580 million data items per second at the nominal Vdd of 3.3V.

1. Introduction

For some time our research group has been concentrating on the design of high-performance asynchronous components. In [6] we presented the results of a FIFO ring performance experiment which demonstrated that an asynchronous FIFO datapath could be made to run every bit as fast as a similar clocked circuit. This result was encouraging, but the question remained as to whether high performance could be maintained in a more complex circuit. In particular, we were interested in systems that exhibit data-dependent data flow. Our work on the counterflow pipeline [8] provided us with many challenging design examples suitable for follow-on chip experiments but also pointed out some shortcomings in our design methodology. The formidable circuit complexity we were facing suggested the use of a higher-level design approach, but we found that abstraction often came at the cost of reduced performance. We developed a design methodology we called P**3, or “P cubed,” which provided us with a level of design abstraction while still maintaining a tight coupling to the underlying circuits. The name P**3 arose from the names of the three basic circuit primitives on which the methodology is based. We have called these modules the Path, the Place, and the Port. In addition to a circuit primitive implementation, each of these modules has both a notational representation and a procedural model for simulation. These three simple modules, along with minor variants, compose easily into complex systems that retain the high performance of the primitive elements.

In this paper, we describe a FIFO chip experiment that expands upon our previous work. The primary goal of the experiment was to build on our experience with high-speed FIFOs to produce more complex systems without unduly sacrificing performance. As in [6], we aimed to achieve maximal performance by careful attention to localized timing relationships. A secondary goal of the experiment was to gain experience in the employment of our P**3 methodology as a design tool. The test circuit was implemented as a FIFO with a “conditional drop” capability. Data items flowing in the FIFO pass through a P**3 Port, which is essentially a high-speed data-dependent FIFO switch. A sequence of data bits carried in a separate FIFO controls the Port. Though simple, the resulting circuit demonstrates the core functionality required to implement a wide variety of useful data-dependent processing functions.

Two on-chip FIFO rings serve as high-speed data sources for the structure under test. This makes it possible to exercise the test circuit at full speed while placing minimal performance demands on the off-chip testing environment. Each ring may also be tested and characterized separately in a straightforward manner.

A first test chip was fabricated using the MOSIS 0.6µ CMOS process. The measured maximum throughput of the test circuit, with Vdd at the nominal 3.3V and chip temperature near ambient, was 629 million data items per second when all items are deleted from the data stream, and 583 million data items per second when all data items are passed along. All chips that we tested were functional, although we observed failures during certain high-speed switching tests. We tracked the problem to a design flow error and submitted a second chip to correct the problem. As this paper goes to press, we can present data from only the first of the chip runs.

In the remainder of this paper, we will first give some background information, then a brief overview of the P**3 design methodology as it relates to the design of the test chip. We will then explore the design of the chip itself in more detail. Finally, we will present some results and draw conclusions.

2. Related Work

A large number of techniques and tools exist for specifying, analyzing, and implementing asynchronous systems. An excellent overview of some of the methods currently in use can be found in Hauck [4]. Possibly the first project to explore the use of a standard set of asynchronous modules as system building blocks was the Macromodules project [2]. It clearly demonstrated the modularity advantages of asynchronous interfaces. More recent methodologies, such as the Tangram system in use at Philips Research Labs [10] and the work of Burns and Martin [1], use syntax-directed translation to map a general class of high-level specifications into networks of communicating asynchronous components. Large designs have been successfully fabricated using these approaches [5][11].

In contrast, we developed our methodology to help us produce high-performance implementations drawn from a much smaller design space. The methodology arose from our work with asynchronous FIFOs. Such FIFOs have received wide attention as an implementation medium [3][6][9][12]. P**3 designs are specified at a relatively low level and each notational primitive has a one-to-one mapping with a simple circuit module. Restricting ourselves to FIFO-style control logic minimized the number of standard components required, while at the same time guiding our thinking in directions we might not otherwise have taken. Our methodology most resembles that of Sparsø and Staunstrup [7], which also utilizes a small set of primitives to enable data-dependent data processing. Our notation can specify systems at a slightly higher level and partitions the problem somewhat differently. Also, the module implementations presented in this paper are tuned for performance rather than delay-insensitivity.

3. The P**3 Methodology

The P**3 methodology evolved from a conjunction of several factors. One of these was circuit design. We experimented with several methodologies for specifying behavior at a high level, but none provided both the ability to specify low-level concurrency and an easy translation to hardware. Another motivating factor was simulation. We were in the process of implementing a behavioral simulator in C++ and wished to develop a minimal set of simple data classes that would allow us to model the general types of circuits we use, as well as produce believable performance estimates.

The general class of circuits we model is that of asynchronous pipelined structures with data-dependent data flow. For example, data items flowing in one set of FIFO datapaths might be conditionally forked, merged, duplicated, or deleted based on control information carried in separate FIFOs. Because of this, we also required an explicit datapath representation in our methodology. Over time, we converged on a minimal set of three basic circuit modules which we called the Path, the Place, and the Port. Combinations of Path, Place, and Port elements form a structural skeleton in which to embed specific datapath and computational circuits.

This section will present a very brief overview of the P**3 module interfaces, notational symbols, and sample implementations. We will then illustrate the use of the notation with several simple specifications. The examples that follow are implemented using transition signalling and bundled data.

3.1. The Path

The basic P**3 Path is designed to control the copying of data between two adjacent stages of a FIFO, synchronizing the communication between a data sender and a data receiver. Its external interface is shown in Figure 1. The interface consists of an input and an output datapath of arbitrary width plus a handshake interface for each. We use the hollow pipe symbol shown in bold inside the box as a shorthand notation for the Path module.

Figure 1: Path Interface (sender and receiver interfaces; signals: Data Valid in, Data Wanted, Data Copied, Data Valid out, Data in, Data out)

The operation of the Path is very simple. An input request on the receiver interface indicates that the recipient is willing to accept a new datum. The datapath can be made transparent at this point. A corresponding input request on the sender interface indicates that the sender is holding new data stable. When both requests have arrived and the new input data has been successfully copied to the output, the datapath is made opaque and both acknowledge signals are sent. The sender is now free to change the value of the input data if desired; similarly, the receiver is free to make use of the new data that has been supplied.
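This sender/receiver rendezvous can also be expressed as a behavioral model. The following C++ sketch is illustrative only, written in the spirit of the simulator described in Section 6; the class and method names are ours, not the actual simulator code.

```cpp
#include <cstdint>
#include <functional>

// Minimal behavioral sketch of a P**3 Path (illustrative names).  The Path
// waits for the rendezvous of the sender's "data valid" request and the
// receiver's "data wanted" request, copies the datum, and acknowledges both.
class Path {
public:
    std::function<void()>        ackSender;    // "Data Copied" back to the sender
    std::function<void(uint8_t)> ackReceiver;  // "Data Valid out" plus the datum

    // Sender interface: new data are being held stable.
    void dataValidIn(uint8_t datum) {
        senderDatum_ = datum;
        senderReady_ = true;
        tryRendezvous();
    }

    // Receiver interface: the receiver is willing to accept a new datum.
    void dataWanted() {
        receiverReady_ = true;
        tryRendezvous();
    }

private:
    void tryRendezvous() {
        if (senderReady_ && receiverReady_) {      // both requests have arrived
            senderReady_ = receiverReady_ = false; // consume the two requests
            uint8_t copied = senderDatum_;         // datapath goes opaque here
            if (ackSender)   ackSender();          // sender may now change its data
            if (ackReceiver) ackReceiver(copied);  // receiver may now use the datum
        }
    }

    uint8_t senderDatum_   = 0;
    bool    senderReady_   = false;
    bool    receiverReady_ = false;
};
```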

The Path implementation shown in Figure 2 is aggressively simple, consisting merely of a rendezvous (implemented with a Muller C-element) and an inverse toggle (implemented with an XOR) that controls a pass gate. When both input requests have been received, the C-element will fire, causing the pass gate to become opaque and capture the input data. The signal is also forked to “simultaneously” acknowledge both requestors. A subsequent request from the receiver will make the pass gate transparent, copying the input data to the output. This circuit is obviously not speed-independent and relies on very careful control of timing to function correctly. Note, however, that this is an implementation choice rather than a property of the interface definition. Note also that there is no feedback circuit shown to keep the copied data, Dataout, static when the pass gate goes opaque. The required keeper circuit is located in the Place component.

Figure 2: Path Implementation (Muller C-element rendezvous and XOR-controlled pass gate; Data Valid in/out, Data in/out)

3.2. The Place

We use the Place component to represent where the data conceptually is located. Its interface and notational symbol are shown in Figure 3. A Place alternates between the states of empty and full. When empty, it signals its willingness to accept new data; when full, it signals its readiness to supply data. Data copied into a Place will remain static until explicitly overwritten.

Figure 3: Place Interface (input and output interfaces; signals: Empty, Full, Go Full, Go Empty, Data in, Data out)

The Place circuit shown in Figure 4 is also very simple. The single inverter in the source acknowledge path initializes the Place “empty,” in effect generating an initial transition after power-up. The triangle with a dot inside symbolizes a “sticky buffer” which will retain data supplied to it. This is the keeper circuit mentioned in the previous section.

Figure 4: Place Implementation (sticky-buffer keeper; Empty, Full, Go Full, and Go Empty signals)

A repeated connection of alternating Paths and Places forms a functional FIFO structure, in this case a transition micropipeline. With the addition of the Port component, discussed next, flexibility and functionality are added to the basic FIFO structure.
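A corresponding behavioral sketch of the Place, again with illustrative names only; the stored value stands in for the keeper circuit.

```cpp
#include <cstdint>
#include <functional>

// Minimal behavioral sketch of a P**3 Place (illustrative names).  The Place
// alternates between empty and full; the held datum models the keeper.
class Place {
public:
    std::function<void()>        signalEmpty;  // "Empty": willing to accept new data
    std::function<void(uint8_t)> signalFull;   // "Full": ready to supply data

    // "Go Full": a neighbouring Path has copied new data into the Place.
    void goFull(uint8_t datum) {
        datum_ = datum;                   // keeper: value stays until overwritten
        full_  = true;
        if (signalFull) signalFull(datum_);
    }

    // "Go Empty": the datum has been consumed downstream.
    void goEmpty() {
        full_ = false;
        if (signalEmpty) signalEmpty();   // the power-up state is also empty
    }

private:
    uint8_t datum_ = 0;
    bool    full_  = false;
};
```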

3.3. The Port


The Port is a more interesting P**3 component; it is the one that enables data-dependent data processing. Its task is to synchronize a control stream with a corresponding data stream. There are two basic types of Port, an output Port and an input Port, named with respect to the adjacent Place. An output Port is interposed between a Place’s output interface and a Path’s sender interface. It allows conditional data deletion. An input Port is situated between a Path’s receiver interface and a Place’s input interface. It allows conditional data duplication.

The Port external interfaces, symbols, and basic implementations appear in Figure 5. The input and output Ports are shown together “in context,” with a Place between them. One should imagine Path components located at the far right and far left to complete the picture. The additional “control” interface on each Port connects to the output interface of a Place, not shown, that supplies control information. A single control datum is consumed for each transaction. One bit of control data is required; the value of this bit determines the action to be taken when a request arrives at the input data interface of the Port.

Figure 5: Input and Output Ports in Context (an input Port and an output Port with a Place between them; each Port has a control interface carrying Control Data with Control Full and CGo Empty handshake signals)

The basic Port implementation is more complex than the other components, consisting of a Muller C-element, a transition selector, and an XOR. The C-element acts as a rendezvous that synchronizes transactions on the data stream and the control stream, generating a request to the selector module only when both have occurred. Based on the value of the bundled data bit supplied by the control stream, the selector will either acknowledge the Place immediately or generate a request to the appropriate Path. The XOR here implements a merge function that generates the Place acknowledgment.

Consider the output Port at the right of Figure 5. A request from the sending Place indicates that the Place is full and wishes to supply data to its Path. The Port will wait until it receives a corresponding request from the control Place before taking any action. If the control data bit is a “1,” indicating that the corresponding “through” datum should be deleted, the Port sends an acknowledgment to the sending Place immediately, causing it to become empty. The Path on the other side of the Port remains ignorant of the transaction. If the control data bit is a “0,” the Port sends a forward-going request to its Path, which will eventually cause a data copy to take place. The acknowledgment signal from the Path, indicating that the data have been copied, will be routed to the sending Place and allow it to go empty. The Port also routes an acknowledgment signal to the control Place, allowing it to go empty and later be refilled with fresh control data.

The input Port operates in a similar fashion. The input request from the receiving Place indicates that the Place is empty and wishes to be refilled. Data from the control Place is used to determine whether the receiving Place should be acknowledged immediately, thus reusing the old, previously latched data, or whether the Path should be requested to supply a new datum.

If data are to be copied unconditionally from Place to Place, i.e. as in a FIFO, the Port control circuitry is not necessary and can be “optimized away,” leaving nothing but a set of datapath wires. This is a common enough case that we use a special notation consisting of a simple arrow that connects the Place to the Path, or the Path to the Place. The direction of these arrows serves to distinguish input and output interfaces on the Paths and Places. We tend to refer to these as null input or output Ports to distinguish them from the controlled Ports shown in Figure 5.
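The decision made by an output Port can be summarized in a short behavioral sketch. The names below are illustrative and the Path acknowledgment is simplified; this is neither the circuit nor the actual simulator code.

```cpp
#include <cstdint>
#include <functional>

// Behavioral sketch of a P**3 output Port (illustrative names).  The Port
// rendezvouses a datum from the sending Place with one control bit, then
// either drops the datum or forwards it to the downstream Path.
class OutputPort {
public:
    std::function<void()>        ackSendingPlace;   // lets the sending Place go empty
    std::function<void()>        ackControlPlace;   // lets the control Place refill
    std::function<void(uint8_t)> requestPath;       // forward the datum downstream

    void senderFull(uint8_t datum) {                // request from the sending Place
        datum_ = datum; haveDatum_ = true; tryTransaction();
    }
    void controlFull(bool drop) {                   // one control bit per transaction
        drop_ = drop; haveControl_ = true; tryTransaction();
    }

private:
    void tryTransaction() {
        if (!(haveDatum_ && haveControl_)) return;  // C-element style rendezvous
        haveDatum_ = haveControl_ = false;
        if (drop_) {
            if (ackSendingPlace) ackSendingPlace(); // "1": delete the datum; the
                                                    // downstream Path never sees it
        } else {
            if (requestPath)     requestPath(datum_); // "0": pass the datum along
            if (ackSendingPlace) ackSendingPlace();   // simplification: in the
                                                      // circuit, the Path's own
                                                      // acknowledgment frees the Place
        }
        if (ackControlPlace) ackControlPlace();     // consume the control datum
    }

    uint8_t datum_ = 0;
    bool drop_ = false, haveDatum_ = false, haveControl_ = false;
};
```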

3.4. More on Circuit Implementation

The implementations shown in the previous figures have been simplified for clarity. State-holding components, such as the Muller C-element and the selector, require a reset signal, which has been omitted in the figures. In addition, each control handshake signal in our chip experiment is carried on two separate wires. These wires carry complementary (True and False) versions of each signal. Although wasteful of area, this allows logical signal inversion to be performed using a simple wire swap, as well as offering speed advantages. Figure 6 shows the control portion of the standard Path module as it was implemented. Note the several levels of buffering and also the duplication of circuitry required to implement the complementary signalling. The XOR / XNOR gates, which generate the latch control signals, are implemented using pass gates and so require both true and complemented versions of each of their inputs.

Figure 6: Path Implementation Detail (control portion with complementary True/False signal pairs, C-elements, buffering, Reset, and the Data Latched signals that drive the data latch control)

The Place and Path components can be extended in a straightforward way to accommodate multiple senders or multiple receivers of data. For example, Figure 7 shows a Place configured to fan out to two output Ports. In this case, a Muller C-element is required to rendezvous the two acknowledge signals; the Place generates an input acknowledge, i.e. declares itself empty, only after both receivers have acknowledged. The symbolic representation at the bottom of the figure shows the Place connected to a null input Port and two null output Ports.

Figure 7: Forking Place and Symbol

The P**3 circuits presented so far are by no means the only possible implementations. One of the advantages of the asynchronous design style is that we can concentrate on interface specification and defer implementation details to a later design stage. The choice of a two-phase signalling protocol combined with a single-rail bundled datapath for our experiment was somewhat arbitrary. We have experimented with several interface protocols and circuit variants, including the asP* protocol reported in [6]. So far, all have mapped well into the P**3 framework.

3.5. Some P**3 Examples

These three skeleton P**3 component types can be connected in various configurations to implement a wide variety of data-processing systems. The simplest example is that of a FIFO segment. Figure 8 shows a possible P**3 specification. Note that no controlled Ports are needed because data movement is unconditional.

Figure 8: P**3 FIFO

Figure 9 shows a slightly more complex example of a FIFO that includes a controlled output Port to drop data items conditionally. If a sequence of boolean “0” values is supplied to the Port’s control Place at the right-hand side of the figure, the rest of the system will behave like a normal FIFO. Each “1” that appears in the control stream will drop a single datum.

Figure 9: FIFO with Conditional Drop
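At the level of whole data items, the system of Figure 9 simply filters the data stream with the control stream. A minimal sketch of that end-to-end behavior (illustrative only, not circuit code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// End-to-end behavior of the conditional-drop FIFO of Figure 9: each control
// bit is paired with one data item; a "1" deletes the item, a "0" passes it.
std::vector<uint8_t> conditionalDrop(const std::vector<uint8_t>& data,
                                     const std::vector<int>& control) {
    std::vector<uint8_t> out;
    std::size_t n = std::min(data.size(), control.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (control[i] == 0) out.push_back(data[i]);   // "0": pass, "1": drop
    }
    return out;
}
```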

Ports in combination with forking and joining datapaths can be used to steer data conditionally rather than merely duplicating or destroying it. Multiple input Ports converging on a single Place can be used to implement multi-ported latches, where data may be loaded from a selected Port for each transaction. A Place with multiple output Ports sends data to selected receivers for each transaction. Figure 10 shows a specification fragment that splits a single FIFO into two streams which are then recombined. Note that the left and right branches may differ in length, although throughput or latency may suffer if care is not taken in the design.

Figure 10: Branching and Merging FIFOs

4. Experiment Design

We settled on the conditional drop configuration of Figure 9 for a test chip implementation. This is about the simplest configuration that includes non-trivial examples of all three P**3 primitives, and represents an incremental step up in complexity from our previous FIFO experiment. The challenge was to design an experiment that would let us test this circuit at speed. This required devising a method to deliver data rapidly to both inputs of the Port under test and to rapidly absorb those data elements that make it through the switch. We also wanted to control the experiment externally using relatively low-speed testing hardware.

4.1. Data Sourcing and Sinking

Rather than streaming data on- and off-chip at high speed, we chose to implement self-contained data sourcing and sinking mechanisms on the chip. The chip has two data sources implemented as FIFO rings, one to feed the data interface of the P**3 Port under test, and another to supply control information. Each ring is tapped, using a forking Place as in Figure 7, to implement a continuous, albeit repetitive data stream. Note, however, that if the numbers of items loaded into the two rings are relatively prime, the test repeats only after the product of those two numbers. The output of the Port goes into a FIFO tail, which can store enough data to provide a meaningful snapshot of recent activity. Given this basic structure, the next task was to design and implement the external interface. A typical experimental run, for example, consists of loading the data and control rings, starting the experiment, stopping the experiment, and reading the resulting data out of the chip. Other experiments selectively drop a majority of data items so that the external data rate at the end of the long tail is low enough for real-time monitoring.
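The repetition length is just the least common multiple of the two loaded item counts, which equals their product when the counts are relatively prime. A small illustrative check (the particular counts below are examples, not measured values):

```cpp
#include <cstdint>
#include <numeric>   // std::gcd (C++17)

// The combined data/control pattern repeats after lcm(d, c) transactions,
// which equals d * c whenever the two item counts are relatively prime.
uint64_t repeatPeriod(uint64_t dataItems, uint64_t controlItems) {
    return dataItems / std::gcd(dataItems, controlItems) * controlItems;
}
// e.g. repeatPeriod(37, 32) == 1184, whereas repeatPeriod(36, 32) == 288.
```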

4.2. External Interface Details

Figure 11 shows a conceptual overview of the experiment including some of the external control circuitry. The data ring, control ring, and tail can be seen as described above. The short bars represent locations where we wish to connect external signals to the chip. In addition to the obvious interfaces marked on each of the rings and at the end of the data tail, there is one marked on each tap between the rings and the P**3 Port. These permit each ring to be logically “disconnected” from the rest of the experiment so that we can test and characterize the rings separately. We chose the rings to have different lengths in order to fill in more data points in the normalized throughput vs. fullness curve (see Section 7.1). The smaller ring was chosen to have 32 stages in order to discredit any superstition about a ring needing an odd number of stages in order to oscillate correctly.

Figure 11: Chip Experiment Overview (a data stream from the 37-stage data ring and a control stream from the 32-stage control ring feed the Port, which drives a 41-stage output tail FIFO; short bars mark external interface connections)

4.3. The Observing Port

One way to implement all of these external interfaces would be to use “standard” P**3 Ports. For example, a merging Place configuration as shown in Figure 10 could be used to load the rings by selectively accepting external data instead of ring data. Unfortunately, crossing the chip boundary would produce a performance bottleneck and make it impossible to run the chip at maximum throughput. We found, however, that a small modification to the Port circuit would allow us to circumvent this problem while still adhering to the P**3 methodology. We called this modified design the observing Port. A preliminary implementation of an observing output Port is shown in Figure 12 along with its P**3 notational symbol. Essentially, the control data input of the Port remains, but the control handshake interface is eliminated. This means that the incoming control stream is no longer synchronized with transactions on the main data stream. A single external level signal can thus be used to control the selector module within the Port, with no high-speed handshake signal crossing the chip boundary. Of course this imposes severe timing restrictions on when the external control may safely change. The observing input Port, not shown, may be implemented in a similar manner.

For the final chip implementation, we added another level control input to the observing Port so that, in addition to the normal data steering function, we could block, or disable, the Port. When a Port is blocked, input requests will never be acknowledged. This capability was used to implement a start/stop mechanism.

Figure 12: Observing Output Port (implementation and symbol; the Control Data input is an external level signal and the control handshake interface of the standard output Port is eliminated)
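A behavioral sketch of how the observing output Port differs from the standard one (illustrative names; the drop and block inputs are externally driven levels rather than handshaked data):

```cpp
#include <cstdint>
#include <functional>

// Behavioral sketch of an observing output Port (illustrative).  The control
// value is an external level rather than a handshaked stream, and a second
// level can block the Port entirely (used for the start/stop mechanism).
class ObservingOutputPort {
public:
    std::function<void()>        ackSendingPlace;
    std::function<void(uint8_t)> requestPath;

    void setDropLevel(bool drop)  { drop_ = drop; }                    // external level
    void setBlocked(bool blocked) { blocked_ = blocked; tryTransaction(); }

    void senderFull(uint8_t datum) {                 // request from the sending Place
        datum_ = datum; pending_ = true; tryTransaction();
    }

private:
    void tryTransaction() {
        if (blocked_ || !pending_) return;           // a blocked Port never acknowledges
        pending_ = false;
        if (drop_) { if (ackSendingPlace) ackSendingPlace(); return; }
        if (requestPath)     requestPath(datum_);    // pass the datum downstream
        if (ackSendingPlace) ackSendingPlace();      // simplification, as before
    }

    uint8_t datum_ = 0;
    bool drop_ = false, blocked_ = false, pending_ = false;
};
```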

4.4. Datapath Issues

The P**3 notation as presented implicitly specifies a set of datapath latches but is silent regarding the details of the datapath. At present, we annotate our P**3 diagrams to indicate such details as datapath width and bus taps. Computations may be specified in the form of arbitrary combinational functions placed in the forward datapath within any component. This may require adding completion signalling or matching delay elements in order to preserve the interface bundling constraints. For this experiment, each ring contains a four-bit wide datapath, loaded serially in the same manner as in [6]. Each ring contains a 1-bit circular shift at the interface stage. At the fork in the data ring, all four bits are tapped and are gated by the P**3 Port. Four bits are also tapped from the control ring, but only one of them controls the Port.
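For illustration, the one-bit circular shift applied to a four-bit ring word corresponds to a simple rotation such as the following; the direction of rotation shown here is an assumption.

```cpp
#include <cstdint>

// One-bit circular (rotate-left) shift of a 4-bit value, as applied to the
// serially loaded pattern at each ring's interface stage (illustrative).
uint8_t rotate4(uint8_t value) {
    value &= 0x0F;                                            // 4-bit datapath
    return static_cast<uint8_t>(((value << 1) | (value >> 3)) & 0x0F);
}
```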

4.5. Top-Level Design

Figure 13 shows a P**3 specification for the entire experiment, although most of the repeated FIFO elements have been omitted for simplicity. Instances of triple dots (. . .) between ports represent missing FIFO stages, with a label indicating the number of omitted places. The small squares represent off-chip connections; the number inside the square indicates how many pins are required. The I/O interface to each of the control and data rings contains four observing Ports, requiring a total of 8 pins per ring. The external data source and sink for each ring are treated as logical P**3 Places, each requiring one request, one acknowledge, and one data pin. The forking Place at the bottom of each ring feeds a short FIFO connected to the output Port we wish to test. The Port under test feeds into a relatively long, 40-stage FIFO tail which ends with another observing Port data output interface. This interface requires two control and four data signals. There are two other “short tails” that fork off from the ring tail FIFOs and end with an observing Port. These observing Ports are provided as a means to stop the associated rings. Use of the observing Ports, labelled op1, op2, etc., will be discussed in the testing section later in this paper.

Figure 13: P**3 Experiment Specification (data ring, 37 stages; control ring, 32 stages; observing Ports op1–op13; output Port under test p1; small squares give off-chip pin counts)

5. Implementation

To capitalize on the modularity of the P**3 notation, we developed standard cell layouts for each of the primitive elements. These included all of the modules discussed so far, as well as a small number of optimized variants. Including these variants brought the total number of cells needed for this implementation to 16. This greatly reduced the amount of hand-layout required. Because our notation explicitly shows connectivity and implicitly includes the datapath, routing could be accomplished by simple abutment of cells. Once the cells were placed and abutted, only the final routing to the pads remained to be done by hand. Figure 14 shows a detail of the data ring external interface standard cell placement. Note how the Port cells are doubled in height to reflect the fact that they, in effect, merge two FIFO streams. Note also that all Ports except for the observing Ports are null Ports and do not require an explicit standard cell. Their logical placement appears as small arrowheads in the figure.

Figure 14: Interface Cell Placement (detail of the data ring external interface standard cells; signals to and from the ring, with pin counts)

For this initial implementation, we considered modularity and ease of implementation our primary criteria. We may wish in the future to redesign our standard cell layouts to reduce wire lengths and allow more general datapaths. Figure 18, located at the end of the paper, shows a layout plot of the final chip as submitted to MOSIS.

6. Simulation

Our functional simulator, written in C++ and subsequently ported to Java™, accepts a textual version of P**3 specifications directly as input. Each P**3 primitive has a corresponding behavioral model implemented as a C++ or Java object. Inter-object function calls are used to model the communication interfaces between components. Each object also has a corresponding delay model to give us preliminary performance estimates from the simulator. These delays are calibrated against Spice simulation results. Because only a limited number of component configurations appeared in our design and wire lengths in the layout were fixed by abutment, it was fairly straightforward to extract the relevant delay values from relatively small circuit simulations with Spice. We also implemented a Verilog model of the chip design for comparison purposes, which was also calibrated with these same delay values. The Verilog model provided more low-level circuit detail, at the expense of longer simulation run times.

Both simulators were configured to record selected timestamped events in a “journal” file. We developed a tool to display these journal files graphically as an aid to understanding system operation. Most design errors could be detected at a glance in this way, while performance could be estimated from the time axis. C++ simulation was also used to help develop a test suite for the chip. Starting with simple load/unload and datapath tests and adding incremental complexity, we developed the test vectors to exercise the silicon.
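The following sketch suggests the flavor of the behavioral models and journal mechanism described above; the classes, delay values, and file format are illustrative assumptions, not the actual simulator code.

```cpp
#include <fstream>
#include <string>
#include <utility>

// Illustrative sketch only.  Each behavioral model advances time by a
// Spice-calibrated delay and appends timestamped events to a journal file,
// which a separate tool later displays graphically.
class Journal {
public:
    explicit Journal(const std::string& filename) : out_(filename) {}
    void record(double timePs, const std::string& component, const std::string& event) {
        out_ << timePs << '\t' << component << '\t' << event << '\n';
    }
private:
    std::ofstream out_;
};

class PathModel {
public:
    PathModel(std::string name, double delayPs, Journal& journal)
        : name_(std::move(name)), delayPs_(delayPs), journal_(journal) {}

    // Model one sender/receiver rendezvous starting at time 'nowPs': the
    // calibrated delay is added and the acknowledgment event is journaled.
    double fire(double nowPs) {
        double ackTime = nowPs + delayPs_;
        journal_.record(ackTime, name_, "ack");
        return ackTime;
    }

private:
    std::string name_;
    double      delayPs_;      // illustrative delay, nominally from Spice
    Journal&    journal_;
};

// Example: three Path models in series, as in a short FIFO segment.
int main() {
    Journal journal("fifo.journal");
    PathModel p1("path1", 600.0, journal), p2("path2", 600.0, journal), p3("path3", 600.0, journal);
    double t = 0.0;
    for (int item = 0; item < 4; ++item)   // pass four items through in sequence
        t = p3.fire(p2.fire(p1.fire(t)));  // (no pipelined overlap in this sketch)
    return 0;
}
```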

7. Testing and Results

Testing was accomplished using a stand-alone Apple Macintosh-based test platform along with a custom circuit board to contain the test chip. Each ring is loaded serially with the desired number of full cells and data pattern, and then the observing Ports are configured for the particular test. Some tests, such as throughput measurement, can be run continuously, while others, such as checking for correct output Port function, are run as a “burst test” where the results are stored in the long central tail for later checking at leisure.

7.1. Characterization of Individual Rings

As in [6], we first measured throughput and power consumption for each of the rings operating independently, without any connection to the output Port under test. Figure 15 shows the throughput vs. number of full cells for the two rings at Vdd = 2.4V, 3.3V, and 4.2V. The extreme voltages are approximately 27% above and below the nominal supply voltage. These test voltages were chosen because they represent a generous expansion of the typical voltage range specified for synchronous parts, which is usually ±10% from nominal. Results for both the 32-stage ring, shown dotted, and the 37-stage ring are given on the same graph. Note that if the number of full cells were normalized to the total number of cells in the ring, then the pairs of curves would lie exactly on top of one another.

The performance curve has a flat top because there is a stage (or stages) of the ring that cannot operate as quickly as all of the other stages. For this chip, as predicted by our simulations, the external interface is the bottleneck. The curves for power consumption vs. number of full cells are not shown here, but have exactly the same shape as the throughput curves. At the nominal supply voltage, 3.3V, the peak throughput is 670 Mwords/sec. Power consumption at this throughput is 566 mW for the 37-stage ring. Projecting the two sloped segments of the performance trapezoid gives the approximate performance of the FIFO stages not including the interface stages. For the nominal supply voltage, this projection yields a raw FIFO throughput of 820 Mwords/sec.
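The trapezoidal shape of these curves is consistent with the usual occupancy model for self-timed FIFO rings; the symbols below are ours, not taken from the measurements:

```latex
% Approximate throughput of an N-stage ring holding k full cells
% (standard occupancy model; symbols assumed):
%   l_f   : forward latency per stage (a datum advancing into an empty stage)
%   l_r   : reverse latency per stage (a hole moving back past a full stage)
%   T_max : local throughput of the slowest (here, the interface) stage
T(k) \;\approx\; \min\!\left(\frac{k}{N\,l_f},\; \frac{N-k}{N\,l_r},\; T_{\max}\right)
```

In this model the two sloped segments intersect at a rate of 1/(l_f + l_r), the local cycle rate of an ordinary FIFO stage, which is the quantity the projection described above estimates.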

7.2. Output Port Performance

Once we were convinced that the individual rings were working correctly, we were ready to connect the rings to the output Port, labelled p1 in Figure 13, and verify its functionality.

Figure 15: Throughput vs. Number of Full Cells (throughput in Mwords/sec vs. number of full cells for the 37-stage and 32-stage rings at Vdd = 2.4V, 3.3V, and 4.2V; peak throughputs of approximately 475, 670, and 815 Mwords/sec respectively)

From a performance point of view, there are two particular control ring patterns of interest, because they bound the throughput of the output Port. The first is where the control bits are all “1” and all of the data ring values are dropped. The second case has all of the control bits set to “0,” indicating that all of the data ring values should be passed on to the long tail. We find that the throughput in both cases is somewhat slower than that of the individual rings; thus the output Port limits our maximum performance. We also find that the drop case is slightly faster than the pass case. Figure 16 shows how the peak measured throughput varies with supply voltage for both the pass and drop cases. The peak throughput of a single ring is included for reference. The curve labelled “one ring” shows the maximum throughput for a single ring with the observing Port op9 set to disconnect the ring from the output Port. For the next lower curve, labelled “all drop,” both rings are running at their maximum rate and the observing Ports op9 and op10 are set to connect both rings to the output Port. However, because the control ring data bits are all “1,” the output Port will drop all of the items, so no data goes into the long tail. For the final case, labelled “all pass,” all of the control ring data bits are “0” and thus all of the data ring items are passed on to the long tail. We measured the single-ring throughput at the nominal supply voltage to be 670 Mwords/sec. The throughput in the “all drop” case is 629 Mwords/sec, and for the “all pass” case, the throughput is 583 Mwords/sec.

Figure 16: Peak Throughput vs. Supply Voltage (peak throughput in Mwords/sec vs. supply voltage in Volts for the “one ring,” “all drop,” and “all pass” cases)

Figure 17 shows how the power consumption varies with supply voltage under the same conditions described above. It may be somewhat counter-intuitive to note that the power consumption increases significantly as the throughput of the system decreases. However, if we consider what is happening in each case, we will see that there is actually no contradiction. In the “one ring” case, only one ring is running, obviously. When we go to the “all drop” case, both rings are running, albeit slightly more slowly. The output Port is blocking any further action, so we should expect the power consumption to nearly double, as it does. In the “all pass” case, both rings are again running, but now all of the data items are passed down the long tail, which consumes even more power, in spite of the reduced throughput of the entire system.

Figure 17: Power vs. Supply Voltage (peak power in Watts vs. supply voltage in Volts for the “one ring,” “all drop,” and “all pass” cases)

7.3. Stopping Behavior

Although we could have implemented an error-free, arbitrated stopping mechanism on this chip, as we have done in previous designs, we decided instead to explore the effectiveness of adding FIFO stages to decrease the chances of a metastability-related failure affecting the correct operation of our test chip. Each of the observing Ports can be set to the block state via an external signal, which will prevent any future transitions from passing through the Port. However, if there is a transition passing through the Port coincident with the arrival of the block signal, correct data transfer may fail in some way. An objective in the design of this chip was to position observing Ports at a variety of distances from the two rings so that we could study the failure rates when different Ports are used to halt the rings. For example, the observing Port labelled op2 in Figure 13 is in the data ring, so we could say it is at distance zero from the ring. The observing Port labelled op9 is one Path, or FIFO stage, away from the ring, so we could say that it is at distance one.

We have situated observing Ports at distances zero through four from the ring, as well as at the end of the long tail. Examples of distance-zero and distance-one stopping points have been given above; distance two, three, and four stopping points are at observing Ports op10, op12, and op13, respectively. Stopping behavior is checked by reading the data back out from the rings and checking whether it has been corrupted in any way. Using the distance-zero stop point at observing Port op2, 0.51% stopping failures were detected in 10,000 tests. It is also possible to use observing Port op1 to stop the ring, but what is its distance from the ring? Although it would at first appear to be distance zero, it is actually slightly further removed, because of the C-element in the forking Place. Stopping the ring via observing Port op1 yielded a failure rate of 0.04% in 10,000 tests. One would expect at least an order of magnitude reduction in the number of failures with each Path added. In fact, we were not able to observe any failures in 50,000 tests using Ports at a distance of one or greater. This corresponds to stopping using observing Port op9 in Figure 13.
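The expected improvement per added stage follows from the standard exponential model of metastability resolution; the resolving time constant τ below is an assumed parameter, not a value measured for this chip:

```latex
% Standard metastability model (symbols assumed): the probability that a
% state-holding element has not resolved after a settling time t is
P_{\mathrm{fail}}(t) \;\propto\; e^{-t/\tau},
% so each extra Path between the stop signal and the ring adds roughly one
% stage delay \Delta t of settling time and scales the failure rate by
% about e^{-\Delta t/\tau}.
```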

7.4. Output Port Robustness

The output Port performance testing described in Section 7.2 was carried out using “all pass” or “all drop” control sequences to bound the throughput of the system. Because the control data stream maintains the same value for the duration of the test, the data setup time at the output Port selector module is guaranteed to be met. To test that the required data bundling constraint is satisfied under worst-case conditions, we ran additional tests using an alternating “pass/drop” control stream. Unfortunately, we found that in cases where alternating control transactions are applied to the Port at their maximum rate, data items would sometimes be dropped erroneously. This led us to suspect a violation of the data bundling constraint at the control input of the selector. Rechecking our timing analysis, we found that the circuit schematic used as the basis for verification did not precisely match the final layout. Performing a timing analysis using the correct version of the schematic indicated that the fabricated circuit should indeed fail in the observed manner! We modified the Port circuit and resubmitted the chip for fabrication, but have not yet received silicon.

Except for the case described above, we found that all of the tests performed worked consistently down to a supply voltage of 2.0V. Due to limitations in the current test setup, we are not able to perform all tests at lower supply voltages. However, we have been able to infer correct single-ring functionality at supply voltages as low as 1.5V. We are planning to modify the test hardware to allow tests at lower supply voltages so that we may determine the exact failure points of the system under various test conditions. To date, all tests have been carried out at ambient room temperature, but we plan to conduct tests over a wide range of temperatures.

8. Conclusions

This experiment was as much a methodology test as it was a circuit test. We feel that we have learned a great deal in both areas from the design and testing exercise, and have made a good step in the direction of building complex, very high-speed systems. There remain some aspects that we wish to improve.

8.1. Circuit Issues

The circuit design style we chose was somewhat arbitrary. Our methodology allowed the choice of any implementation satisfying the required signal interface. We chose two-phase event logic and used two complementary wires per control signal, more from an intuition that this would lead to faster circuits than for any other reason. The lack of extraneous “return-to-zero” transitions and the ability to perform logical inversion with a wire swap offered potential performance advantages. Unfortunately, not all of these advantages were realizable in practice. The use of complementary signalling nearly doubled the area of some layout modules and increased wire lengths substantially.

The use of two-phase transition signalling provided a “clean” interface between modules, both at the circuit level and in the simulator software model. However, this came at the cost of a more complicated external test interface. In fact, it was possible to “inject” spurious transitions into the control and data rings via the external interface, shown in Figure 14, when the observing Ports controlling that interface were set to certain states. Fortunately, the flexibility of the external interface made it possible to avoid these undesired states and allowed testing to proceed. Partly as a result of designing this experiment, and partly as a result of progress we have made since then, we are confident that we can demonstrate significant improvements at both the circuit and the layout level.

8.2. Timing Verification

We firmly believe that to achieve competitive performance in a world dominated by highly optimized clocked circuits, we will be forced to adopt many of the same optimization strategies. We have used careful control of local timing to optimize our circuits. Unfortunately, accurate timing characterization is a very difficult task. Even worse, we cannot simply “turn down the clock” if we get it wrong!

The problem of timing verification can be broken into three sub-tasks. The first is to identify the timing relations that must hold for correct operation. The second is to determine which circuit modes will exhibit worst-case timing, and the third is to verify that this worst-case behavior will still satisfy the required constraints. The first two tasks were accomplished using a mixture of formal and ad-hoc techniques. Most constraints were required to ensure that data bundling conditions were satisfied at the data latches in the Path and at the selector module in the Port. The verification step was performed using Spice circuit simulation after the schematics had been back-annotated with wire capacitance estimates. Our experience makes it clear that our methodology must be improved in this area.

As with previous chip designs, we have attempted to build in a certain amount of “over-design” where critical timing constraints must be met. In order to design the fastest circuits possible, we must determine how much of the available timing margin can be shaved away while still allowing the chip to operate correctly. The gross inaccuracy of currently available Spice parameters for the MOSIS fabrication process adds greatly to our difficulties in this area. It is our hope that the modularity of the P**3 implementation will eventually make it possible to create large designs that are correct by construction.
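Most of these bundling conditions reduce to a simple race inequality of the following form; the symbols are illustrative, not the notation used in our analysis:

```latex
% Typical bundling constraint at a Path data latch or at the Port selector
% (illustrative symbols): the slowest data bit plus setup margin must settle
% no later than the earliest arrival of the corresponding control event,
t_{\mathrm{data,\,max}} + t_{\mathrm{setup}} \;\le\; t_{\mathrm{control,\,min}} ,
% where each term is evaluated along the physical paths extracted from layout.
```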

8.3. Methodology Issues

Although still ad-hoc in many respects, the P**3 methodology has, we feel, much potential in the design of high-performance asynchronous systems, and we will continue to evaluate and refine it. The experiment reported here is simple, but we have used the same methodology to design and simulate much more complex systems. The most complex design we have specified to date is a simplified counterflow pipeline. This required the use of additional functionality not mentioned in this paper, such as arbitration, arbitrary data computations, and hierarchical specification. The successful transfer of our ideas to silicon gives us confidence that we can implement complex pipelined systems in a systematic way while maintaining high performance. The ease of assembling and modifying our standard cell layouts clearly illustrated the oft-touted modularity benefits of asynchronous design.

8.4. Future Work

One of the biggest challenges we face in developing our methodology is putting it on a firmer theoretical foundation. We are in the process of developing formal semantics for the P**3 module set to allow us to use the expressive and analytical power of existing formalisms. We are also exploring formal models of the delay constraints that are required for correct operation. Our goal is to guarantee that for a given P**3 module set, arbitrary networks may be constructed without violating any timing constraints. This will be possible only if there is a tight coupling to the standard cell place and route mechanism. We currently carry out this task by hand, but there is no reason why commercial CAD tools could not be utilized. We plan to extend our methodology to design more complex systems and continue to improve the underlying module implementations.

9. References

[1] S. M. Burns and A. J. Martin, “Syntax-directed Translation of Concurrent Programs into Self-timed Circuits,” Proc. of the Fifth MIT Conference on Advanced Research in VLSI, pp. 35-50, 1988.

[2] W. A. Clark and C. E. Molnar, “Macromodular Computer Systems,” in Computers in Biomedical Research, Vol. IV, R. Stacy and B. Waxman, eds., Academic Press, New York, 1974.

[3] S. Furber, “Computing without Clocks: Micropipelining the ARM Processor,” in Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis, eds.), Workshops in Computing Series, pp. 211-262, Springer-Verlag, 1995.

[4] S. Hauck, “Asynchronous Design Methodologies: An Overview,” Proceedings of the IEEE, Vol. 83, No. 1, pp. 69-93, January 1995.

[5] A. J. Martin, “The Design of a Delay-Insensitive Microprocessor: An Example of Circuit Synthesis by Program Transformation,” in Hardware Specification, Verification and Synthesis: Mathematical Aspects, M. Leeser and G. Brown, eds., Lecture Notes in Computer Science, Vol. 408, pp. 244-259, Springer, 1989.

[6] C. E. Molnar, I. W. Jones, W. S. Coates, and J. K. Lexau, “A FIFO Ring Performance Experiment,” Proc. of the Third International Symposium on Advanced Research in Asynchronous Circuits and Systems (Async97), pp. 279-289, April 1997.

[7] J. Sparsø and J. Staunstrup, “Delay-insensitive Multi-ring Structures,” Integration, Vol. 15, No. 3, pp. 313-340, October 1993.

[8] R. F. Sproull, I. E. Sutherland, and C. E. Molnar, “The Counterflow Pipeline Processor Architecture,” IEEE Design & Test of Computers, Vol. 11, No. 3, pp. 48-59, Fall 1994.

[9] I. E. Sutherland, “Micropipelines,” Communications of the ACM, Vol. 32, No. 6, pp. 720-738, June 1989.

[10] K. van Berkel, J. Kessels, M. Roncken, R. Saeijs, and F. Schalij, “The VLSI-Programming Language Tangram and Its Translation into Handshake Circuits,” Proc. of the European Conference on Design Automation (EDAC), pp. 384-389, 1991.

[11] K. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and F. Schalij, “A Fully-Asynchronous Low-Power Error Corrector for the DCC Player,” IEEE Journal of Solid-State Circuits, Vol. 29, No. 12, pp. 1429-1439, December 1994.

[12] T. E. Williams, “Performance of Iterative Computation in Self-Timed Rings,” Journal of VLSI Signal Processing, February 1994.

10. Acknowledgments

Special thanks to Igor Benko for his work on the P**3 simulator and the design of our experiment and also to Jo Ebergen, Alex Ridgway and Willem Mallon for reviewing multiple drafts of this paper. The late Charles E. Molnar also provided invaluable technical guidance and inspiration.

Sun Microsystems and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Figure 18: Chip Layout Plot
