Using simple tools to evaluate complex architectural trade-offs


First-year graduate design students learn to quickly evaluate complicated cost-performance processor trade-offs using simple design tools.

Michael J. Flynn, Patrick Hung, and Armita Peymandoust, Stanford University

On 15 April 1999, people around the world sent thousands of e-mails to our group at Stanford University inquiring about the “secret” performance report of the AMD K7 and Intel Coppermine chips. According to The Register Web site in the United Kingdom, there were rumors that some students at Stanford University had tested and compared the two chips using SPEC’s test suite.1,2 At that time, the AMD K7 (now known as Athlon) and the Intel Coppermine (now known as Pentium III) were not available to the general public, and it is easy to understand the excitement about this news. The real story was simpler. The students in our processor design class did not test the real chips. They estimated the cost and performance of the two chips as part of a case study. This study was based on information available from various public sources.3 As the chip implementation details had not been released to the public, the instructors assumed the hardware designs were similar to the simulator default configurations. After comparing the performance and the costs, the students used the simulator tools to design improvements to each chip.

Simple design tools

Designing deep-submicron microprocessors is becoming an increasingly tedious and complicated process.4 It takes many years and many engineers to design a commercial processor chip. In an introductory computer architecture course, students are usually required to simulate a simplified pipelined microprocessor such as DLX5 using a hardware description language (typically Verilog or VHDL) or a graphical tool (such as HASE).6 While it is important for a student to understand how a basic processor operates, students may not fully appreciate the complexity and the various trade-offs involved in designing a processor in a commercial environment. Students cannot afford to write tens of thousands of lines of code to model a processor microarchitecture; it is unproductive to ask them to deal with the myriad details at this stage. Instead, we focus on high-level issues involving cost (area) and performance (execution time). The main issues are cache size, cycle time, floating-point unit (FPU) area, latencies, branch strategy, and issue width. By emphasizing a few primary high-level issues, students gain a better understanding of the trade-offs involved in overall computer architecture design. Table 1 lists the architectural design tools available in our class. Students use these tools in conjunction with the class textbook,7 and the tools are also available to other researchers and students from the URLs listed in the table.


Table 1. Stanford Architecture and Arithmetic Group design tools.

  Tool                  Description                                    URL
  MXS simulator         Simulate superscalar processor                 ftp://arith.stanford.edu/hung/mxs.sim
  ABSS simulator        Simulate multiprocessor environment            ftp://arith.stanford.edu/abss/abss.v2_4.tar.gz
  FUPA                  Calculate FPU area based on specification      http://umunhum.stanford.edu/tools/fupa.html
  CacheOpt              Calculate cache area and access time           http://umunhum.stanford.edu/tools/cachetools.html
  Silicon Design Tool   Calculate die cost                             http://umunhum.stanford.edu/tools/area.html
  Bus Occupancy         Calculate memory bus occupancy                 http://umunhum.stanford.edu/tools/bus.html
  Disk Design           Calculate disk access time and utilization     http://umunhum.stanford.edu/tools/disk.html
  Pipelining            Calculate optimum pipeline stages              http://umunhum.stanford.edu/tools/opt-pipes.html
  Wave Pipelining       Calculate wave pipelining clock speed          http://umunhum.stanford.edu/tools/wave.html

We are still developing these tools; the currently available versions are early-stage alpha or beta releases. The table includes a short description of each design tool. The MXS simulator, developed by James Bennett,8 and the ABSS simulator, developed by Dwight Sunada,9 are execution-based performance simulators. MXS models stand-alone processors, while ABSS emulates multiprocessor environments. ABSS runs on the Solaris operating system, whereas MXS runs on Irix, Linux, and Solaris. Steve Fu10 developed FUPA and CacheOpt, Web-based design tools for calculating the areas and latencies of FPUs and caches. The Silicon Design tool is an analytical wafer cost estimation tool that uses a wafer defect model following a Poisson distribution. Bus Occupancy is a Web-based tool for calculating the occupancy ratio of a multiprocessor memory bus; it uses a Markovian analytical model. The Disk Design Web-based tools calculate disk-related parameters such as disk access time and disk usage. The Web-based Pipelining tool helps estimate the optimum number of pipeline stages, while the Wave Pipelining Web-based tool helps calculate the maximum clock frequency of a wave-pipelining system.
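To give a feel for what a die cost estimator like the Silicon Design tool computes, the sketch below estimates the cost of a good die from wafer cost, die area, and defect density using the classic Poisson yield model. The numbers and the exact cost breakdown are illustrative assumptions, not the tool's actual model or data.

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Rough count of whole dies on a round wafer (edge losses approximated)."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson defect model: probability that a die has zero fatal defects."""
    return math.exp(-die_area_mm2 * defects_per_mm2)

def die_cost(wafer_cost: float, wafer_diameter_mm: float,
             die_area_mm2: float, defects_per_mm2: float) -> float:
    """Cost of one good die = wafer cost / (dies per wafer * yield)."""
    good_dies = dies_per_wafer(wafer_diameter_mm, die_area_mm2) \
                * poisson_yield(die_area_mm2, defects_per_mm2)
    return wafer_cost / good_dies

# Illustrative values only: a 200 mm wafer, a 120 mm^2 die, 0.005 defects/mm^2.
print(f"cost per good die: ${die_cost(5000, 200, 120, 0.005):.2f}")
```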

High-level architectural design

Most commercial microprocessor designs are based on generations of implementations. Even these designs rely on an ad hoc process that produces a partition of the available die area among the functional blocks with the goal of minimizing total application execution time.


Logic designers, circuit designers, and layout engineers then implement the architecture, with design revisions to correct functional flaws. At the end of the logic optimization phase, they arrange the functional blocks into a floorplan and either manually or automatically route their interconnections. Once all the modules are placed and routed, accurate die size, power consumption, and cycle time figures are generated, and the design is revised if the actual performance is unacceptable or the die size is too big. Figure 1 shows a simplified three-phase microprocessor design flow that requires many iterations at each level of design abstraction before the eventual convergence to the target die size, cycle time, and power consumption.

If we follow the design flow outlined in Figure 1, the architectural modeling is completed and then implemented as a functional and behavioral specification. After functional simulation, the floorplanning process begins. It is coupled with the physical design, which determines the die size (and cost), the power, and the cycle time. For an initial microprocessor design, as in the case of the student project, this can be an impossible process. Not only is the specification process intractable, but the design result (such as unacceptable cycle time, power, or cost) is determined late in the exercise. The alternative is to have high-level processor specifications for the various functional components: cache, FPUs, and core processor. As shown in Figure 2, these tools allow an evaluation of the impact of functional unit size on cycle time as well as providing an estimate of some performance parameters (cache miss rates, weighted execution cycles for FPUs). The resultant data is useful for determining cycle time and area but is less useful for determining absolute performance, as the processor is designed to be latency tolerant. However, students can enter the cycle counts and cache organization determined by the functional unit tools into our MXS simulator, which then more accurately determines the actual performance.


Performance and area estimation

To simplify the design trade-off space, we partition the microprocessor die area into four functional pieces: core processor; caches; FPUs; and I/O pins, drivers, and buses. Each of these pieces has an area-performance trade-off. The goal of the design exercise is to optimize the realized performance (for the specified benchmarks) under a total area constraint. While power is another important design parameter, the power density is preestablished in the early version of our tools and is not part of the design exercise. In our methodology, students can first use MXS to determine idealized performance (no cache misses, no floating-point delays). A more realistic picture of performance emerges as the details of the caches and FPUs are determined. By adjusting the issue width or the branch table size, students can also vary the core processor area.
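As a rough illustration of this kind of area-constrained optimization, the sketch below enumerates splits of a fixed die budget between cache and FPU area and picks the split with the lowest estimated CPI. The area-to-performance curves are invented placeholders chosen only to make the example run; in the class, MXS, FUPA, and CacheOpt supply the real numbers.

```python
# Hypothetical models: larger caches lower the miss penalty contribution to CPI,
# and more FPU area lowers floating-point stall cycles. Both curves are stand-ins.
BASE_CPI = 0.7          # idealized CPI from MXS (no misses, no FP delays) -- assumed
TOTAL_AREA = 100.0      # die budget for cache + FPU + core, in mm^2 -- assumed
CORE_AREA = 40.0        # area already committed to the core processor -- assumed

def cache_cpi_penalty(cache_area_mm2: float) -> float:
    """Placeholder: the miss-rate penalty shrinks as cache area grows."""
    return 1.5 / (1.0 + cache_area_mm2)

def fpu_cpi_penalty(fpu_area_mm2: float) -> float:
    """Placeholder: the FP stall penalty shrinks as FPU area grows."""
    return 0.8 / (1.0 + 0.5 * fpu_area_mm2)

def best_split(step: float = 1.0):
    budget = TOTAL_AREA - CORE_AREA
    candidates = []
    cache = 0.0
    while cache <= budget:
        fpu = budget - cache
        cpi = BASE_CPI + cache_cpi_penalty(cache) + fpu_cpi_penalty(fpu)
        candidates.append((cpi, cache, fpu))
        cache += step
    return min(candidates)   # lowest estimated CPI wins

cpi, cache, fpu = best_split()
print(f"cache {cache:.0f} mm^2, FPU {fpu:.0f} mm^2 -> estimated CPI {cpi:.3f}")
```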


[Figure 1. Typical microprocessor design flow. Product requirements drive architectural modeling and optimization, which produce an architecture specification; design entry is followed by functional simulation (revise if functionally incorrect), logic synthesis, gate optimization, circuit design, and integration; floorplanning, placement and routing, RC extraction, and physical optimization then determine die size, power, and cycle time (revise if the design does not meet its power, cycle time, or die size targets).]

MXS

There are a number of techniques for estimating processor performance; the most common are analytical methods, trace-driven simulation, and execution-based simulation. The analytical method is the simplest and fastest way to estimate processor performance, whereas execution-based simulation is usually the slowest and the most detailed. There is an obvious trade-off between the level of detail and the simulation speed. In trace-driven simulation, students use the output of an actual program run to drive the simulation.11 Tracing tools can capture instruction streams as well as memory reference streams, and these streams can be used to evaluate processor performance. On the other hand, traces may not capture all of the processor's behavior; for example, speculatively generated loads may not show up in a memory reference trace.
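To make the trace-driven idea concrete, the sketch below replays a captured memory reference trace through a tiny direct-mapped cache model and reports the miss rate. The trace format and cache parameters are assumptions for illustration and are not tied to any of the tracing tools cited above.

```python
def simulate_direct_mapped(trace, cache_size=32 * 1024, line_size=32):
    """Replay a list of byte addresses through a direct-mapped cache model."""
    num_lines = cache_size // line_size
    tags = [None] * num_lines          # one tag per cache line
    misses = 0
    for addr in trace:
        block = addr // line_size
        index = block % num_lines
        tag = block // num_lines
        if tags[index] != tag:         # miss: fill the line
            tags[index] = tag
            misses += 1
    return misses / len(trace)

# A toy "trace": sequential reads plus a few conflicting addresses.
trace = [i * 4 for i in range(10000)] + [0, 32 * 1024, 0, 32 * 1024]
print(f"miss rate: {simulate_direct_mapped(trace):.4f}")
```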

[Figure 2. Iterative design process. High-level architectural design feeds power estimation, area estimation, and performance simulation, and the evaluation of their results feeds back into the architectural design.]

The application drives an execution-based simulation.8 These simulators read the application into memory and then simulate the execution of the processor. With this approach, the processor and the memory subsystem can be fully simulated, capturing the interaction between memory and a dynamically scheduled processor.


MXS is an execution-based superscalar simulator that can execute real programs and display the results. (The students are usually more interested in executing programs than handling program traces.) Figure 3 shows the simulation framework. The MXS compiler first processes the benchmark (a MIPS binary). MXS then simulates the MXS compiler's output. To achieve reasonable simulation speed, MXS uses a combination of cycle-by-cycle and event-driven simulation. In each cycle, the simulator performs operations that repeat every cycle, such as instruction fetch, register renaming, instruction issue, and graduation. In addition, it also checks the event queue (work list) for operations that complete in that cycle.

[Figure 3. MXS simulation framework. The MXS compiler takes the benchmark (MIPS binary) and the processor and memory subsystem configuration and produces a compiled benchmark; the MXS simulator runs the compiled benchmark and produces the results of simulation.]

When an instruction is fetched into the instruction window, its registers are renamed, and the instruction is added to the graduation queue. When dependencies of an instruction in the instruction window have been resolved, the instruction is issued to a functional unit or the load/store unit. Before a load or store instruction is issued, the simulator checks the load/store buffer to see if there is available room. The graduation queue maintains in-order graduation of instructions.

Figure 4 shows a simplified diagram of the simulator architecture. MXS implements a dynamic scheduling model that supports precise exceptions. The mechanism used to implement precise exceptions is similar to the technique used on the MIPS R10000 processor.12 It also supports speculative execution and branch prediction. The branch prediction mechanism consists of a branch prediction table, a return stack for handling call/return pairs, and a table of branch addresses for predicting the target address of an indirect branch.

[Figure 4. MXS simulator architecture. Instructions flow through a prioritized queue and the load/store queue into a work list that drives instruction processing, cache access, and execution, with cache-busy and bus-transaction events fed back. The dashed lines represent the simulator execution loop; solid lines represent data flow. The legend distinguishes instruction states (waiting, executing, ready, not ready) and actions (operation, load/store).]
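The combination of cycle-by-cycle and event-driven simulation described above can be pictured as a main loop that does the per-cycle work (fetch, rename, issue, graduate) and then drains an event queue of operations completing in that cycle. The sketch below is structural only; the class names and fields are invented and are not MXS's actual data structures.

```python
import heapq
from itertools import count

class SketchSimulator:
    """Toy skeleton of a combined cycle-by-cycle / event-driven core loop."""

    def __init__(self):
        self.cycle = 0
        self._seq = count()            # tie-breaker so heap entries never compare ops
        self.work_list = []            # (completion_cycle, seq, op) min-heap
        self.graduation_queue = []

    def schedule(self, op, latency):
        """Event-driven part: record when an in-flight operation completes."""
        heapq.heappush(self.work_list, (self.cycle + latency, next(self._seq), op))

    def step(self):
        """Cycle-by-cycle part: the work that repeats every cycle."""
        self.fetch_and_rename()
        self.issue_ready_instructions()
        # Drain every operation whose completion time is this cycle.
        while self.work_list and self.work_list[0][0] == self.cycle:
            _, _, op = heapq.heappop(self.work_list)
            self.mark_complete(op)
        self.graduate_in_order()
        self.cycle += 1

    # Placeholders standing in for the real pipeline stages.
    def fetch_and_rename(self): pass
    def issue_ready_instructions(self): pass
    def mark_complete(self, op): pass
    def graduate_in_order(self): pass
```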


MXS allows students to visualize the instruction stream when it runs on the Irix operating system. For visualization, instructions are divided into three categories: branch instructions, load and store instructions, and ALU operations. Triangles, circles, and squares represent these three categories. Dependencies between instructions are shown as lines between them, with the dependent instruction placed below the instruction it depends on. Figure 5 shows the MXS visualization when executing the FFT benchmark.

[Figure 5. Visualization of FFT execution.]

MXS has been incorporated into the SimOS simulator,13 which is a complete machine environment. SimOS simulates both the hardware characteristics (processor architecture, disk seek time) and the software characteristics (operating system, application program) of a computer system.

FUPA

With the increasing integration offered by technology scaling, microprocessor designers have gone from software emulation of floating-point (FP) operations to dedicated FP chips, to on-chip FPUs, and, finally, to multiple FPUs on a chip. At the same time, the latencies of most FP operations have shrunk from hundreds of cycles to only a few cycles. The need for fast FP execution is due in part to multimedia application requirements, but the allocation of die area to FPUs remains an art based solely on engineering intuition and past experience. The Floating-Point Unit Cost Performance Analysis (FUPA) metric supports quantitative trade-offs between performance and cost.

FPU design requires the underlying technology to meet the computation and communication complexity of the algorithm. From a cost perspective, the designer sets the floorplan of the available die area and divides the power budget by considering the performance benefit of allocating more die area to a specific operation. FUPA integrates both cost and performance into simple and intuitive formulas for determining the optimality of an FPU design. Consequently, FUPA enables the first quantitative comparison of microprocessor FPUs.

Without a metric like FUPA, the task of comparing FP architectures and implementations is a very difficult one. Previous work made comparisons based on 1) the algorithmic level, by considering the number of execution cycles, and 2) the circuit level, by counting the number of transistors and measuring critical-path gate delays. However, the absence of any one of the four important aspects (latency, die area, minimum feature size, and profile of applications) renders most of these comparisons unconvincing. FUPA's strength is its ability to integrate the four aspects in a seamless way.

One key aspect is the application profile. Software applications are constantly evolving, and microprocessor designers are constantly attempting to optimize microprocessors to execute applications faster. FUPA uses the application profile to measure the marginal utility of die area and design effort for FPUs. For example, since square root tends to be less frequent than additions and multiplications, it is inappropriate to dedicate 40% of the FPU area to the fastest square root algorithm. Effective latency (EL) is used to capture the application profile:

$$EL = \mathit{CycleTime} \cdot \sum_{i=1}^{n} OP_{i\text{-}latency} \cdot OP_{i\text{-}distribution}$$

Here, $OP_{i\text{-}latency}$ is the number of cycles taken to perform each FP instruction, $OP_{i\text{-}distribution}$ is the dynamic distribution of each FP instruction, and $\mathit{CycleTime}$ is the processor cycle time.

To compare different FPU implementations properly with the FUPA metric, the metric should be independent of the process technology. Advancing process technology increases circuit, power, and wire densities and lowers intrinsic gate-switching times. If comparisons are made between implementations in different process technologies, there is no obvious way to distinguish between technology improvements and design improvements. We can calculate delay and area scale factors of a process technology based on its minimum feature size. FUPA uses these scale factors to normalize the area and delay of FPUs.


With the components of FUPA in place, the following equations summarize the computation of FUPA, defining the normalized area and the normalized effective latency:

$$NArea = \frac{Area}{Area_{sf}}$$

$$NEL = \frac{EL}{Delay_{sf}}$$

$$FUPA = \frac{NEL \cdot NArea}{100}$$

Here, $Area_{sf}$ and $Delay_{sf}$ are the area and delay scale factors of the given process technology, and $Area$ is the die area of the FPU excluding the register file. $NArea$ is the normalized die area, and $NEL$ is the normalized effective latency. FUPA includes all four key aspects of FPU design: latency, die area, process technology, and application profile. The interpretation of FUPA is simple: a lower FUPA corresponds to a more cost-effective implementation. We developed a Web-based FPU design tool based on the FUPA metric, as shown in Figure 6. The Web site keeps a database of normalized areas and effective latencies of the FPU functional blocks. Students can calculate the FPU area and the CPI loss from the FPU latencies and the distribution of operations.

[Figure 6. Floating-point unit area. The Web form takes the processor clock cycle time; the latencies (in clock cycles) and dynamic distributions of FP adds, multiplies, and divides; and the drawn and effective feature sizes, and reports the estimated FPU area (in mm^2) and the CPI loss.]
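A small numeric sketch of the EL and FUPA calculations is given below. The operation latencies, distributions, scale factors, and FPU area are made-up values for illustration; the FUPA Web tool's database supplies real ones.

```python
def effective_latency(cycle_time_ns, op_profile):
    """EL = CycleTime * sum(latency_i * distribution_i) over FP operation classes."""
    return cycle_time_ns * sum(lat * dist for lat, dist in op_profile.values())

def fupa(el_ns, fpu_area_mm2, delay_sf, area_sf):
    """FUPA = (normalized EL * normalized area) / 100; lower is more cost-effective."""
    nel = el_ns / delay_sf
    narea = fpu_area_mm2 / area_sf
    return (nel * narea) / 100.0

# Illustrative profile: (latency in cycles, dynamic distribution) per FP operation.
profile = {"fadd": (3, 0.45), "fmul": (3, 0.40), "fdiv": (18, 0.15)}
el = effective_latency(cycle_time_ns=2.0, op_profile=profile)
print(f"EL = {el:.2f} ns,  FUPA = "
      f"{fupa(el, fpu_area_mm2=12.0, delay_sf=1.0, area_sf=1.0):.2f}")
```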

CacheOpt

CacheOpt lets students lay out a trial cache by defining a cache organization. Figure 7 shows the Web interface of CacheOpt. Based on this trial layout, students can estimate the cache area and extract the parasitic components in the cache design. The tool can then estimate the access time and the cache cycle time. Students can experiment with various cache organizations to find the one that best fits the cycle time and power requirements of the design budget. CacheOpt also provides the expected miss rate (the DTMR, or design target miss rate) based on a simple table lookup of stored cache data. It corrects for the number of instructions between task switches (the quantum length) and for instruction set code density (as for Intel versus MIPS processors). This latter information simply serves as a guideline for the cache designer, since MXS can assess the performance of the cache organization more accurately.
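The sketch below derives the basic cache organization numbers (sets, field widths, total SRAM bits) that a trial layout starts from. It is textbook cache arithmetic under assumed parameters, not CacheOpt's area or timing model.

```python
import math

def cache_geometry(cache_bytes, line_bytes, ways, address_bits=32):
    """Return sets, field widths, and total SRAM bits for a set-associative cache."""
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = int(math.log2(line_bytes))
    index_bits = int(math.log2(sets))
    tag_bits = address_bits - index_bits - offset_bits
    # Data bits plus a tag and a valid bit per line (no dirty/LRU bits in this sketch).
    total_bits = sets * ways * (line_bytes * 8 + tag_bits + 1)
    return {"sets": sets, "offset_bits": offset_bits, "index_bits": index_bits,
            "tag_bits": tag_bits, "total_bits": total_bits}

# Example: a 32-Kbyte, 2-way set-associative cache with 32-byte lines.
print(cache_geometry(32 * 1024, 32, 2))
```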

Coppermine versus K7: A case study

Early in 1999, students used these tools to evaluate two competing processors, Intel's Coppermine and AMD's K7. Because there was not enough public information available about the two processors, this was not a complete comparison. However, the purpose of the project was to show students how the design parameters of a commercial processor could be determined while considering the possible trade-offs. First, the students compared the performance of the two chips. They then scaled the AMD chip from a 0.25-micron process to a 0.18-micron process and tried to change some design parameters to outperform the Intel chip. They based the simulator parameters of the two processors on public information,3 and substituted the MXS default configurations for missing implementation details. The simulator ran three SPEC benchmarks and one alternative benchmark provided to the students. The SPEC benchmarks did not exercise much cache memory, but the alternative benchmark was designed to exercise the caches. Figure 8 shows MXS in action when running the Uncompress, espresso, and fft benchmarks. After the simulation, MXS reported the IPC (instructions per cycle), level-one cache miss rates, bus usage, branch miss rate, number of fetch stalls, number of issue stalls, and other important simulation statistics.

At this point, the students realized that the performance comparison depended very much on the benchmark program. Coppermine performed better than the K7 on the SPEC benchmarks, but the K7 ran better on the alternative benchmark. SPEC benchmarks have become the standard way of reporting performance, but it is possible to tune a processor to these benchmarks. To come up with benchmark-independent performance numbers, we asked the students to calculate IPC and bus utilization based on the design target miss rates (DTMR).7 Using the Web-based tools, students could easily determine the adjusted DTMR, the cache miss rate, the cache area, and the access time.

Up to this point, the students had considered only a uniprocessor environment. The next step was to evaluate multiple processors sharing the same memory bus. The students calculated the relative performance loss caused by bus contention using the Web-based Bus Occupancy tool.

The students then considered cost (or area). A figure of merit for a design was defined as Cost × Adjusted CPI × Cycle Time. Given the wafer cost, testing cost, packaging cost, feature size, and core area of each processor, we asked the students to estimate the costs of the two processors using FUPA and CacheOpt. The K7 level-2 cache was assumed to be off chip, but since the Coppermine cache was on chip, the K7 packaging cost was higher.

[Figure 7. Cache design tools (CacheOpt). The Web form takes the associativity, cache size in Kbytes, line size in bytes, memory bus transfer unit, number of read/write ports, write hit policy (write through or write back), cell type (6T or 4T), ISA style (R/M, R+M, or L/S), quantum length in instructions, cache type (data or unified), level of multiprogramming, and technology feature size, and reports the estimated area, access latency, and miss rate.]

[Figure 8. MXS's report window showing the processor's statistics.]


Intel manufactured Coppermine using a 0.18-micron process, while AMD manufactured the K7 using a 0.25-micron process, and there was a big difference in cost and performance between the two process technologies. After calculating the figure of merit for the two processors, students scaled the K7 to a 0.18-micron process and then explored the new design space available to it. Possible architectural changes included increasing the reorder buffer size, improving the FPU latencies, changing the cache sizes, and moving the level-2 cache on chip.
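The scaling step and the figure of merit lend themselves to a short calculation. The sketch below applies idealized constant-field scaling (area scales with the square of the feature-size ratio, cycle time scales roughly linearly) and then combines cost, adjusted CPI, and cycle time into the class's figure of merit. The scaling rule is a simplification, and all of the input numbers are placeholders rather than measured K7 or Coppermine data.

```python
def scale_design(area_mm2, cycle_time_ns, old_feature_um, new_feature_um):
    """Idealized shrink: area scales as (new/old)^2, cycle time roughly as (new/old)."""
    s = new_feature_um / old_feature_um
    return area_mm2 * s * s, cycle_time_ns * s

def figure_of_merit(cost_dollars, adjusted_cpi, cycle_time_ns):
    """Class figure of merit: Cost x Adjusted CPI x Cycle Time (lower is better)."""
    return cost_dollars * adjusted_cpi * cycle_time_ns

# Placeholder design point, scaled from a 0.25-micron to a 0.18-micron process.
area, cycle = scale_design(area_mm2=180.0, cycle_time_ns=1.8,
                           old_feature_um=0.25, new_feature_um=0.18)
print(f"scaled area {area:.0f} mm^2, scaled cycle time {cycle:.2f} ns")
print(f"figure of merit: "
      f"{figure_of_merit(cost_dollars=90.0, adjusted_cpi=1.1, cycle_time_ns=cycle):.1f}")
```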

In more advanced processor design courses, the problem is to create interesting design trade-off studies without students having to enter innumerable processor specifications and functional and behavioral details. Presumably students have seen these lower-level functional specifications in earlier courses; now the sheer extent of the specification is a daunting hurdle to trade-off studies. Tools that hide some of these details are especially useful, so long as the essential parameters are still available. By coupling a set of default values with a relatively robust base processor design, the MXS simulator can perform an execution-based simulation of benchmark programs. Data collected from simulator runs can guide students in optimizing parameters such as issue width, branch table size and structure, and cache-memory organization. Other tools such as CacheOpt can support the cache design effort.

From a course perspective, the issue is to find cost (area) and performance design points that are optimum for particular benchmarks. The problem is usually expressed as one in which the area is fixed, and the student must find the best allocation of area between the cache and the floating-point units so that performance is optimized. In carrying out these studies, students must actively manage the cycle time; as cache sizes become large, the tools must recognize that multicycle access may be required. In the end, the project's value depends on the comprehensiveness of the alternatives studied and the resulting processor performance. As processors become increasingly complex, tools such as those outlined here become essential for students to have an interesting yet manageable design experience.


References

1. M. Magee, "Intel Will Pay K for AMD's K7," http://www.theregister.co.uk.
2. M. Magee, "Raw Coppermine, K7 Benchmarks Found in School," http://www.theregister.co.uk.
3. Microprocessor Report, various issues, 1994-99.
4. M.J. Flynn, P. Hung, and K.W. Rudd, "Deep-Submicron Microprocessor Design Issues," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 11-22.
5. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Mateo, Calif., 1996.
6. P.S. Coe et al., "A Hierarchical Computer Architecture Design and Simulation Environment," ACM Trans. Modeling and Computer Simulation (TOMACS), Vol. 8, No. 4, 1998, pp. 431-446.
7. M.J. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett Publishers, Boston, 1995.
8. J.E. Bennett, Latency Tolerant Architectures, PhD thesis, Computer Science Dept., Stanford University, Stanford, Calif., 1998.
9. D. Sunada, D. Glasco, and M. Flynn, "ABSS v2.0: A SPARC Simulator," Tech. Report CSL-TR-98-755, Computer Science Dept., Stanford University, 1998.
10. S. Fu, Cost Performance Optimization of Microprocessors, PhD thesis, Computer Science Dept., Stanford University, 1999.
11. R. Uhlig and T. Mudge, "Trace-Driven Memory Simulation: A Survey," ACM Computing Surveys, Vol. 29, June 1997, pp. 128-170.
12. K.C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, Mar.-Apr. 1996, pp. 28-40.
13. M. Rosenblum et al., "Complete Computer System Simulation: The SimOS Approach," IEEE Parallel & Distributed Technology: Systems & Applications, Vol. 3, 1995, pp. 34-43.

Michael J. Flynn is a professor of electrical engineering at Stanford University. He was founding chair of both the ACM Special Interest Group on Computer Architecture and the IEEE Computer Society's Technical Committee on Computer Architecture. Flynn received a PhD from Purdue University. He was the 1992 recipient of the ACM/IEEE Eckert-Mauchly Award and the 1995 recipient of the IEEE Computer Society's Harry Goode Memorial Award.

Patrick Hung is a PhD candidate in the Stanford Architecture and Arithmetic Group at Stanford University. His research interests include computer arithmetic, microprocessor architecture, and deep-submicron CAD tool design. Hung received a BSEE from the University of Hong Kong and an MSEE from Stanford University, and anticipates receiving a PhD in 2000. He is a student member of the IEEE.

Armita Peymandoust is a PhD candidate in the Electrical Engineering Department at Stanford University. Previously, she held a design engineer position on the IA-64 product line at Intel Corporation. Her research interests include system-level design and synthesis, hardware/software codesign, and design reuse. Peymandoust received a BSEE from the University of Tehran and an MSEE from Northeastern University.

Send questions concerning this article to Patrick Hung, Stanford University, Gates 332, 353 Serra Mall, Stanford, CA 94305; [email protected].

