IBM System z10 processor cache subsystem microarchitecture


P. Mak, C. R. Walters, and G. E. Strait

With the introduction of the high-frequency IBM System z10 processor design, a new, robust cache hierarchy was needed to enable up to 80 of these processors, aggregated into a tightly coupled symmetric multiprocessor (SMP) system, to reach their performance potential. Typically, each time the processor frequency increases by a significant factor, as it did from the predecessor IBM System z9 processor to the z10 processor, the access time of data beyond the level 1 cache, measured in processor cycles on an identical processor cache subsystem, increases proportionally as well, because the flight time over the chip interconnects across multiple hardware packaging levels stays relatively constant in nanoseconds. To address this latency scaling problem and the increased demand of the larger 80-way SMP size, the z10 processor cache subsystem introduces new innovative concepts and solutions.

Introduction

The IBM System z10* platform, the latest IBM System z* enterprise server, features up to 80 processors operating at 4.4 GHz [1], 1.5 TB of physical memory, and 32 I/O hubs to provide further increases in single-processor thread performance and system-level performance over the predecessor z9* system. These design points were responsible for the new extendable processor cache subsystem design of the z10 and future System z processors. The architecture of the processor cache subsystem is traditionally defined by analyzing performance simulation results of a set of hardware traces collected on prior systems running Large Systems Performance Reference (LSPR) workloads [2]. From these simulation results, we can observe the caching dynamics of processor activities. Some of the key caching dynamics include high data sharing, high exchange rate of exclusive data ownership, and high sensitivity to long access latency. Given these dynamics, the following design points were ultimately set for the z10 cache subsystem:

- Fully connected topology for multiple processor unit (PU) books, to keep off-book data access latency short. (A PU book is the physical packaging method used to contain such elements as system processors, cache, memory cards, and I/O connectors.)
- Within each PU book, a three-level cache hierarchy to improve cache hits. The first two levels of caches (L1 and L1.5) are private to each processor, and the top-level cache (L2) is a large centralized 48-MB shared cache.
- Integrated L2 cache with a PU-book-level crossbar switch; both are controlled by the system coherency management function.

While the large centralized shared cache approach is not novel for System z servers [3–5], it does remain quite distinct from other server offerings in the industry, such as the Sun Fire** E25K [6], HP Integrity Superdome** servers [7], and IBM POWER6* microprocessor-based systems [8], which have multilevel caches but without a large centralized shared cache. Thus, this paper focuses mainly on the cache management and coherency protocol between each of the large centralized L2 shared caches. System z family designs prior to the z10 server primarily had an on-chip first-level cache that was private to that processor and an off-chip second-level cache shared by a subset of the processors in the system [3, 4, 6].

Copyright 2009 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. 0018-8646/09/$5.00 © 2009 IBM


Figure 1: Comparison of z9 and z10 processor cache subsystem structures. In the z9 64-PU system, each PU book contains eight dual-PU CP chips attached to SCD, SCC, and MSC chips, memory, and 4 GX+ buses, and the books are connected by two unidirectional concentric rings. In the z10 80-PU system, each PU book contains five quad-PU CP chips (each with four 3-MB L1.5 caches, a COP, and MC and GX controllers) attached to two SC chips, memory, and GX++ buses, with off-book interconnects to the other books. (PU: processor unit; CP: central processor; COP: coprocessor; GX: internal bus connecting I/O hub cards; MC: memory controller; SCD: system control data; MSC: main storage controller; SC: system controller.)

On the latest tenth CMOS (complementary metal-oxide semiconductor) generation (the "10" in "z10") processor subsystem, a new three-level cache hierarchy is featured by introducing an on-chip processor-private second-level (L1.5) cache design [9] and making the off-chip shared-level cache now a third-level cache, though it is still denoted as the L2 cache. Because the shared L2 cache is now farther from the processor in both distance and processor cycles, the new L1.5 offers a performance benefit as a relatively low-latency large cache on the same chip die, physically located next to the processor and L1 cache.

In addition to a new cache hierarchy, innovative cache management techniques across the entire z10 cache hierarchy were introduced, as were enhancements to the processor subsystem topology and protocol. These improvements aimed to boost performance by reducing access latencies. To affirm the z10 processor cache subsystem design, it was observed from actual hardware performance measurements running LSPR workloads that approximately 90% of all L1 miss fetch requests entering the cache subsystem were satisfied by the L1.5 and L2 caches. It was further observed that the system buses typically maintain utilization rates below 30%, and queuing delays within the system did not have a noticeable impact on performance.

Processor cache subsystem organization

A comparison of the System z10 and System z9* processor cache subsystems is shown in Figure 1, displaying the differences in the interbook and intrabook topologies. Both processor cache subsystems have four PU books in their maximum configuration and they have the same PU book elements, but they are integrated differently across the chip dies. Some of these elements make up the processor subsystem and include some number of processors, a system controller element (SCE) that manages system-wide coherency and the large centralized shared cache, some number of memory controllers, and some number of I/O controllers.

With the availability of 65-nm technology, it became possible on the z10 design to integrate all of the processor subsystem within the PU book onto two chips: the CP (central processor) and the SC (system controller), which includes both the L2 cache and storage controller functions. This is in comparison to the five chips in the z9 design: the CP, the SCC (system control chip), the SCD (system control data), the MSC (main storage controller), and the clock. To balance the number of chip signals between the CP and SC chips, the I/O controller and MC (memory controller) functions were moved from the z9 MSC chip to the z10 CP chip. The z9 SCC and SCD chips were combined into a single z10 SC chip. The clock chip function is also now distributed across the CP and SC chips. Reducing the PU book elements to a two-chip design provided two main benefits: data moves over fewer chip crossings, and the overall development expense and bill-of-materials cost are reduced.

The two SC chips and five CP chips reside on a single 95-mm × 95-mm glass ceramic multichip module (MCM). The MCM, along with 48 pluggable DIMMs (dual inline memory modules) and 8 I/O card slots, resides on the PU book. These PU books plug into a passive backplane, a board that serves to interconnect the PU books without using any active circuitry. A system can be configured with up to four PU books plugged in.

The z10 processor cache subsystem has up to four PU books that are joined in a fully connected topology, which, from a comparative performance standpoint, provides better average cache intervention latencies from the L2 cache of a remote PU book than the z9 dual-ring topology. One important distinction between the z9 and z990 [3] multibook interconnect topology and that of the z10 is that in configurations with fewer than four PU books, the z9 and z990 dual-ring topology required at least one jumper or passive card to bridge communications and dataflow between two PU books in diagonal positions. The z10 fully connected topology eliminates the jumper card requirement.

The heart of the processor cache subsystem is the SCE function, which resides mainly on each of the SC chips and is primarily responsible for providing the following important functions:


- Interconnect 20 processors on a single PU book into a coherent system.
- Interconnect physically and coherently the four PU books of 20 processors each by way of a fully connecting point-to-point set of fabric buses.
- Contain the L2 cache, which is shared by up to 20 processors residing within the same PU book.
- Present a unified single image of system main memory to the operating system (OS).
- Provide system-level coherency functions for data sharing between processors and I/O in a strongly consistent system architecture, ensuring that the processors are always working with the most recent copy of data.
- Provide a conduit for interprocessor and processor–I/O communications.

Figure 2: Comparison of z9 and z10 processor cache subsystem structures. In the z9 hierarchy, sixteen L1 caches (256-KB I + 256-KB D, 4-way set-associative, 256-B cache lines) sit under a 40-MB shared L2 (inclusive of the L1s, 20-way set-associative, 256-B cache lines), one of four L2s on the snoop fabric interface. In the z10 hierarchy, twenty store-through, parity-protected L1s (64-KB I, 4-way set-associative; 128-KB D, 8-way set-associative; 256-B cache lines) and their store-through, ECC-protected 3-MB L1.5s (inclusive of the L1, 12-way set-associative, 256-B cache lines) sit under a store-in, ECC-protected 48-MB shared L2 (inclusive of the L1/L1.5, 24-way set-associative, 256-B cache lines) attached to the snoop fabric interface. (I: instruction cache; D: data cache; ECC: error-correcting code.)

Cache memory hierarchy

A comparison between the z9 and z10 cache hierarchies is shown in Figure 2. The z10 cache hierarchy consists of the private split 64-KB L1 instruction cache (I-cache) and 128-KB L1 data cache (D-cache), the private 3-MB L1.5 cache, a book-level 48-MB shared L2 cache, and support for up to 1.5 TB of distributed main memory in a fully configured four-PU-book system.

Table 1  Comparison of z9 and z10 bus bandwidths.

Bus                             | z9 bus width and speed | z9 peak bandwidth     | z10 bus width and speed          | z10 peak bandwidth
CP data into SC chip            | 32 B @ 1.7 Gb/s        | 54.4 GB/s             | 16 B @ 2.933 Gb/s                | 47 GB/s
CP data from SC chip            | 32 B @ 1.7 Gb/s        | 54.4 GB/s             | 16 B @ 2.933 Gb/s                | 47 GB/s
Memory data into a PU book      | 32 B @ 1 Gb/s          | 32 GB/s               | 32 B @ 2.132 Gb/s (DDR2 533 MHz) | 68.3 GB/s
Memory bandwidth from a PU book | 32 B @ 1 Gb/s          | 32 GB/s (data only)   | 16 B @ 2.132 Gb/s (DDR2 533 MHz) | 34.15 GB/s (address and data)
Fabric bandwidth into a PU book | 32 B @ 850 Mb/s        | 27.2 GB/s (data only) | 48 B @ 1.466 Gb/s                | 70.3 GB/s (address and data)
Fabric bandwidth from a PU book | 32 B @ 850 Mb/s        | 27.2 GB/s (data only) | 48 B @ 1.466 Gb/s                | 70.3 GB/s (address and data)

The L1 and L1.5 caches are considered part of the processor domain and are described in more detail by Shum et al. [9], in this issue. As in prior systems, the cacheable unit of memory data throughout the entire z10 cache hierarchy continues to be 256 bytes.

The z10 L1 cache retains the traditional store-through policy adopted since the IBM ES9000* H2 system for improved error recovery. In light of the new cache hierarchy, processor memory updates are now performed immediately in the L1 data cache, then in the new L1.5 cache, and finally in the L2 cache, protected by error-correcting code (ECC), all in an atomic manner in order to ensure the correct order of memory updates. This enhanced store-through policy is integral to maintaining robust z10 hardware reliability because a hardware error at the processor, L1, or L1.5 can be tolerated without loss of data integrity.

In our implementation of the strongly consistent z/Architecture*, the processor and L1 manage data consistency; that is, a processor acquires exclusive ownership of the memory address as a prerequisite to updating the data. Once an update is complete, other processors may then obtain the updated data by requesting that the storing processor relinquish exclusive ownership. The storing processor then completes its updates, allows the data to change from an exclusive state to a read-only state (in which it can be shared), and allows other processors to only inspect the data. Future updates will then require the new storing processor to first obtain exclusive ownership, a process that involves invalidating all shared copies held by other processors. Thus, updates to the same location happen sequentially and cumulatively, ensuring that no update is lost and that, at the conclusion, the accumulation of all updates will have adhered to the programming order as observed by each processor.
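
To make the store ordering concrete, the following toy model illustrates the rule that exclusive ownership must precede an update and that the update then flows through L1, L1.5, and L2 as one atomic sequence. It is a simplified sketch, not the z10 implementation: the four-processor size, the flat state arrays, and the omission of the read-only demotion path are assumptions made for brevity.

    #include <stdio.h>

    #define CPUS 4                        /* toy size; a real book has 20 */

    typedef enum { INVALID, READ_ONLY, EXCLUSIVE } state_t;

    /* Private, store-through levels per processor plus one shared,
     * store-in L2 for the book. */
    static state_t l1[CPUS], l15[CPUS];
    static state_t l2 = READ_ONLY;

    /* A processor must hold the line exclusive before storing; read-only
     * copies elsewhere are invalidated first, so updates to one location
     * serialize and none can be lost. */
    static void acquire_exclusive(int cpu)
    {
        for (int i = 0; i < CPUS; i++)
            if (i != cpu)
                l1[i] = l15[i] = INVALID;
        l1[cpu] = l15[cpu] = l2 = EXCLUSIVE;
    }

    /* Enhanced store-through: the update is applied to L1, then L1.5,
     * then the ECC-protected L2 as one atomic sequence, so an error in
     * the processor, L1, or L1.5 never costs committed data. */
    static void store(int cpu)
    {
        if (l1[cpu] != EXCLUSIVE)
            acquire_exclusive(cpu);
        /* ... write the data into L1[cpu], then L1.5[cpu], then L2 ... */
    }

    int main(void)
    {
        for (int i = 0; i < CPUS; i++)
            l1[i] = l15[i] = READ_ONLY;   /* everyone shares the line */
        store(0);
        printf("cpu1 L1 after cpu0's store: %s\n",
               l1[1] == INVALID ? "invalid" : "still shared");
        return 0;
    }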

L1 and L1.5

The L1 is a split cache design composed of a 4-way associative 64-KB I-cache and an 8-way associative 128-KB D-cache. The size of the L1 cache has been reduced in comparison to the z9 L1 cache in order to limit the access time to within the short cycle time of the higher-frequency processor. The z10 L1.5 cache, which is new, is an intermediate level of unified I- and D-cache, private to the processor. It is situated between the L1 cache and the shared L2 cache. Its data content is inclusive of the data held in the L1 I-cache and D-cache. This design allows the SCE function to manage coherency directly with the L1.5, and the L1.5 then further manages coherency with the split L1 caches. The 3-MB L1.5 is 12-way set-associative and is logically organized into two address-based slices, with each slice communicating to one SC chip on the PU book.

SCE and L2

Each SC chip contains a coherently managed dual-pipelined 24-MB L2 cache, interconnections of up to five quad-core CP chips, and interconnections of up to three additional PU books, producing the system PU book building block shown in Figure 1. The SCE shared L2 cache is used for caching data requested by processors and, to a lesser extent, by I/O devices. On direct memory access (DMA) write operations in which the storing length does not align with memory operation lengths, the data is brought into the L2 to be merged with the partial write. It is then left in the L2 without a forced write-back to memory.

The L2 cache maintains an inclusive rule with the lower-level L1 and L1.5 caches that reside within the same PU book; this means that the L2 contains a copy of every modified line held in the L1 and L1.5 caches as well as a copy of all of the shared cached lines.

With precise ownership tracking of data residing in the local L1.5 caches, the L2 will broadcast cross-interrogations to the L1.5 and L1 only when the requested data exists there. A cross-interrogation is a protocol between the L2 and L1.5 for managing cache coherency. If the address of the requested data misses in the L2, the L2 simply blocks the transmission of the cross-interrogation, knowing that the data cannot possibly exist in the L1 or L1.5. This particular aspect of an inclusive cache policy offers a performance benefit. For example, when a cross-interrogation is sent to one or more L1s or L1.5s, it consumes pipeline bandwidth to first search for the location of the cross-interrogation address. A second pipe pass is then made to perform the necessary cache management action of either invalidating the address from the cache or demoting the ownership status from exclusive to read-only, depending on the original request type that generated the cross-interrogation. These necessary pipe passes for processing cross-interrogations can potentially interfere with cache-hit accesses for data that is needed by the processor for execution or instruction decoding. Without an inclusive cache management policy, the L2 would not know what data exists in the L1 or L1.5 caches and would, therefore, be required to broadcast cross-interrogations generated by exclusive-type accesses to every processor and L1.5 in the system, except for the initiator of the exclusive-type access. In a large symmetric multiprocessor (SMP) system, such as the z10 80-processor design with a large number of I/O attachments, the cross-interrogation rate can overrun the processor and L1.5 pipelines and reduce the processor cache subsystem hardware performance.

The L2 cache employs a store-in policy, which means that processor memory updates are kept in the L2 and are not immediately stored through to main memory until the cache slot containing the modified data needs to be evicted in order to make room for some other data being accessed by a processor or I/O. This store to main memory from a cache eviction is called a cast-out of aged data and is triggered by a least recently used (LRU) replacement event. A feature of the z10 design is that it permits LRU cast-outs to be stored in the L2 cache of the target memory PU book if a slot is available. This feature is covered in more detail in the section "SCE cache management and protocol."

Organizationally, the combined 48 MB of shared L2 on the pair of SC chips on a PU book is 24-way set-associative and has four address-based processing pipes for parallel processing of processor and I/O DMA operations. Memory addresses are mapped on contiguous 256-B addresses across these four L2 pipes so that no cache coherency management is necessary between the pipes.
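
A short sketch of the two organizational points just described: inclusivity lets the L2 suppress cross-interrogations entirely on an L2 miss, and contiguous 256-B lines rotate across the four address-based pipes. The directory fields, the per-L1.5 owner bits, and the choice of the two address bits just above the line offset for the pipe index are illustrative assumptions; the paper states only that the interleave is on contiguous 256-B addresses.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SHIFT 8                 /* 256-B cache lines            */
    #define NUM_PIPES  4                 /* four address-based L2 pipes  */

    /* Contiguous 256-B lines rotate across the four pipes, so no two
     * pipes ever hold the same address and no coherency management is
     * needed between them.  (Which address bits select the pipe is an
     * assumption.) */
    unsigned l2_pipe(uint64_t addr)
    {
        return (unsigned)((addr >> LINE_SHIFT) & (NUM_PIPES - 1));
    }

    /* Minimal L2 directory entry: because the L2 is inclusive of every
     * L1 and L1.5 below it, an L2 miss proves the line is in no lower
     * cache. */
    struct l2_entry {
        bool     valid;
        uint32_t l15_owners;             /* one bit per attached L1.5    */
    };

    /* Mask of L1.5s that must be cross-interrogated.  An L2 miss, or a
     * line owned by nobody else, generates no XI traffic at all, which
     * spares the lower-level pipelines the two passes described above. */
    uint32_t xi_targets(const struct l2_entry *e, unsigned requester)
    {
        if (e == NULL || !e->valid)
            return 0;                    /* inclusivity: cannot be below */
        return e->l15_owners & ~(1u << requester);
    }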


Table 2  Comparison of z9 and z10 access times, in processor cycles and nanoseconds. (DNA: does not apply.)

Access                       | z9 dual-ring topology (cycles, ns) | z10 fully connected topology (cycles, ns)
L1.5                         | DNA                                | 13, 2.9
On-PU-book L2                | 31.5, 18.5                         | 88, 20
Off-PU-book L2, adjacent     | 97.5, 57.3                         | 229, 52
Off-PU-book L2, diagonal     | 153.5, 90.3                        | DNA
On-PU-book memory            | 199.5, 117.5                       | 591, 134.3
Off-PU-book memory, adjacent | 261.5, 153.8                       | 712, 161.8
Off-PU-book memory, diagonal | 319.5, 187.9                       | DNA

Main memory

The main memory resides in the memory subsystem and holds data that is addressable by both processors and I/O devices. The z10 memory subsystem is physically made up of 48 pluggable DIMMs within each PU book. With four PU books and an aggregate of 192 DIMMs, the design supports a maximum physical address space of 1.5 TB when 8-GB DIMMs are used.

Each CP chip contains a memory controller that interfaces directly to four parallel channels of up to a three-deep cascade (daisy chaining) of proprietary pluggable buffered DIMMs. At the MCM level, only four of the five mounted CP chips have physical connections to memory channels. The z10 memory subsystem, specifically the proprietary buffered DIMMs, is the same as the one employed on the POWER6 microprocessor memory subsystem [8].

Speed and feeds

To get a relative sense of the z10 capabilities, a comparison with z9 is shown in Table 1 and Table 2. From the bandwidth comparison in Table 1, the z10 shows a general improvement over z9 in areas where headroom is needed to support the larger z10 SMP size. The z10 CP–SC interface shows a reduction in bandwidth capacity that results from the inclusion of the L1.5 cache level on the CP chip, which reduces the number of misses leaving the chip.

Table 2 highlights the challenges facing processor cache subsystem designers. From the table, note that the access time of the first level of off-chip cache (both the z9 and the z10 L2 cache) has grown significantly in processor cycles because of the high-frequency processor design, but the times are very similar when the processor cycles are converted into nanoseconds.
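
As a quick cross-check, the table entries are mutually consistent if the bus speeds are read as per-cycle transfer rates and the cycle counts are divided by the processor clock (the 1.7-GHz z9 clock is implied by the z9 columns):

    16 B/cycle × 2.933 GHz ≈ 47 GB/s    (z10 CP data into the SC chip)
    48 B/cycle × 1.466 GHz ≈ 70.3 GB/s  (z10 fabric bandwidth into a PU book)
    88 cycles ÷ 4.4 GHz = 20 ns         (z10 on-PU-book L2 access)
    31.5 cycles ÷ 1.7 GHz ≈ 18.5 ns     (z9 on-PU-book L2 access)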

Figure 3: Connectivity from the multichip module to the I/O and memory subsystems. The five CP chips (CP0–CP4) and two SC chips (SC0, SC1) on the multichip module connect through eight I/O hubs (I/O hub 0–7) over eSTI and IB links to the I/O subsystem, and through DIMMs to the memory subsystem. (eSTI: enhanced self-timed interface; IB: InfiniBand**; CP: central processor; SC: system controller.)

Furthermore, the number of processor cycles to access a remote PU book shared L2 cache has also grown significantly, but because of the z10 fully connected topology, the access time in nanoseconds is actually less. The z10 memory access time did not improve over the z9 memory access time, primarily for two reasons: First, the memory subsystem has to be operated within the deliverable power and thermal constraints when a large number of DIMMs are installed, and second, the change in packaging of the memory subsystem structure from memory PU book to pluggable DIMMs necessitated (for the first time on a System z server) the serial cascading of the DIMMs. This imposed an access time penalty on operations beyond the first DIMM in the daisy chain. The benefit of a pluggable DIMM design, however, is that it provides finer granularity of installed memory capacity than is available with a memory PU book, which, for cost reasons, had a limited number of preconfigured sizes.

Nevertheless, the memory access time of a fully configured System z10 server compares favorably with access times of up to 440 ns in Sun Fire E25K systems [6] and 395 ns in HP Integrity Superdome servers [7]. From a memory performance standpoint, the z10 server maintains a strong advantage over offerings from other high-end server vendors.


I/O controller

The z10 I/O controller is built on the architectural foundation of the z9 I/O controller and is compatible with the same I/O hub chips used by POWER6 processor-based systems, to facilitate component sharing and connectivity with industry-standard I/O interfaces. Support is maintained for traditional System z I/O and for InfiniBand and PCI Express** (PCIe**). The z9 I/O controller base design was modified for z10 to relocate the function to the CP chip in order to fit the z10 physical package structure and to add capabilities. The additional capabilities include support for the current-generation I/O hub chip connected via a more advanced, higher-speed interface and a modified and extended instruction set that supports new functions, including interrupt vectors in memory that were previously held in the I/O hub chip. Further extensions include compliance with command ordering requirements of industry-standard I/O and increased performance.

The z10 I/O controller supports two interfaces per CP chip, each connecting to one of the supported I/O hub chip types, in any combination. Each interface may operate at half the processor frequency to support the current-generation I/O hub chip or at lower ratios of the processor frequency to support previous-generation I/O hub chips. To support reliability and concurrent maintenance and upgrades, each interface may be stopped, started, and reconfigured concurrently with normal system operation.

Figure 3 shows the connectivity from the MCM (housing the five CP chips and two SC chips) to the I/O subsystem and to the memory subsystem. Up to eight links come out of the MCM to connect to various I/O hub chips. The CP chip-to-I/O hub interface is a proprietary protocol supporting several functions over a single interface. The primary function is high-throughput DMA to memory from the I/O hub chips, including two read sizes and variable write sizes with byte granularity and individual bit set and reset capability used to update interrupt vectors. Special locking commands are also supported. The I/O hub interface allows all I/O devices to participate in the SMP memory coherency and supports all standards, including store protect keys. The I/O DMA interface handles eight reads and eight writes simultaneously for high throughput, maintaining order as required by bus protocol rules for parallel commands.

The SCE function provides dedicated state machines to handle I/O requests on behalf of the I/O controller. In the write direction, the data buffers are a shared resource pool and may be combined or assigned individually to efficiently handle both 128-B and 256-B line sizes. In the read direction, there are shared hardware data-transfer buffers in the I/O controller and additional data buffers in the L2 cache to maximize throughput with a minimum of hardware buffers. These buffers provide temporary storage for data in transit. The I/O hub interface also supports an outbound load/store protocol (LSP) that is primarily used for passing command and control information. Following initiation by the LSP, large block data moves are driven by the I/O hub chip, utilizing the higher-bandwidth parallel-processing capability of the DMA commands. In addition, the I/O hub interface uses ECC for reliability and multiple error-reporting mechanisms to ensure data integrity.

The backside of the I/O hub chips supports industry-standard and other IBM-proprietary protocols [10]. Some of the supported I/O hub chips and the MBA (memory bus adapter) I/O hub were carried over from the previous-generation z9 system. The I/O hub chip converts the bus interface from the processor subsystem to a modified InfiniBand 12X double-data-rate (DDR) standard. This InfiniBand interface may, in turn, connect to an InfiniBand MBA switch chip for I/O expansion and directly to IBM Parallel Sysplex* via a new Parallel Sysplex using InfiniBand (PSIFB) protocol [10]. The MBA hub chip connects the bus interface from the processor subsystem to an enhanced self-timed interface (eSTI) [11], which is used for sysplex coupling to prior-generation systems with the proprietary link protocols used previously.
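
As an illustration of the shared write-buffer pool just described, the sketch below assumes a pool of 128-B buffers in which a 256-B line claims two adjacent buffers and a 128-B line claims one; the pool size, the adjacency requirement, and the allocation policy are assumptions made for illustration, not the hardware design.

    #include <stdbool.h>
    #include <stddef.h>

    #define POOL_BUFFERS 16              /* assumed pool size, 128 B each */

    static bool in_use[POOL_BUFFERS];

    /* Claim buffers for one inbound DMA write: one 128-B buffer, or two
     * combined for a full 256-B line.  Returns the first index, or -1 if
     * the request must wait for the pool. */
    int alloc_write_buffers(size_t line_bytes)
    {
        int need = (line_bytes > 128) ? 2 : 1;
        for (int i = 0; i + need <= POOL_BUFFERS; i++) {
            bool available = !in_use[i] && (need == 1 || !in_use[i + 1]);
            if (available) {
                in_use[i] = true;
                if (need == 2)
                    in_use[i + 1] = true;
                return i;
            }
        }
        return -1;
    }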

SCE cache management and protocol

The z10 cache management scheme builds on prior System z mainframe cache management algorithms derived from the MOESI (modified, owned, exclusive, shared, invalid) protocol [12] while introducing a number of innovative states designed to hide system latency and improve system performance. From prior System z cache management algorithms, the intervention master (IM) and multicopy (MC) cache directory states have been maintained, along with the target memory PU book and memory master (MM) concepts. Within the z10 design are several new concepts, including the local change (LC), I/O reservation, and I/O lock directory states, as well as a subset cache-line ownership algorithm and a number of new software synergy instructions. All of these concepts help to increase system performance, reduce visible system latency, and reduce the amount of system hardware dedicated to special operation and request handling.

Table 3 presents the complete list of cache ownership states in the z10 L2 cache. While this table covers the ownership states, it does not fully address the subset processor cache-line ownership algorithm, introduced in the z10 system and described in more detail below.


Table 3  L2 directory ownership tag states.

- IM = 0, MC = 1, read-only and unowned by any processor
- IM = 0, MC = 1, read-only and owned by one or more processors
- IM = 1, MC = 0, read-only and owned by one or more processors
- IM = 1, MC = 0, read-only and owned by one or more processors, changed
- IM = 1, MC = 0, exclusive for an I/O lock
- IM = 1, MC = 0, exclusive and unowned by a processor
- IM = 1, MC = 0, exclusive and unowned by a processor, changed
- IM = 1, MC = 0, exclusive, owned by a processor
- IM = 1, MC = 0, exclusive, owned by a processor, changed
- IM = 1, MC = 0, exclusive, owned by a processor, locally changed
- IM = 1, MC = 1, read-only, owned by one or more processors
- IM = 1, MC = 1, read-only, owned by one or more processors, changed
- Invalid
- Invalid and deleted
- Invalid and reserved for I/O
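
One compact way to picture the Table 3 states is as a small set of tag bits plus an ownership vector, as in the sketch below; the field names and layout are illustrative assumptions, not the actual z10 directory encoding.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative encoding of the Table 3 ownership tags. */
    struct l2_tag {
        bool     valid;          /* clear for the invalid/deleted entries   */
        bool     io_reserved;    /* invalid and reserved for I/O            */
        bool     im;             /* intervention master                     */
        bool     mc;             /* multicopy: read-only copies in >1 book  */
        bool     exclusive;      /* held exclusive                          */
        bool     io_lock;        /* exclusive for an I/O lock               */
        bool     changed;        /* modified with respect to memory         */
        bool     local_change;   /* LC: changed during current owner tenure */
        uint32_t owners;         /* subset processor-ownership bits         */
    };

    /* Example from Table 3: "IM = 1, MC = 0, exclusive, owned by a
     * processor, locally changed" (owner bit 0 chosen arbitrarily). */
    static const struct l2_tag example_state = {
        .valid = true, .im = true, .mc = false,
        .exclusive = true, .changed = true, .local_change = true,
        .owners = 0x1,
    };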

As with the prior-generation System z machines, the IM directory state indicates that the associated line address in a given L2 cache is the most recently cached data copy in the system and that the PU book containing the data copy holds the system coherency arbitration point for that address. In cases in which conflict arises over the address of the data copy, this directory state is used to expedite the conflict resolution and the return of data to a requesting processor. In cases in which the most recently cached data copy resulted from a read-only type fetch that was sourced from the L2 cache of another PU book within the system, the MC cache directory state is activated. This indicates that a read-only copy of the data exists within multiple PU books in the system. If the line is subsequently fetched in an exclusive state, the MC bit provides an early indication that remote copies of the data may need to be invalidated.

Just as with prior machine cache management algorithms, if the cache directory ownership state for a given line address is invalid within the L2 cache, the associated data does not exist within the L2 cache. Also, in order for an address to be valid, it must be either IM = 1 or MC = 1 within the L2 cache directory.

All lines that are held exclusive must be IM = 1 and MC = 0, and all lines that are IM = 0 are read-only and unchanged, by definition.

The new concept of subset processor line ownership has been introduced to the z10 L2 directory cache management scheme. Whereas prior System z L2 cache management algorithms would indicate line ownership by either a single processor or all locally attached processors (either/or), the z10 directory cache management scheme maintains subset processor ownership states. These subset states track the ownership of a given line independently on a processor basis and a CP chip basis, such that a line that was initially owned by processor 0 on CP chip 0 and then fetched by processor 1 on CP chip 1 in a read-only state would be recorded as owned by processors 0/1 on CP chips 0/1 in the L2 directory. In this case, there is an overindication of the processor line ownership state as a result of multiple disparate processors on separate chips owning the same line, but the net gain from this algorithm is a reduction in the volume of system cross-interrogation activity during line invalidations when compared with the one-or-all ownership indication states of prior designs.

Also new to the System z10 mainframe design is the LC cache directory state concept. The LC directory state is a hardware mechanism designed to speculatively expedite the return of line exclusivity to a requesting processor on operand conditional-exclusive fetches. During the z10 development cycle, performance and software behavior analysis found that when an operand fetch encounters a line that was previously modified by a processor that still owns the line, there is an extremely high probability that if the line were returned in a read-only state, the processor would subsequently fetch the line exclusive for modification. The LC directory state tracks line modifications during the line ownership tenure of a given processor, enabling subsequent operand fetches by another processor from anywhere in the system to detect this modification and return the line in an exclusive state to the subsequently requesting processor, whereas in prior designs the line would be returned in a read-only state.

Furthermore, contained within the z10 cache directory states and cache management algorithms are the I/O reservation and I/O partial store lock register directory states, which were introduced to reduce the amount of system hardware dedicated to processing DMA operations. The I/O reservation directory state allows a subset of the cache compartments within the L2 cache to be reserved by Licensed Internal Code (LIC) [13] during the machine bring-up process for use as special noncoherent system buffers for DMA read operations and also as response register stacks for interprocessor and I/O communications. The I/O lock directory state enables a cache entry to remain in a nonaccessible locked state while a DMA write sequence proceeds to do a read-modify-write atomic operation on the address.
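
The subset-ownership idea reduces to one ownership bit per processor plus one per CP chip, so an invalidation is broadcast only to chips whose bit is set. The vector widths, the four-cores-per-chip mapping, and the helper names below are an illustrative sketch of that bookkeeping, not the z10 directory format.

    #include <stdint.h>

    #define CORES_PER_CP 4               /* quad-core CP chips            */

    struct subset_owner {
        uint32_t cpu_bits;               /* bit per processor on the book */
        uint8_t  chip_bits;              /* bit per CP chip on the book   */
    };

    /* Record a fetch by a processor.  The chip-level bits can become an
     * over-indication when different processors on different chips have
     * fetched the same line, as the text notes. */
    void note_fetch(struct subset_owner *o, unsigned cpu)
    {
        o->cpu_bits  |= 1u << cpu;
        o->chip_bits |= (uint8_t)(1u << (cpu / CORES_PER_CP));
    }

    /* On an exclusive fetch by `requester`, cross-interrogations go only
     * to CP chips whose bit is set, instead of to every chip. */
    uint8_t xi_chip_mask(const struct subset_owner *o, unsigned requester)
    {
        return o->chip_bits & (uint8_t)~(1u << (requester / CORES_PER_CP));
    }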

Both of these special directory states eliminate the need to keep system resources active to lock the address, reduce the amount of specialized hardware dedicated to DMA operations processing, and provide the ability to have more DMA operations in flight at a given time. In addition, these specialized hardware cache management algorithms allow the number of reserved cache positions to be scaled with the number of configured I/O ports within the system while otherwise leaving the unused regions as normal cache compartments for caching processor requests.

In addition to the updates in the cache directory states, the z10 L2 cache design supports a number of new system operations that enable better hardware and software synergy. These new operations include the following processor commands: Demote to Read-Only/Release for Store, SW Untouch/L1.5 Exclusive LRU, and HW/SW PreFetch.

The Demote to Read-Only system operation enables a given processor and instruction stream to demote ownership of a line from an exclusive state to a read-only state upon completion of its request stream; a subsequent request to the L2 cache will then incur only the latency of an L2 cache access, without the latency penalties from system cross-interrogation activity. In addition, the z10 processor-private L1.5 caches use this same feature to demote the L2 cache ownership states of exclusive lines targeted for LRU replacement in the L1.5, in order to expedite subsequent processor request response times to these target addresses.

The SW Untouch instruction takes the Demote to Read-Only instruction a step further in that it allows a given processor and instruction stream to completely relinquish ownership of a line upon completing the processing of a given address, so that a subsequent request to the L2 cache does not have to incur any line intervention latency penalty: the processor previously owning the line has given up all ownership claims to the address. Furthermore, because ownership has been relinquished, a subsequent fetch to this address requires no system cross-interrogation activity, further reducing the amount of activity in the system and any potential queuing delays.

Finally, the HW/SW PreFetch instruction was introduced as a mechanism for software to specially mark lines targeted by speculative software prefetching so that the z10 processor can launch the request when system resources are available and otherwise terminate the event. The new command also enables better instrumentation and debugging of software behavior in order to isolate performance problems and provide potential solutions to software-related issues.
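
A sketch of the directory-side effect of the two hints; the tag fields mirror the illustrative encoding sketched earlier, and the treatment of the LC bit on a demote is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal slice of the illustrative tag used earlier. */
    struct tag_bits {
        bool     exclusive;
        bool     local_change;           /* LC: tenure-scoped change hint */
        uint32_t owners;                 /* subset processor-ownership    */
    };

    /* Demote to Read-Only / Release for Store: the line stays cached and
     * owned, but a later fetch no longer pays the cross-interrogation
     * latency needed to strip exclusivity from the previous owner. */
    void demote_to_read_only(struct tag_bits *t)
    {
        t->exclusive    = false;
        t->local_change = false;         /* assumed: LC ends with tenure  */
    }

    /* SW Untouch: the processor gives up every ownership claim, so a
     * later fetch sees an unowned line and triggers no XI activity. */
    void sw_untouch(struct tag_bits *t, unsigned cpu)
    {
        demote_to_read_only(t);
        t->owners &= ~(1u << cpu);
    }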


Fabric topology and multibook coherency

The z10 system comprises one to four PU books, each interconnected through a set of unidirectional point-to-point buses, commonly referred to as the system fabric, or fabric for short. There is one bus for each direction of communication from a given SC chip to each of the others, producing a fully connected topology in which any of the PU books within the system can be omitted on the basis of individual customer system capacity requirements.

Compared to the ring-based structures [3] of prior-generation System z designs, this fully connected topology produces a number of advantages. First, the point-to-point connectivity reduces the communication latency incurred between PU books. Second, it simplifies the multibook coherency protocol by eliminating a number of coherency window conditions that were unique to the intermediate, or pass-through, PU books within a ring-based structure. Third, the full connectivity eliminates the need to support a different fabric protocol for an open-ring topology (i.e., a temporary state to enable concurrent repair or service of PU books) that was, from a coherency management perspective, dramatically different from the closed-ring protocol. Finally, it eliminates the need for jumper cards to complete the system connectivity in partially configured systems.

Each unidirectional fabric bus in the z10 system is divided into two separate communication regions: one that carries address, command, and data-transfer requests, and another that is used for partial, combined, final, and data tag responses. In this system, when a processor request within a multibook system misses its local L1 and L1.5 caches, the request traverses the on-CP-chip interconnect and enters the SC chip. Upon entering the SC chip, the request polls the local L2 directory and, if it detects a cache miss or a directory hit in other than the desired ownership state, it initiates a sequence of events on the fabric to obtain the cache line from a remote cache or from system main memory, as applicable.

Figure 4 shows the fabric protocol and bus transaction sequence for a remote L2 data intervention. First, the L2 on the local PU book SC chip initiates a fabric address broadcast, which transmits the request command and address on which all of the remote PU books in the system will be snooping. Upon receiving these requests, the remote PU books poll their respective L2 cache directories in parallel in order to determine the state of the target address in their local directories. Upon completing this polling, each remote PU book sends back a partial response to the requesting PU book, which simply communicates both the state of the line in the remote cache and whether any cross-interrogation behavior was performed on the remote PU book during the directory-polling pipe pass.
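
The latency argument for full connectivity reduces to hop counts between SC chips, as the small comparison below shows; the dual-ring model is an idealization of the z9 structure, used only to show why the diagonal book costs an extra crossing (through an intermediate pass-through book) on a ring but not on the z10 fabric.

    #include <stdio.h>

    #define BOOKS 4

    /* Fully connected fabric: every remote book is one bus crossing away. */
    int hops_fully_connected(int from, int to)
    {
        return (from == to) ? 0 : 1;
    }

    /* Idealized dual ring: traffic may travel in either direction, so the
     * diagonally opposite book still needs two crossings through a
     * pass-through book. */
    int hops_dual_ring(int from, int to)
    {
        int d = (to - from + BOOKS) % BOOKS;
        return (d <= BOOKS - d) ? d : BOOKS - d;
    }

    int main(void)
    {
        for (int b = 1; b < BOOKS; b++)
            printf("book 0 -> book %d: ring %d hop(s), fully connected %d\n",
                   b, hops_dual_ring(0, b), hops_fully_connected(0, b));
        return 0;
    }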

Figure 4: The fabric protocol and bus transaction sequence for a remote L2 data intervention among PU books B0–B3. The stages shown are the processor request, the address broadcast to the remote books, the partial responses, the combined response, the data transfer, the final responses, and the response to the requesting processor.

A complete list of the partial responses is provided in Table 4. If a remote PU book detects the line in a high directory state, IM, then upon completing a short sequence of coherency management and line address protection establishment (activating the IM pending compare), it can source the data back to the requesting PU book through a data response. In cases in which no IM copy of the data exists within the system, the data must be accessed from the system main memory on the PU book responsible for the line address, during which time the line address is protected by the MM pending compare. In the case in which the incoming request detects a compare against another requester that is currently accessing the line address, the applicable reject partial response is sent back to the requesting PU book.

The data response requires synchronized use of the two communication regions on the fabric: the data tag response region, which provides routing and status information for the data that will be sent, and the data transfer region, which carries the 256 bytes of data sent over a number of subsequent fabric bus cycles. The segmentation of the regions also enables the z10 design to dynamically detect and recover from unexpected fabric errors in cases in which a data transfer is under way on the fabric bus.

P. MAK ET AL.

2:9

Table 4  List of fabric responses and the response ordering rule. (IM: intervention master; XI: cross-interrogation; MM: memory master.)

Order | Response
1     | IM Hit and XI Sent
2     | IM Hit
3     | IM Reject
4     | MM Reject
5     | Read-Only Hit and XI Sent
6     | Read-Only Hit
7     | Local IM Hit
8     | Miss
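
The combined response can be modeled as a priority select over the partial responses using the Table 4 ranking; the enum values and the merge loop below are an illustrative model of the ordering rule, not the bus encoding.

    /* Table 4 ordering, highest priority first (smaller value wins). */
    enum partial_resp {
        IM_HIT_XI_SENT = 1,
        IM_HIT,
        IM_REJECT,
        MM_REJECT,
        RO_HIT_XI_SENT,
        RO_HIT,
        LOCAL_IM_HIT,
        MISS
    };

    /* Merge the partial responses collected from the remote PU books into
     * the single combined response that is broadcast back to them. */
    enum partial_resp combine(const enum partial_resp *partials, int n)
    {
        enum partial_resp best = MISS;
        for (int i = 0; i < n; i++)
            if (partials[i] < best)
                best = partials[i];
        return best;
    }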

On the requesting PU book, upon receiving the partial responses from all of the remote PU books, the partial responses are checked for consistency before being merged into a combined response on the basis of specific coherency ordering rules. The combined response is then sent back to all of the remote PU books. A complete list of the partial and combined responses with their respective ordering rule is provided in Table 4. The order in which the partial responses are merged to form the combined response is based on the ranking in Table 4, with the IM Hit and XI Sent partial response being the highest-ordered response. The combined response communicates the request status in accessing the given line from a system-coherent perspective and provides all of the remote PU books with a direction to proceed in completing their work on behalf of the requesting PU book.

On obtaining a regular (non-reject) coherency response, the requesters within the system continue processing to ensure that the remote PU book L2 directory states are consistent with the final requesting PU book L2 directory state. In the case of a reject combined response, the requesters within the remote PU books terminate processing, leaving their local directory states unchanged. Upon completing their respective processing, each remote PU book then sends a final response to the requesting PU book. This final response serves several purposes: first, it informs the requesting PU book that the appropriate line coherency actions were successfully completed on each remote PU book, if applicable, and second, it indicates that the remote PU book resources have been reset, which prevents resource overrun issues. In the case in which the combined response was a reject, the requesting PU book restarts the system-coherency polling with a new address broadcast and proceeds with the steps that follow until it obtains a non-reject coherency response.

Depending on the state of the line within the z10 cache subsystem and the coherency actions that occur, the processor fetch request can be satisfied at the time of the data response or after any or all of the final responses are received. This is possible in the z10 design as a result of the cross-interrogate behavior information that is communicated along with the remote directory state information in the partial response packet. At the time of this response, the requesting PU book is informed of the state of the line in each L2 directory and, in certain cases, of the successful completion of remote cross-interrogate actions, upon which the local L2 can return an early response to the requesting processor.

Embedded within this multibook coherency protocol is a system interlock, called fairness valid, that was designed to ensure fairness among processor requests in different PU books during periods of high system contention. The fairness valid interlock acts much like a normal address compare within the system, but it prevents new processor requests from being processed within the SC chip only when other requests exist within the system that have been rejected on the fabric. This interlock allows all L2 requests currently engaged within the system to complete processing before the next set of processor requests can be initiated, producing a pseudo ordering of requests within the system. In unison with the processor request linked lists that are maintained within each PU book, the fairness valid interlock maintains a timely ordering of all requests within the system. This is especially pertinent when software lock registers and blocks or dispatch queues are taken into consideration, as they tend to be highly contended lines within the system.

Another unique feature of the z10 fabric protocol is the ability of the SCE function to place lines evicted from one L2 (due to age-out replacement selection on a fetch request missing the local L2 cache) into an empty slot within a remote L2 cache, provided that the home memory PU book for the line or address being evicted is that same remote PU book. This increases the line tenure within the shared L2 caches before the line is written back to memory and increases the probability of finding a line within a remote PU book L2 cache instead of having to access memory, thus reducing the latency penalty that would otherwise be incurred.
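
The remote cast-out feature amounts to a placement test on eviction, sketched below; the address-to-home-book mapping and the free-slot check are assumptions, with only the home-memory-book condition taken from the text.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical mapping from a line address to its home memory PU
     * book; the real memory interleave is not described in the paper. */
    int home_memory_book(uint64_t addr)
    {
        return (int)((addr >> 28) & 0x3);
    }

    /* On an LRU eviction from this book's L2, keep the line cached in the
     * remote L2 of its home memory book when that book has a free
     * compartment; otherwise write it back to memory.  Longer L2 tenure
     * means a later requester may find the line in a cache instead of
     * paying the memory latency. */
    bool castout_to_remote_l2(int this_book, uint64_t evicted_addr,
                              bool remote_slot_free)
    {
        int home = home_memory_book(evicted_addr);
        return home != this_book && remote_slot_free;
    }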

Summary

In support of a high-frequency processor design, the processor cache subsystem evolved from the z9 design in several directions in order to achieve the performance goal set during early planning of the z10 design definition. Some of the important new features include improving the cache hierarchy with the addition of the L1.5 cache and the growth in capacity of the PU-book-level shared L2, improving cache management for more precise tracking and eviction of inactive exclusively owned lines from the L1.5 and L1 caches, adding software hints to prefetch data into the processor cache, and redesigning the processor subsystem topology to reduce transit times over the fabric buses. These new features all relate directly to the key design objectives of overcoming the latency limitations of the physical packages and effectively containing, or even improving, the average access times as measured in processor cycles. As big a challenge as it was to deal with the latency scalability problem on the z10 processor cache subsystem, it will be an even greater challenge for succeeding designs.

Acknowledgments

The success of the z10 processor cache subsystem development would not have been possible without the tireless dedication, drive, ingenuity, and professionalism of the engineers and management involved. In particular, we give special recognition to Michael A. Blake, Bing-lun Chu, Michael F. Fee, Rebecca M. Gott, Frank Malgioglio, David L. Rude, William J. Scarpero, Jr., Vern A. Victoria, and Rocco Crea for their leadership in guiding the development effort through four years of technical challenges.

References

1. C. Webb, "IBM z6—The Next-Generation Mainframe Microprocessor," IBM Corporation, 2007; see http://www2.hursley.ibm.com/decimal/IBM-z6-mainframe-microprocessor-Webb.pdf.
2. IBM Corporation, Large Systems Performance Reference, Document Number SC-28-1187-12, February 2008; see http://www-03.ibm.com/servers/eserver/zseries/lspr/pdf/SC28118712.pdf.
3. P. Mak, G. E. Strait, M. A. Blake, K. W. Kark, V. K. Papazova, A. E. Seigler, G. A. Van Huben, L. Wang, and G. C. Wellwood, "Processor Subsystem Interconnect Architecture for a Large Symmetric Multiprocessing System," IBM J. Res. & Dev. 48, No. 3/4, 323–337 (2004).
4. P. R. Turgeon, P. Mak, M. A. Blake, M. F. Fee, C. B. Ford III, P. J. Meaney, R. Seigler, and W. W. Shen, "The S/390 G5/G6 Binodal Cache," IBM J. Res. & Dev. 43, No. 5/6, 661–670 (1999).
5. P. Mak, M. A. Blake, C. C. Jones, G. E. Strait, and P. R. Turgeon, "Shared-Cache Clusters in a System with a Fully Shared Memory," IBM J. Res. & Dev. 41, No. 4/5, 429–448 (1997).
6. Sun Microsystems, Sun Fire E25K/E20K Systems Overview, Document 817-4136-13, 2006; see http://docs.sun.com/source/817-4136-13/.
7. Hewlett-Packard Development Company, L.P., Meet the HP Integrity Superdome Server with the HP Super-Scalable Processor Chipset sx2000; see http://h71028.www7.hp.com/ERC/downloads/5982-9836EN.pdf.
8. H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden, "IBM POWER6 Microarchitecture," IBM J. Res. & Dev. 51, No. 6, 639–662 (2007).
9. C.-L. K. Shum, F. Busaba, S. Dao-Trong, G. Gerwig, C. Jacobi, T. Koehler, E. Pfeffer, B. R. Prasky, J. G. Rell, and A. Tsai, "Design and Microarchitecture of the IBM System z10 Microprocessor," IBM J. Res. & Dev. 53, No. 1, Paper 1:1–12 (2009, this issue).
10. E. W. Chencinski, M. A. Check, C. DeCusatis, H. Deng, M. Grassi, T. A. Gregg, M. M. Helms, et al., "IBM System z10 I/O Subsystem," IBM J. Res. & Dev. 53, No. 1, Paper 6:1–13 (2009, this issue).
11. E. W. Chencinski, M. J. Becht, T. E. Bubb, C. G. Burwick, J. Haess, M. M. Helms, J. M. Hoke, et al., "The Structure of Chips and Links Comprising the IBM eServer z990 I/O Subsystem," IBM J. Res. & Dev. 48, No. 3/4, 449–459 (2004).
12. H. G. Cragon, Memory Systems and Pipelined Processors, Jones and Bartlett Publishers, Inc., Sudbury, MA, 1996; ISBN 0867204745.
13. IBM Corporation, Licensed Internal Code and License; see http://www-304.ibm.com/systems/support/machine_warranties/licensed_internal_code.html.

Received March 4, 2008; accepted for publication June 18, 2008

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Sun Microsystems, Inc., Hewlett-Packard Development Company, L.P., InfiniBand Trade Association, and PCI-SIG in the United States, other countries, or both.


Pak-kin Mak IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New York 12601 ([email protected]). Mr. Mak is the processor cache subsystem architect on the System z Processor Subsystem Development team. He received his B.S. degree in electrical engineering from Polytechnic University of New York and his M.B.A. degree from Union College. He joined IBM in 1981, working on his first cache design in the IBM ES/3090* load/store unit, and has since been involved with high-performing cache hierarchy and symmetric multiprocessing designs. He is currently responsible for developing the processor cache subsystem for the z10 successor.

Craig R. Walters IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New York 12601 ([email protected]). Mr. Walters is an Advisory Engineer. He received his B.S. degree in electrical engineering from the New Jersey Institute of Technology, and an M.B.A. degree from the State University of New York at New Paltz. He joined IBM in 1999, working as a logic designer on the z990, z9, and z10 SC designs. He has designed system controllers for the cache management and fabric coherency in the system controller element (SCE). In 2005, he joined the System z Processor Cache Subsystem Performance team, focusing on the z10 design and future systems. He holds two U.S. patents and has ten patents pending.

Gary E. Strait IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New York 12601 ([email protected]). Mr. Strait is a Senior Engineer in System z Hardware Development. He was the logic team leader for the z10 I/O controller. He joined IBM in 1980 after receiving both his B.S. and M.Eng. degrees in electrical engineering from Rensselaer Polytechnic Institute. He previously held design positions on the cache subsystem of the ES/3090, ES/9021 subsystems, and the I/O controller of the S/390* G4, G5, G6, zSeries* 900 and 990, and z9 systems. He has received four IBM formal awards, holds five U.S. patents, and has four patents pending.

