

Hierarchical, Adaptive Cache Consistency in a Page Server OODBMS

Markos Zaharioudakis and Michael J. Carey, Member, IEEE

Abstract—Due to its simplicity and communication efficiency, many client-server object-oriented database systems are based on the basic page server architecture—pages serve as their smallest unit of data transfer, client caching, and concurrency control. In an earlier paper, we showed how to extend this architecture to permit object-level callback locking, and we showed through simulations that significant performance gains can be expected. In the current paper, we report on our experiences from implementing this approach in the context of the SHORE system. Since SHORE supports multiple lock granularities (volume, file, page, object), we explain how our callback algorithm can be extended to support multigranularity locking. We also discuss some of the stickier issues that arose as we moved our algorithm out of the simulator and into SHORE, which supports a generalized peer-servers architecture. Finally, we present performance measurements that explore the tradeoffs between page-level and object-level concurrency control. Our measurements were obtained by running SHORE on an IBM SP2 shared-nothing parallel machine.

Index Terms—Databases, client-server systems, cache consistency.

—————————— ✦ ——————————

1 INTRODUCTION

Recent years have seen a great deal of research and development activity in the area of object-oriented database management systems (OODBMS). There are now a number of commercial offerings in this area, and these systems are beginning to gain acceptance for certain classes of commercial applications (e.g., CAD/CAM, CASE, and telecommunications). This generation of DBMSs is being deployed primarily in data-shipping, client-server environments—data items are shipped from servers to clients so that application and query processing can be performed at the client workstations. An important source of performance gains in such systems is the exploitation of the memory available at the clients; the objective is to minimize communication and server disk accesses by caching data items in client memories. The utilization of client memories is improved by inter-transaction caching [26], [4], [24], [11], where data may remain cached and accessible without server intervention, even across transaction boundaries. Inter-transaction caching allows multiple copies of data items to reside in different client caches. As a result, replica management is required, in addition to concurrency control, to ensure that all clients see a consistent (serializable) view of the database. The term cache consistency is often used to refer to concurrency control and replica management together. This paper presents a flexible and dynamic approach to cache consistency that aims to increase concurrency in client-server OODBMSs in an efficient manner.

1.1 Granularity Issues

Cache consistency can be maintained at any of several granularities, e.g., object, page, or file. In general, using a

• The authors are with the IBM Almaden Research Center, 650 Harry Road, K55/B1, San Jose, CA 95120. E-mail: {markos, carey}@almaden.ibm.com. Manuscript received 1 Apr. 1997. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 106306.

coarse granularity involves less overhead, but it raises the potential for conflicts due to false sharing, i.e., conflicts arising at a coarse granularity due to concurrent access to distinct, but co-located, finer-grained items. In [5], we investigated the trade-offs between page and object level cache consistency, and we showed that the choice of the appropriate granularity can have a significant impact on the performance of an OODBMS. However, in a distributed environment, even page level cache consistency can be too expensive when contention is very low. Moreover, there are operations, such as file scans, where file level cache consistency is the appropriate choice. There is, therefore, a need for approaches that allow the cache consistency granularity to be chosen on a per-operation or per-workload basis. Hierarchical locking [12] is a well-known technique that has been used to address this need in the context of centralized DBMSs. Hierarchical locking can be even more valuable when combined with an adaptive cache consistency approach that tries to use the coarsest possible granularity, dynamically switching to finer granularities only as dictated by the observed level of data contention [17], [14], [5]. Another caching-related system function for which a choice of granularity has to be made is the unit of data transfers between clients and servers. Most OODB systems take one of two basic approaches: the page server approach, where clients and servers interact using physical units of data (e.g., individual pages or groups of pages), and the object server approach, where the client-server interactions involve logical units of data (e.g., individual objects). Previous work on the trade-offs between page and object servers [8], [27] has shown page servers to be more robust performance-wise, mainly due to their communication efficiency. As a result, we will concentrate here on the page server approach.

1.2 Our Focus

As mentioned above, the focus of this paper is on the development of an efficient, flexible, and highly concurrent



cache consistency protocol for client-server and distributed OODB systems. The algorithm presented here attempts to meet its goals by combining inter-transaction caching with fine-grained (i.e., object) and hierarchical/adaptive locking in the context of a page server architecture; this combination is the first contribution of this paper. The basic ideas underlying this work were introduced in an earlier paper [5], which presented and simulated a simple version of the current algorithm. Thus, another contribution of this paper is a description of the key issues and problems encountered during an actual implementation in SHORE [6], a persistent object system under development at the University of Wisconsin. SHORE offers various new features; the one most relevant to our work is its architecture, which is based on the peer-servers model, a generalization of the traditional client-server model. Building our algorithm to comply with the peer-servers architecture of SHORE was a challenging aspect of our work. The final contribution of this paper is a performance study conducted by running SHORE on an IBM SP2 shared-nothing parallel machine. One goal of this study is to confirm the general predictions of our earlier simulation study. Another goal is to explore the trade-offs between page and object level cache consistency for the peer-servers architecture. The remainder of this paper is organized as follows: Section 2 surveys related work. Section 3 describes the peer-servers model and the basic system components in order to provide the context for our algorithm. Section 4 describes the design and implementation of our cache consistency algorithm. Section 5 describes the system configurations and workloads used in our performance study, and also presents and analyzes the results of this study. Finally, Section 6 summarizes our conclusions.

2 RELATED WORK

Our work in this paper extends our earlier work in [5], [27]. In those papers, we identified three caching-related functions where a granularity choice has to be made: 1) client-server data transfers, 2) concurrency control, and 3) replica management. We demonstrated that the choices of granularity for these functions are largely orthogonal and that, contrary to the common practice, one can choose different granularities for these functions. We described pure page server (PS) and pure object server (OS) approaches based on the callback locking [16], [24], [11] protocol. We then presented three new hybrid algorithms that support fine-grained cache consistency in the page server context. The first, PS-OO, does object-level locking and callbacks; the second, PS-OA, does object-level locking and adaptive callbacks; the third, PS-AA, is adaptive with respect to both locking and callbacks. We conducted a simulation-based performance study of these approaches. For the range of workloads examined, the proposed fully adaptive page server algorithm (PS-AA) was shown to provide very good performance, generally outperforming all the other page-server alternatives. PS-AA was often much better than the pure object server as well,

except in some cases involving very poor clustering of related objects in the database; in those cases, PS-AA performed poorly (compared to OS) because it shipped and cached whole pages that contained many objects that were useless with respect to the requesting client, and those objects took up cache space at the expense of other, more needed objects. Research closely related to ours is also being done in the context of the Thor project at MIT. The Thor group has proposed another hybrid protocol to support object-level cache consistency when the granularity of data transfers is coarser than individual objects [1]. Instead of locking, a backward validation optimistic scheme is proposed. Transactions can both read and update locally cached objects without server intervention. However, before a transaction commits, it must be “validated”; the server must make sure that the validating transaction has not read an old version of some object that was updated by another successfully committed (or validated) transaction. If validation fails, the validating transaction is aborted. The protocol described in [1] uses a form of callbacks to reduce the number and the cost of aborts. A successfully validated transaction sends callbacks to any clients that are caching objects updated by the transaction. These callbacks are asynchronous, so the transaction can commit without waiting for the callback replies. A client receiving a callback for an object checks whether the object has been accessed by the current local transaction (in Thor, a given client process can run only one transaction at a time). If so, the local transaction is aborted, which saves resources since the transaction would otherwise abort later. If the object has not been accessed, it is invalidated to prevent subsequent reads of the old cached version. To evaluate this optimistic algorithm, the authors of [1] compared it to our adaptive callback locking algorithm (PS-AA) via a brief simulation study. They concluded that their optimistic algorithm performs better in low contention cases; however, we believe that their results are inconclusive—only one experiment was presented, with only one workload and a single system and database configuration. Important related work on shared disk systems has been done at IBM Almaden [19]. That work proposed using strict two-phase locking on objects to ensure serializability, while using “physical” locks on pages to ensure cache consistency. The physical locks can either be released during a transaction or they can be held across transactions. The basic idea is to allow a given page to be held in shared mode simultaneously at multiple sites, but to ensure that only one site at a time can gain update access to the page. To update a page, a site is required to become the “owner” of that page; when a different site wishes to update the page, it must be sent the most recent copy of the page by the current owner. This approach, which is related to work on distributed shared memory [18], was designed to exploit the relatively inexpensive internode communication paths usually found in tightly-coupled data sharing architectures. Several alternative algorithms (differing in their crash recovery implications) were proposed, and the algorithms were ordered based on their designers’ performance expectations. No actual analysis of their performance was attempted, however.


Fig. 1. The peer-servers architecture.

Another dimension for considering hybrid techniques is in the management of client buffer space. A technique called Dual Buffering was proposed for this purpose and studied by Kemper and Kossmann [15]. Dual Buffering uses pages as the unit of transfer between clients and servers, but it manages both pages and individual objects in a client’s buffer pool. The approach used is to segment the buffer pool and to use replacement policies that are specific to each segment. Dual Buffering supports both page segments and object segments. Data pages are initially brought into the client cache as the result of object faults and placed in a page segment of the buffer pool. Later, individual objects can be copied out of their “home page” into an object segment. This copying allows the buffer space used by less worthy objects to be reclaimed by evicting their home page from the cache. Related work on hybrid client cache management has also been done in the context of the MIT Thor project by O’Toole and Shrira [22].

3 SHORE ARCHITECTURAL CONCEPTS

The purpose of this section is to describe the environment for which our cache consistency algorithm is intended. Section 3.1 describes the basic characteristics of the peer-servers model and talks about its advantages relative to the more traditional client-server model. At this level, the peer-servers model leaves a number of architectural issues open to the OODBMS implementor. Section 3.2 describes SHORE-specific decisions regarding some of these issues, with the emphasis being on process and communication structure. Finally, Section 3.3 discusses some additional, transaction-related system functions and describes SHORE’s current implementation of these functions.

3.1 The Big Picture

Fig. 1 illustrates the general structure of a data-shipping, peer-servers OODBMS. The system executes as a group of cooperating DBMS processes called peer servers. The database is distributed across a set of disk volumes, each of which

is owned and managed by a single peer server. Peer servers are responsible for preserving the integrity of their data and for enforcing transaction semantics. For performance, each peer server manages a local in-memory data cache, which may contain data both from its local volumes and from remote servers. Each application program interacts with the system by communicating directly with a single peer server: applications submit requests to their local peer server, which executes the request, sometimes by sending requests for transaction support or for specific data to remote servers. In contrast, the more traditional client-server model includes client DBMS processes that act as intermediaries between applications and servers; clients do not manage any data of their own, and servers do not interact with applications directly. The peer-servers architecture provides greater flexibility than the traditional client-server architecture. When acting as the owner of a page, a peer server performs the role of a traditional server in the client-server model. When acting as the local agent for an application’s remote data access, it plays the role of a client. Letting peer servers assume both roles allows data placement to be optimized according to the workload. In addition, the location-transparency provided by the peer-servers structure allows an application to access both local and remote data in an identical manner. Further, the design is scalable and parallelizable; it can run on a single processor, a cluster of workstations, or a large parallel system. Finally, each application can be a separate process, communicating with its local server via RPC or shared memory, or it can be linked with a peer server into a single process.

3.2 SHORE Process and Communication Structure

Fig. 2 shows the SHORE process and communication structure in more detail. Each application consists of a transaction (or a sequence of transactions) which is executed on the local peer server. Servers are multithreaded, and threads are assumed to be preemptive, at least potentially. On the server where it is initiated, a transaction is run by its master thread. The first time that it requests data


Fig. 2. SHORE process and communication structure.

owned by a remote server, the transaction spreads onto the remote server. A remote thread is allocated by the remote server for handling requests from the master thread. Intratransaction messages, such as data and lock requests, are sent directly from the master thread to the remote thread. The remote thread remains active until the end of the transaction. Additional threads are dedicated to other kinds of message traffic, such as various “top-level” actions (e.g., transaction spreading) and log record shipping. Compared to traditional client-server OODB architectures, the peer-servers architecture is somewhat more challenging as an implementation platform for caching. A fairly common assumption made by client-server systems is that client processes are single-threaded, each running at most one transaction at a time. Of course, a given machine may run multiple client processes, but each maintains its own private cache which is not shared with other client processes. In SHORE, a peer server (in its client role) is allowed to run many transactions concurrently, all sharing a common data cache. Another common assumption is that there is a single, order-preserving communication path between a client and a server. As shown in Fig. 2, our design allows multiple paths between peer servers. Message ordering is preserved along each path, but messages from server A to server B can arrive at B out of order if they are sent along different paths. In fact, even if a single communication path is used, although two messages that have different threads as their destination will arrive at server B in the order in which they were sent, they may ultimately be handled out of order since threads are assumed to be preemptive. The design of Fig. 2 offers two performance advantages over possible simpler alternatives. First, by sending messages directly to the final destination thread, it avoids the overhead of having a dispatcher thread in each peer server receive all messages and then forward each message to its destination thread. Second, it can fully exploit parallelism.

For example, each peer server may run on several processors of a shared memory parallel machine; in this case, threads can run in parallel and there can be multiple physical or TCP-like connections between two peer servers. The disadvantage of the design is, of course, its complexity. In particular, various race conditions can arise due to the looseness of message ordering. Such race conditions are discussed further in Section 4.2.

3.3 Other Transactional Issues

As shown in Fig. 1, a given database may be distributed across multiple peer servers. In such a case, distributed deadlock detection becomes an issue. At present, a simple solution based on lock-wait timeouts is used to resolve distributed deadlocks in SHORE. A two-phase commit protocol is used to commit transactions that have updated data from multiple servers, and each peer server uses the ARIES algorithm [20] to provide for transaction rollback and for recovery of the data that it owns. In describing our algorithm in this paper, we will talk as though the entire database is owned and managed by a single peer server; the rest of the peer servers will, therefore, act as (multithreaded) clients. Transactions are submitted at these “clients” and ask for data from the server. It should be noted that this assumption was not made during the implementation; its only purpose is to simplify the discussion that lies ahead by avoiding the need to continually distinguish between the server and client roles of a peer server. A redo-at-server scheme is currently used in SHORE for propagating updates from a client to the server. A client generates a log record when it updates a cached object. Log records are stored temporarily in a local log cache and are shipped back to the server during the two-phase commit process (or earlier, if a dirty page is replaced in the local cache). Thus, a client never sends copies of dirty objects or pages back to the server. When the server receives log records from a client, it


redoes the operations indicated by the log records in order to install the updates that they describe. When a transaction aborts, it deletes its log records from the log cache and purges from the local page cache any objects that it has updated; this is accomplished by marking the objects as “unavailable” (see Section 4). Any updates of the aborting transaction that have already been shipped to the server are undone by the server. This redo-at-server approach was chosen for the initial SHORE prototype because it was easy to implement and is less communication-intensive than other schemes that ship entire dirty objects or pages (and usually log records as well) back to the server. The primary disadvantage of this method is that, when redoing log records, the server must reread from the disk any pages that are not currently resident in its local cache [25] (an approach to alleviate this problem is proposed in [21]).
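To make the redo-at-server flow concrete, here is a minimal sketch of a client-side log cache whose records are shipped at commit time and replayed by the server; all type and function names are illustrative assumptions, not SHORE's actual interfaces.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical redo-at-server sketch: clients ship log records, not dirty pages.
struct LogRecord {
    uint32_t pageId;
    uint32_t objectId;
    std::string afterImage;   // redo information only; the client keeps no undo data
};

struct ClientLogCache {
    std::vector<LogRecord> records;

    void logUpdate(uint32_t page, uint32_t obj, std::string after) {
        records.push_back({page, obj, std::move(after)});
    }
    // Shipped to the server during two-phase commit (or earlier, on dirty-page replacement).
    std::vector<LogRecord> shipToServer() {
        std::vector<LogRecord> out = std::move(records);
        records.clear();
        return out;
    }
    void abort() { records.clear(); }   // an aborting transaction simply drops its records
};

struct ServerPageStore {
    // pageId -> (objectId -> object bytes); pages not cached must be reread from disk first.
    std::map<uint32_t, std::map<uint32_t, std::string>> pages;

    void redo(const std::vector<LogRecord>& recs) {
        for (const auto& r : recs)
            pages[r.pageId][r.objectId] = r.afterImage;   // install the update the record describes
    }
};

int main() {
    ClientLogCache client;
    ServerPageStore server;
    client.logUpdate(7, 3, "new-value");
    server.redo(client.shipToServer());        // commit path: the server replays the log records
    std::cout << server.pages[7][3] << "\n";   // prints "new-value"
}
```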

4 A HIERARCHICAL CACHE CONSISTENCY ALGORITHM

In this section, we describe our new cache consistency protocol. Our approach is based on the Callback-Read algorithm studied in [11] for page level locking in a client-server architecture. As discussed in Section 2, Callback-Read was extended in [5] to support fine-grained cache consistency in a page server context. In this paper, the adaptive algorithm (PS-AA) from [5] is further extended in two dimensions: to support hierarchical locking and to handle the full complexity implied by a real implementation in the context of SHORE’s peer-servers architecture. Before we embark on the description of our algorithm, we need to set the stage regarding notation. As shown in Fig. 2, a transaction originates at a client and then spreads to the server. As we discuss below, a transaction may also spread (temporarily) to other clients during callbacks. Therefore, a single transaction can have many threads at a time, each running on a different site. In the discussion that follows, we will use the name of the client where the transaction originates together with a numeric id unique within that client to globally identify a transaction. To distinguish different threads of the same transaction, the global id will be augmented with the name of the site where a particular thread is running. Our cache consistency algorithm will be described in three steps. First, the basic ideas of the algorithm are given and, next, some of the interesting/hard parts are described in more depth. In these two steps, we will assume that transactions only ask for object-level locks. In the third step, we show how the algorithm is extended to handle explicit transaction lock requests at any level of a locking hierarchy. We will consider a four-level hierarchy consisting of objects, pages, files, and volumes. In every case, when a transaction thread requests a lock on item I, the lock manager at the site where the thread runs automatically acquires the appropriate intention mode locks on the ancestors of I (if any). The usual five lock modes are supported: IS, IX, SH, SIX, and EX [12]. Finally, during most of the following descriptions, it will be assumed that objects are smaller than a page and the updates do not change the size of objects; large objects and size-changing updates are discussed in Section 4.4.
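For reference, the sketch below spells out the standard multigranularity compatibility matrix and the intention locks implied on ancestors; the hierarchy and names are illustrative rather than SHORE's actual lock manager interface.

```cpp
#include <cassert>
#include <vector>

enum Mode { IS, IX, SH, SIX, EX };

// Standard compatibility matrix for multigranularity locking [12].
bool compatible(Mode held, Mode requested) {
    static const bool table[5][5] = {
        //            IS     IX     SH     SIX    EX
        /* IS  */ { true,  true,  true,  true,  false },
        /* IX  */ { true,  true,  false, false, false },
        /* SH  */ { true,  false, true,  false, false },
        /* SIX */ { true,  false, false, false, false },
        /* EX  */ { false, false, false, false, false },
    };
    return table[held][requested];
}

// The intention mode acquired on every ancestor of an item locked in mode m.
Mode intentionFor(Mode m) {
    switch (m) {
        case IS: case SH:  return IS;
        case IX: case EX:  return IX;
        case SIX:          return IX;   // SIX requires at least IX on each ancestor
    }
    return IS;
}

// Locking an object in mode m implies intention locks on its page, file, and volume.
std::vector<Mode> locksForObject(Mode m) {
    Mode i = intentionFor(m);
    return { i /*volume*/, i /*file*/, i /*page*/, m /*object*/ };
}

int main() {
    assert(compatible(IX, IS));
    assert(!compatible(SH, EX));
    auto path = locksForObject(EX);   // IX on volume, file, and page; EX on the object
    assert(path.back() == EX);
    (void)path;
}
```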


4.1 The Basic Algorithm

Callback-Read guarantees that copies of objects in client caches are always valid, so clients can read cached objects without server intervention. In traditional page server systems, an object is cached at a site if its containing page is cached there. Our fine-grained cache consistency algorithm requires that a page-based buffer manager be extended to keep track of the “available” objects within each cached page. An object is considered locally cached if its containing page is cached and the object is marked as “available”; otherwise, the object is not cached, and the client must request the object from the server in order to read it. In contrast to object reads, clients must always request the server’s write permission in order to update an object. Write permission is granted after the requested object is locked exclusively at the server and all cached copies of the object (except the one at the requesting client) have been invalidated. The server maintains a copy table to keep track of the locations of cached pages throughout the system. The basics of the object-read and object-write operations are described below. In Section 4.1.1, we ignore adaptive locking; thus, this section describes the PS-OA algorithm of [5]. Adaptive locking is then taken into account in Section 4.1.2.
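A minimal sketch of the bookkeeping this implies, under assumed (illustrative) names: each cached page frame carries per-object availability flags, and the server's copy table maps page ids to the set of caching clients.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Client side: a page frame extended with per-object availability flags.
struct CachedPage {
    std::vector<char> bytes;        // the page image itself
    std::vector<bool> available;    // one flag per object slot in the page
};

struct ClientCache {
    std::map<uint32_t, CachedPage> pages;   // pageId -> frame

    // An object is "locally cached" only if its page is present and its slot is available.
    bool objectCached(uint32_t pageId, uint32_t slot) const {
        auto it = pages.find(pageId);
        return it != pages.end() && slot < it->second.available.size()
            && it->second.available[slot];
    }
};

// Server side: the copy table tracks where each page is cached in the system.
struct CopyTable {
    std::map<uint32_t, std::set<int>> copies;   // pageId -> set of client ids

    void registerCopy(uint32_t pageId, int client) { copies[pageId].insert(client); }
    void dropCopy(uint32_t pageId, int client)     { copies[pageId].erase(client); }

    // Clients that must receive callbacks for this page (everyone except the requester).
    std::set<int> callbackTargets(uint32_t pageId, int requester) const {
        auto it = copies.find(pageId);
        std::set<int> targets = (it == copies.end()) ? std::set<int>{} : it->second;
        targets.erase(requester);
        return targets;
    }
};

int main() {
    ClientCache cache;
    CopyTable table;
    table.registerCopy(42, /*client*/ 1);
    table.registerCopy(42, /*client*/ 2);
    auto targets = table.callbackTargets(42, /*requester*/ 1);   // contains client 2 only
    (void)targets; (void)cache;
}
```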

4.1.1 Object Locking w/Adaptive Callbacks (PS-OA)

In order for a transaction A1 (on client A) to gain read access to an object, its master thread A1,A first obtains a local read (SH) lock on the object. Next, A1,A checks whether the object is locally cached. If so, the transaction can safely read it without server intervention. Otherwise, a read request is sent to the server, i.e., to thread A1,S. A1,S asks for the object read lock at the server as well. When the lock is (eventually) granted (if possible¹), a copy of the page that contains the requested object is sent to the requester. However, before shipping the page, A1,S updates the server’s copy table, and marks as “unavailable” any objects in the page that are write-locked by transactions running on clients other than A. When thread A1,A receives the page from the server, it places the page in the local cache. If, however, the page is cached already and contains objects already updated by A1 or other active local transactions, A1,A merges the incoming copy with the local copy of the page (being sure not to overwrite any of the dirty objects). All of the objects on the page, except for those marked as “unavailable,” are then considered to be cached at the client. In order to update an object X that resides inside a page P, a transaction A1 must obtain write (EX) locks on the object both locally and at the server. Fig. 3 shows the message flow during an update operation. As before, the local lock is obtained first at client A; a write request is then sent to the server. At the server, thread A1,S obtains an EX object lock and then issues callback requests to all clients (except its own) that have a cached copy of page P. Fig. 3 indicates the three cases that can arise when such a client receives a callback request.² At client B, the page has not been accessed by

1. In general, a lock request may fail if a deadlock occurs and the requesting transaction is chosen as a victim.
2. In every case, a new thread is allocated at the client to handle the callback request. The thread is associated with the calling-back transaction. When the callback is done, any locks that have been acquired by the callback thread are released and the callback thread itself is deallocated.


Fig. 3. Message flow during a write operation.

any active local transaction; the callback thread A1,B acquires an EX lock on the whole page, purges the page from the cache and, then, finishes by sending an acknowledgment to the server. At client C, no active local transaction has accessed object X, but transaction C1 has accessed another object in page P. In this case, A1,C cannot acquire the EX page lock immediately. Instead of waiting for the page lock, A1,C requests an EX lock on object X, which is granted immediately. A1,C then marks object X as “unavailable,” essentially purging the object from C’s cache. As before, A1,C terminates by sending an acknowledgment to the server. Finally, at client D, transaction D1 has accessed object X. In this case, thread A1,D must wait for an EX lock on X. Before doing so, A1,D sends the server a list of all local transactions holding locks on X (in the example of Fig. 3, this list contains D1 only). As we discuss in Section 4.2, this list is used to perform deadlock detection at the server. Eventually, after the conflicting transaction D1 terminates, A1,D will obtain the requested EX lock on X. It will then mark the object as “unavailable” and send an acknowledgment to the server. When A1,S has received acknowledgments from all of the callbacks it has issued, it sends write permission back to its master thread (A1,A), which then updates the object inside A’s cache.
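The three cases of Fig. 3, as seen by a client-side callback thread, might be outlined as follows; the lock and cache interfaces are hypothetical simplifications rather than SHORE code.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct ObjectId { uint32_t page; uint32_t slot; };
enum class Mode { SH, EX };

// Illustrative client-side state seen by a callback thread.
struct ClientState {
    bool tryPageLock(uint32_t, Mode)        { return true; }   // non-blocking attempt
    bool tryObjectLock(ObjectId, Mode)      { return true; }
    void waitObjectLock(ObjectId, Mode)     {}                 // blocks until granted
    std::vector<int> localHolders(ObjectId) { return {}; }     // local txns holding locks on X
    void purgePage(uint32_t)                {}
    void markUnavailable(ObjectId)          {}
};

using Send = std::function<void(const char* msg, const std::vector<int>& holders)>;

// Callback for object X in page P, issued on behalf of a remote updater.
void handleCallback(ClientState& c, ObjectId x, const Send& send) {
    if (c.tryPageLock(x.page, Mode::EX)) {        // case B: page not in use locally
        c.purgePage(x.page);
        send("callback-okay", {});
    } else if (c.tryObjectLock(x, Mode::EX)) {    // case C: page in use, but X is not
        c.markUnavailable(x);
        send("callback-okay", {});
    } else {                                      // case D: X is locked by a local transaction
        send("callback-blocked", c.localHolders(x));   // let the server replicate the conflict
        c.waitObjectLock(x, Mode::EX);                 // wait until the conflicting txn ends
        c.markUnavailable(x);
        send("callback-okay", {});
    }
}

int main() {
    ClientState c;
    handleCallback(c, {7, 3}, [](const char*, const std::vector<int>&) {});
}
```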

Our last remark in this section concerns the management of the server’s copy table. When a page drops out of a client cache, the server must eventually be informed so that it can delete the associated entry from its copy table. To do so, information about purged pages is piggybacked on messages sent to the server. If a page is purged while it is used by an active local transaction, information about the local locks held on the objects of the page is also piggybacked on the next message to be sent to the server. These locks will then be replicated at the server when the message arrives there.

4.1.2 Adaptive Locking w/Adaptive Callbacks (PS-AA)

Our cache consistency algorithm uses adaptive locking in order to reduce the number of write requests sent from the clients to the server. In the absence of page level conflicts, transactions acquire adaptive page locks when they update a page. An adaptive page lock gives a transaction the server’s permission to update any object in the page without any further server intervention. Transactions deescalate to finer-grained operation only for pages on which conflicts arise and, then, “reescalate” if they determine that the contention that caused the deescalation has dissipated. When a transaction A1 wishes to update an object X, a local EX lock is acquired on X first (as in PS-OA). A1,A then checks whether it holds an adaptive lock on the page that contains X.


Fig. 4. Callbacks and deadlock detection.

If so, the transaction can proceed with the update. Otherwise, a write request is sent to the server. Initially, the write request is served as in PS-OA—an EX object lock is acquired at the server, and callbacks are sent to the clients caching the page (if any). At each client, the callback will invalidate the whole page if possible. After thread A1,S has collected all the callback acknowledgments, there are two possible outcomes: If one or more clients were using the page, then no adaptive page lock is possible, and transaction A1 is granted permission only to update the specific object (as in PS-OA). Otherwise, if the page has been successfully invalidated everywhere (i.e., nobody was using it), an adaptive page lock is granted to A1. Rather than introducing a new lock mode—whose compatibility with the other modes would be hard to define—an adaptive page lock is represented at the server by setting an adaptive bit inside the page lock held by the requesting thread (at least, an IX page lock must exist since the requesting thread holds an EX lock on the requested object). Given this implementation of adaptive locks, multiple transactions from the same client may hold adaptive locks on a page P simultaneously. The local object locks, which are always acquired at the home client, are used to detect object-level conflicts among these transactions when they access page P. While a transaction from a client A holds an adaptive lock on page P, no other client can have a copy of page P cached. Thus, any attempt by a transaction from another client, say transaction B1, to access (read or write) an object X in page P will surely result in a request being sent to the server. Upon receiving the request, thread B1,S will deescalate any adaptive locks held on page P before requesting the object lock on X. To do so, B1,S sends client A a deescalation request. Client A replies with a message that contains a list of the EX object locks that are being held by local transactions on objects inside page P. After receiving the deescalation reply, B1,S will replicate at the server the object locks specified in the reply, doing so on behalf of the transactions whose adaptive lock on page P is being de-escalated. B1,S can then proceed by requesting the appropriate lock on the original object X.
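A sketch of how an adaptive page lock could be represented and de-escalated, following the adaptive-bit description above; the structures and names are assumptions made for illustration.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Server-side view of one client's lock on a page (illustrative).
struct PageLockEntry {
    bool adaptive = false;            // the "adaptive bit" set inside an IX page lock
    std::set<uint32_t> exObjects;     // object-level EX locks replicated after de-escalation
};

struct ServerLocks {
    // (clientId, pageId) -> page lock entry
    std::map<std::pair<int, uint32_t>, PageLockEntry> pageLocks;

    // Grant an adaptive lock after the callbacks purged the page from all other clients.
    void grantAdaptive(int client, uint32_t page) {
        pageLocks[{client, page}].adaptive = true;
    }

    // Another client touched the page: clear the adaptive bit and replicate the
    // holder's object-level EX locks (reported in the de-escalation reply) at the server.
    void deescalate(int holder, uint32_t page, const std::vector<uint32_t>& exObjectsAtHolder) {
        auto& entry = pageLocks[{holder, page}];
        entry.adaptive = false;
        entry.exObjects.insert(exObjectsAtHolder.begin(), exObjectsAtHolder.end());
    }

    bool mayUpdateLocally(int client, uint32_t page) const {
        auto it = pageLocks.find({client, page});
        return it != pageLocks.end() && it->second.adaptive;
    }
};

int main() {
    ServerLocks locks;
    locks.grantAdaptive(/*client A*/ 1, /*page*/ 42);
    // Client A may now update any object in page 42 without contacting the server...
    bool ok = locks.mayUpdateLocally(1, 42);
    // ...until a request from another client forces de-escalation of A's adaptive lock.
    locks.deescalate(1, 42, {3, 5});
    (void)ok;
}
```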

4.2 Design and Implementation Issues

In this section, we describe both the policy decisions made during the design of our algorithm and the mechanisms that we chose to implement these policies. Our engineering philosophy in choosing such mechanisms was to try to “project” the various scenarios that can arise in our distributed system into corresponding centralized scenarios. Specifically, for each distributed scenario, we try to create a lock state at the server that could have arisen under a similar centralized scenario (i.e., if there were no clients and all transactions were submitted directly to the server). The advantage of this approach is that it makes it easier to guarantee the correctness of our algorithm. Since lock state in a centralized DBMS is represented by a lock table, we have avoided introducing additional data structures to represent the server’s lock state. Instead, a traditional centralized lock table was extended to support our distributed cache consistency algorithm; the required extensions were minor, though. This choice offers an additional advantage in that a lock table is built as an efficient data structure to be shared among concurrent preemptive threads; adding new and different data structures in an environment of preemptive threads can be hard because of the additional thread synchronization mechanisms required to access these data structures.

4.2.1 Callbacks and Deadlock Detection

Distributed deadlock detection is an important issue in any distributed system. In general, callback-based algorithms


provide for the early detection of some distributed deadlocks by replicating, at the server, information about lock conflicts that arise at clients during callback requests. We will describe the way our algorithm works in this respect by using the example illustrated in Fig. 4. Let us assume that transaction A1 wishes to update an object X that is contained in page P. After thread A1,S acquires the EX lock on X at the server, it issues callbacks to clients B and C, both of which are caching page P. At client B, object X is not locked by any local transaction, so the callback returns immediately. However, suppose that right after the callback returns from client B, transaction B1 tries to read object X (which is not cached there any more). B1,B will send a read request that will block at the server due to the EX lock held there by A1,S. At client C, the callback request blocks due to a lock held on object X by thread C1,C. As described in the previous section, A1,C sends a “callback-blocked” reply containing C1’s transaction id. In the current scenario, this reply arrives at the server after B1’s read request. Figs. 4b and 4c show the state of the server’s lock table with respect to object X before and after the arrival of the “callback-blocked” reply. In Fig. 4b, thread B1,S has received the read request for object X, and it is waiting behind thread A1,S for an SH lock on X. As shown in Fig. 4c, upon receiving the “callback-blocked” message, thread A1,S downgrades its EX lock on X to an SH lock, acquires an SH lock on X on behalf of thread C1,S, and then tries to upgrade its SH lock on X.³ At this point, the lock conflict detected at client C has been replicated at the server, and A1,S invokes the deadlock detector to check for deadlocks. In this case, there are none. When transaction C1 terminates, it releases its locks—first at the server and then at client C—thus allowing both threads A1,S and A1,C to acquire their waited-for EX locks on object X. A1,C will then invalidate object X from C’s cache and, then, finish by sending the “callback-okay” to A1,S. When A1,S receives this waited-for acknowledgment, it will send the write permission on X to its master thread (A1,A). In the meantime, thread B1,S remains blocked, and it will stay so until A1 terminates.

3. Thread A1,S does not really block on this lock upgrade request; instead, it waits until it receives acknowledgments from all the callbacks it has issued.
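The lock-table manipulation of Fig. 4 can be outlined as below; the grant/wait queue is a deliberately simplified stand-in for the server's real lock table.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class Mode { SH, EX };

struct LockRequest {
    std::string txn;   // e.g., "A1,S", "B1,S", "C1,S"
    Mode mode;
    bool granted;
    bool upgrading = false;
};

// One lock-table entry (for object X) at the server, modeled as a grant/wait queue.
struct LockEntry {
    std::vector<LockRequest> queue;
};

// On receiving a "callback-blocked" reply naming remote holders of X, the
// calling-back thread replicates the conflict so the deadlock detector can see it.
void replicateConflict(LockEntry& x, const std::string& callingBack,
                       const std::vector<std::string>& remoteHolders) {
    for (auto& r : x.queue)
        if (r.txn == callingBack && r.mode == Mode::EX) {
            r.mode = Mode::SH;        // 1. downgrade EX -> SH (do not release)
            r.upgrading = true;       // 3. then wait to upgrade back to EX
        }
    for (const auto& h : remoteHolders)   // 2. grant SH on behalf of the blocking client txns
        x.queue.push_back({h, Mode::SH, /*granted*/ true});
    // 4. the deadlock detector is now invoked on the replicated wait-for state
}

int main() {
    LockEntry x;
    x.queue.push_back({"A1,S", Mode::EX, true});    // calling-back transaction holds EX on X
    x.queue.push_back({"B1,S", Mode::SH, false});   // B1's read request waits behind it
    replicateConflict(x, "A1,S", {"C1,S"});         // "callback-blocked" arrives from client C
}
```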

4.2.2 Callbacks and Serializability

In general, during a callback operation, the state of the object that is being called back (e.g., the places where it is cached and its lock state at each site) undergoes many transitions. Hence, during this time, the system is susceptible to serializability violations if other transactions are erroneously allowed to gain access to the object. In order to bound the complexity of our algorithm and protect against such errors, two objectives were set. First, there should be only one outstanding callback operation per object at a time. Second, during a callback operation for an object X, no transaction (except the calling-back one) should be allowed to gain access to X. In fact, the second objective can be rephrased more precisely as follows: After a callback thread at a client A terminates successfully, no transaction from client A should be allowed to fetch the object into A’s cache (and, thus, gain read permission on it) until the calling-back transaction terminates. In the rest of this subsection, and in


the next subsection, we show how these objectives are enforced. The first objective is achieved by the following two rules (illustrated previously in Fig. 4): 1) A calling-back transaction must acquire, at the server, an EX lock on the requested object before it sends out its callback requests, and 2) A calling-back transaction must downgrade (rather than release) this EX object lock when a “callback-blocked” reply is received (if any). Given these rules, a calling-back transaction holds at least an SH server lock on the requested object during the callback operation, thus preventing any other transaction from acquiring an EX lock on the same object and starting a concurrent callback operation for that object. In the example of Fig. 4, rules 1 and 2 were enough to make transaction B1 wait for A1, thus achieving the second objective as well. However, additional mechanisms are required to enforce the second objective in general; the next subsection describes these mechanisms.

4.2.3 “Unavailable” Objects and Serializability

Let us reconsider the example of Fig. 4 with two modifications: Transaction B1 tries to read object Y (instead of X), which is in the same page as X, and B1’s read request arrives at the server after (rather than before) the “callback-blocked” message from client C. The read request for Y prompts thread B1,S to acquire a read lock on Y (which is granted with no delay) and to ship page P to client B. Before shipping the page, B1,S checks for objects that should be marked “unavailable.” Obviously, X is such an object because transaction A1 is trying to update X and, in this example, A1’s write request will (eventually) succeed. However, due to the “callback-blocked” reply from client C, X is not EX-locked by A1,S at the time that B1,S checks for unavailable objects. To handle this scenario, the rule for marking objects “unavailable” is restated here more precisely: Before shipping a page P to a requesting client, the server marks an object X in P as “unavailable” if: 1) X is not the requested object, and either 2) X is EX-locked by a transaction running on another client, or 3) there is a pending callback operation on X by a transaction running on another client. The purpose of condition 1 is to make sure that the requested object is always marked “available,” as this object is SH-locked by the requesting transaction at the server before applying this rule. (Without getting into further details, we note that, due to a race condition, it is possible for both conditions 1 and 3 to be true at the same time; that is why we need to make condition 1 explicit.) Another question that needs to be considered carefully is what objects should be viewed as “unavailable” once a new copy of a page arrives at a client. Consider an object X that arrives at a client A inside a copy of a page P. Object X may be cached at client A already. The final availability of X depends on its current cached state, the new state proposed by the server (indicated inside the new copy of P), and on whether a callback race (see below) has occurred. If object X


Fig. 5. Callback race condition.

is already cached at client A, then X will remain cached, even if its proposed new state is “unavailable.” (If the proposed state is “unavailable,” a callback for X is already on the way, or such a callback has arrived but is blocked by a local transaction; it will be this callback that will eventually invalidate X.) If the object is not cached and its proposed state is “unavailable,” the final state is also “unavailable.” Finally, if the object is not cached and the proposed state is “available,” the final state is “available” unless a callback race has been registered for the object.
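The two availability rules just described (the server's rule before shipping a page, and the client's rule when a copy arrives) might be coded roughly as follows; the predicates are placeholders for the corresponding lock-table and race-table lookups.

```cpp
#include <cstdint>
#include <functional>

struct ObjectId { uint32_t page; uint32_t slot; };

// Server side: mark object X "unavailable" in the copy shipped to the requester if
// X is not the requested object AND (X is EX-locked elsewhere OR a callback on X is pending).
bool shipAsUnavailable(ObjectId x, ObjectId requested,
                       const std::function<bool(ObjectId)>& exLockedByOtherClient,
                       const std::function<bool(ObjectId)>& callbackPendingByOtherClient) {
    if (x.page == requested.page && x.slot == requested.slot)
        return false;   // condition 1: the requested object always stays available
    return exLockedByOtherClient(x) || callbackPendingByOtherClient(x);   // conditions 2 and 3
}

// Client side: decide the final availability of X when a new copy of its page arrives.
bool finalAvailability(bool cachedLocally, bool proposedAvailable, bool callbackRaceRegistered) {
    if (cachedLocally)
        return true;    // already cached; a pending callback (if any) will invalidate it later
    if (!proposedAvailable)
        return false;   // not cached and the server says unavailable
    return !callbackRaceRegistered;   // available unless a callback race was registered
}

int main() {
    // The requested object itself is always shipped as available.
    bool u = shipAsUnavailable({7, 3}, {7, 3},
                               [](ObjectId) { return true; },
                               [](ObjectId) { return true; });
    bool a = finalAvailability(/*cached*/ false, /*proposed available*/ true, /*race*/ true);
    (void)u; (void)a;   // u == false, a == false
}
```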

4.2.4 Callback Races and Other Race Conditions

A callback race is depicted in Fig. 5. Transactions A1 and B1 are trying to read and update objects X and Y, respectively. We assume that both X and Y reside in the same page P, which is cached at both clients, but that object X is not cached (i.e., it is “unavailable”) at client A. Upon receiving the read request from client A, the server replies with a new copy of P on which object Y is marked “available.” Shortly after the server ships the page to client A, the write request for Y arrives from client B and generates a callback for client A. Given our thread communication model (see Section 3.2), it is possible that the callback will be handled at client A before the read reply; this is the case shown in Fig. 5. The callback will then mark object Y “unavailable,” and Y should remain “unavailable” when the delayed read reply arrives—even though this reply says otherwise. To handle this race condition, a callback request checks, after acquiring the object lock at the client, whether there is a pending read request for the page of the specified object. If so, the callback request registers a callback race for the specified object into a callback race table. When a client receives a read reply, it searches the callback race table while following the protocol described in the previous paragraph. Any races found during this protocol are then deleted from the callback race table. There are two more race conditions that arise due to the looseness of message ordering in SHORE. The first is related to copy table management. Let us assume that page P is purged from a client A, and that this fact is piggybacked in a message M sent from client A to the server. Also, assume that, before message M arrives at the server, a transaction on client A attempts to read page P and, as a result, sends a read request for P. A purge race occurs if the read request arrives at the server before the purge request in

message M. If we don’t handle this case correctly, the purge request will delete page P from the copy table although the page is cached at client A. The second race condition relates to lock deescalation requests; this race is similar to the callback race and is handled in a similar manner. Rather than getting into the details of how these race conditions are handled, we will just sketch the general idea behind our handling of such races. Our approach is to detect the occurrence of such races and to temporarily remember the race until a corrective action is taken. Due to the nature of the race conditions, the system is able to detect their occurrences using existing status information (like the fact that there is a pending remote read request in the case of the callback race). As a result, our handling of these race conditions involved no space overhead for storing additional status information.
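A minimal sketch of the callback race table: the callback thread registers a race when it finds a pending read request for the object's page, and the read-reply handler consults and clears those entries. The structure and names are assumptions, not SHORE's actual code.

```cpp
#include <cstdint>
#include <set>
#include <utility>

struct ObjectId { uint32_t page; uint32_t slot; };

// Client-side table of objects for which a callback "won the race" against a read reply.
class CallbackRaceTable {
    std::set<std::pair<uint32_t, uint32_t>> races;
public:
    // Called by a callback thread, after it has locked the object, if a read
    // request for the object's page is still outstanding at this client.
    void registerRace(ObjectId x) { races.insert({x.page, x.slot}); }

    // Called by the read-reply handler: if a race was registered, the object must
    // stay "unavailable" even though the (stale) reply marks it available.
    bool consumeRace(ObjectId x) { return races.erase({x.page, x.slot}) > 0; }
};

int main() {
    CallbackRaceTable table;
    ObjectId y{7, 5};
    table.registerRace(y);                         // callback for Y handled before the read reply
    bool keepUnavailable = table.consumeRace(y);   // read reply arrives later: keep Y unavailable
    (void)keepUnavailable;
}
```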

4.3 Hierarchical Cache Consistency

In SHORE, the system-wide locking granularity can be set to any of the following levels: volume, file, page, or object. Applications can request locks at any granularity coarser than or equal to the system-wide granularity. In effect, applications can choose the appropriate locking granularity on a per-workload or per-operation basis. In this section, we describe how our cache consistency algorithm was extended to support hierarchical locking. First, we describe a simplified version, where all explicit lock requests at levels higher than the object level are always propagated to the server; next, we present our final optimized version where page read locks are treated the same way as object read locks, i.e., a page read lock is acquired only at the client if the whole page is cached. Before we proceed, we need to clarify the relationship between hierarchical and adaptive locking. The purpose of adaptive locking is to reduce the message overhead (by saving object-level write lock requests); it is not visible to applications. Adaptive page locks are acquired internally by the system as a result of object write requests. In contrast, hierarchical locking is visible to applications; e.g., a transaction may explicitly request an EX page lock and, once granted, this lock cannot be deescalated. Hierarchical locking thus allows application programmers to explicitly control the locking granularity if they know what the appropriate granularity should be.


4.3.1 Nonhierarchical Callbacks

As described earlier, a request for an object-level read lock is not propagated to the server if the object is cached locally. This is not practical for file locks, however, as most files are, in general, only partially cached.⁴ Therefore, our algorithm always acquires explicitly requested file locks at both the client and the server. In this subsection, we will assume that explicit page-level locks are treated the same way as file locks as well, so we will only discuss files. We will say that a file is “cached” at a client if at least one page of the file is cached there. The copy table is then extended to keep track of files as well as pages. An explicit file lock request is served by acquiring the lock at the requesting client first and then propagating the lock request to the server. The mode of the request can be any one of the five supported modes. Despite this protocol, a file may be IS-locked at a client without such a lock appearing (or having been requested) at the server. This can happen because a transaction that reads a locally cached object does not send any request to the server; the transaction acquires a local-only SH lock on the object—and, thus, an implicit IS local lock on the file as well. When it receives a request for an EX file lock, the server will acquire the lock on the file and, then, send file-level callbacks to all the clients (except the requesting one) that are caching the file. Each such file callback will first acquire an EX file lock at the client; it will then search the client cache, purging any pages belonging to the file, and, then, send an acknowledgment to the server. Of course, like an object callback, a file callback may have to wait due to a conflicting file lock. If this occurs, the callback will send the server a list of all local transactions holding locks on the file, as before. Normally, these will be local IS file locks that do not appear at the server. Therefore, in order to check for deadlocks, the server thread of the calling-back transaction downgrades its EX file lock to SIX, acquires IS file locks on behalf of the transactions contained in the “callback-blocked” reply, and, then, tries to upgrade its SIX file lock back to EX mode. At this point, the callback conflicts have been replicated at the server, so the deadlock detector can now be invoked to check for deadlocks. Before we conclude this section, we have to justify its title, which stems from the fact that a callback request for an item I never blocks at a granularity level higher than the level of I. The reason for this is as follows: A callback that arrives at a client implicitly requests IX locks on the ancestors of I (since it requests an EX lock on I). In principle, then, the callback thread could potentially block due to a SH, SIX, or EX lock held by a local transaction at the client. This cannot happen, however, because the conflicting transaction would have to be holding the same conflicting lock at the server as well, which is not possible since the calling-back transaction acquired an EX lock on item I (and thus implicit IX locks on its ancestors) at the server before starting the callback operation. Since a callback for item I cannot encounter a lock conflict at a level higher than the level of I, callbacks do not actually try to acquire such higher granularity locks.

4. The same is true of volumes. In fact, volumes will be ignored in the remainder of this section, as they are always treated the same way as files.
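The file-level analogue of the conflict replication shown earlier might look as follows, following the SIX-downgrade sequence just described; again, the lock-table model is an illustrative simplification.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class Mode { IS, IX, SH, SIX, EX };

struct FileLock {
    std::string txn;
    Mode mode;
    bool granted;
    bool upgrading = false;
};

// Server-side lock entry for one file, modeled as a simple grant/wait list.
struct FileLockEntry {
    std::vector<FileLock> queue;
};

// A file callback blocked at a client on local IS file locks (which were never sent
// to the server). The calling-back thread replicates them before deadlock detection.
void replicateFileConflict(FileLockEntry& file, const std::string& callingBack,
                           const std::vector<std::string>& localISHolders) {
    for (auto& l : file.queue)
        if (l.txn == callingBack && l.mode == Mode::EX) {
            l.mode = Mode::SIX;      // 1. downgrade EX -> SIX (still excludes other writers)
            l.upgrading = true;      // 3. then wait to upgrade back to EX
        }
    for (const auto& t : localISHolders)   // 2. grant IS file locks on behalf of the blockers
        file.queue.push_back({t, Mode::IS, /*granted*/ true});
    // 4. the deadlock detector now sees the conflict that arose at the client
}

int main() {
    FileLockEntry f;
    f.queue.push_back({"A1,S", Mode::EX, true});   // transaction requesting the EX file lock
    replicateFileConflict(f, "A1,S", {"B7,S"});    // "callback-blocked" reply names local holders
}
```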

4.3.2 Hierarchical Callbacks

The algorithm described so far can be optimized further by avoiding the propagation of explicit page-level IS and SH lock requests when the page is already cached at the requesting client. To be precise, the propagation of an SH page lock can be avoided only if the page is “fully” cached, where a page is said to be fully cached if it is cached and contains no “unavailable” objects. If a page does contain “unavailable” objects, then transactions from other clients may hold EX locks on these objects, so an SH page lock request must check for such conflicts at the server. In contrast to files, this optimization is practical because it is cheap to check whether a page is fully cached. The impact of this optimization on the rest of the algorithm is two-fold. First, when the server receives an explicit IX or SIX page lock request, it must send callbacks to all clients (except the requesting client) that are caching the page. This is necessary because client transactions may now hold SH page locks that have not been requested from the server. To simplify the implementation, the callbacks generated in this case are actually treated like object-level callbacks, where the requested object is a special “dummy” object (and each page has its own reserved dummy object). A successful dummy-object callback will invalidate the dummy object of the specified page (if not the whole page). The state of the dummy object must be taken into account by a client that tries to determine whether or not a page is fully cached. The second impact of the above optimization is on the operation of object-level callbacks. A callback for an object X that arrives at a client A will first try to obtain a page-level EX lock for the purpose of invalidating the whole page (as before). If this lock cannot be granted immediately, then the callback will try to get an IX lock on the page. In contrast to the simplified algorithm described in the previous subsection, this lock may now block due to the existence of a local-only SH page lock held by a local transaction. A “callback-blocked” reply will be sent back in this case in order to replicate the SH page lock at the server. After the callback (eventually) acquires the IX page lock, it will proceed by requesting the object-level EX lock, as before. This protocol implies that an object-level callback may block at the page level, the object level, or both (hence, the title “Hierarchical Callbacks”); as a result, both page and object locks may now have to be replicated at the server. Implementing hierarchical callbacks was not a trivial task. As mentioned already, before starting a callback operation for an object X, the calling-back transaction obtains an EX lock on X. Because of this, the number of possible server lock states that can arise with respect to object X during the callback operation is limited. However, the number of possible server lock states for the page P that contains object X is much greater, as the calling-back transaction usually holds only an IX lock on P (so, for example, many transactions can be calling back different objects of the same page concurrently). The result is that, during the replication at the server of a page lock, many different scenarios must be taken into account. In order to give a flavor of the things that can go wrong (and must be avoided, of course), we will use the example of Fig. 4 again, with one modification: Transaction C1 holds a local SH lock on (the



whole) page P at client C, rather than an SH lock only on the requested object X. Before the “callback-blocked” message arrives from client C, thread B1,S waits behind A1,S, as before. The “callback-blocked” message will indicate that a conflict has occurred at the page level, so, upon its arrival, thread A1,S will try to acquire an SH page lock on behalf of C1,S. To do so, A1,S downgrades its page and object locks to IS and SH modes, respectively, then grants the SH page lock to C1,S, and, finally, makes itself an upgrader at the page level (A1,S cannot be an upgrader for both the page lock and the object lock because a transaction can wait only for one lock at a time). Thread B1,S is, thus, able to “sneak in” and acquire its waited-for SH lock on object X and, then, send X to client B, thus violating the second objective described in Section 4.2.2. To fix this problem, our algorithm is able to detect when violations of the second objective occur. In such cases, the calling-back thread (recursively) repeats the whole callback operation before it grants write permission to its master thread at the client. In our example, this means that, once its page-level upgrade succeeds (i.e., after the conflicting transaction C1 terminates), the calling-back thread A1,S will reacquire the EX lock on object X and, then, send a second round of callbacks for X to clients B and C.

4.4 Large and Flexible Objects

In our discussion so far, we assumed that objects are smaller in size than a disk page and their sizes are fixed. This section describes potential approaches to relaxing these restrictions. We begin by considering the handling of large objects, i.e., objects that span multiple pages. SHORE structures large objects in a hierarchical manner, storing them as a tree of pages. The bottom layer of the hierarchy for a large object consists of a number of data pages that store the actual content of the object; the higher layers form a B-tree-like index that is used to facilitate access to arbitrary byte ranges within a large object. The data and the index pages of a large object are private to the object (i.e., they are not shared with other small or large objects). In SHORE, each large object also has a “header” (or large-object descriptor); it lives on a page with other small objects or large-object headers and points to the root of the tree of pages.⁵ Thus, access to large objects can be controlled by locking their headers, using the PS-AA algorithm as if the large-object headers were small user-level objects. However, there remains the issue of cache coherency for the actual data and index pages of large objects. A callback-based approach can be used in this case as well. According to this approach, to access a large object for read, a client first accesses the header of the object in SH mode using PS-AA. Next, the client fetches the individual large-object pages that it needs from the server. However, large-object pages that are already cached at the client do not need to be fetched from the server, as such cached pages are valid and can be accessed without server intervention. To update a large object, a client first locks the header of the object in EX mode using PS-AA again, which implies that the header will be called back from all other clients that are caching it.

5. In fact, the large-object header can be the actual root of the hierarchy with the restriction that it must be smaller than a full page.


To update a large object, a client first locks the header of the object in EX mode, again using PS-AA, which implies that the header will be called back from all other clients that are caching it. Then, for each individual large-object page that needs to be updated, the updating client asks the server to call back the page from all other clients that are caching it. The server calls back the page and then grants update permission to the requesting client. Notice that no locks need to be acquired on large-object pages (either for reads or for updates), as the lock acquired on the large-object header provides the necessary access protection. The approach of locking the large-object headers is easy to implement and provides full serializability (i.e., degree three consistency). If, however, lower degrees of consistency are acceptable, then transactions may lock individual pages of large objects (rather than whole objects through their headers) using the hierarchical version of PS-AA; more details of this approach can be found in [10].

We now turn to the handling of size-changing updates. As described so far, SHORE allows concurrent updates of the same page by different clients and uses a redo-at-server approach for propagating client updates to the server. If only fixed-length objects are considered, and object creation is ignored, redoing concurrent updates at the server is fairly straightforward. However, when objects can grow in size, or are newly created on a page, redo-at-server becomes more complicated. For example, if two different objects on a given page are both increased in size by concurrent updaters at different clients, it is possible that a subsequent propagation of the two updates will cause the page to overflow at the server.

Such overflows can be handled in a number of ways. For example, a standard forwarding technique (à la [3]) can be used at the server. Another possible approach is to use a space-reservation protocol; e.g., potential updaters could be given a maximum allowable update size by the server, beyond which they would have to perform forwarding themselves as if a page overflow had occurred (even if it really had not). Yet another approach would be to treat the free space of each page as an extra object within the page and require clients to lock the free space before performing a size-growing update. A fourth solution could require the clients to inform the server about the magnitude of object-growing updates; the server would keep track of the free space of remotely cached pages and, based on this information, it would either grant client requests for object enlargements or deny such requests, in which case the clients would have to perform forwarding themselves (a sketch of this option is given below). Finally, the problem of size-changing updates can also be handled by employing the “physical” locks idea of [19], which was discussed briefly in Section 2. With this approach, a client is guaranteed to receive the most recent version of a page before updating the page and, thus, it can safely decide whether there is enough space within the page for the update. (More details about the application of physical locks in a client-server environment can be found in [27].)
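To illustrate the fourth option mentioned above, the following self-contained C++ sketch shows how a server might track the unreserved free space of remotely cached pages and grant or deny object-enlargement requests; the class and its simple first-come-first-served policy are our own assumptions, not SHORE's implementation.

    #include <cstdint>
    #include <unordered_map>

    using PageId = std::uint64_t;

    // Server-side bookkeeping for a space-reservation approach: each page's
    // unreserved free space is tracked, and a client's request to grow an object
    // by 'growth' bytes is granted only if that much space is still available.
    class FreeSpaceTracker {
    public:
        void registerPage(PageId pid, std::uint32_t freeBytes) {
            freeSpace_[pid] = freeBytes;
        }

        // Returns true (and reserves the space) if the enlargement fits; a false
        // return tells the client to forward the object as if the page overflowed.
        bool requestEnlargement(PageId pid, std::uint32_t growth) {
            auto it = freeSpace_.find(pid);
            if (it == freeSpace_.end() || it->second < growth)
                return false;                 // deny: client must forward
            it->second -= growth;             // grant: reserve the space
            return true;
        }

        // Space freed by object shrinkage or deletion is returned to the pool.
        void release(PageId pid, std::uint32_t bytes) { freeSpace_[pid] += bytes; }

    private:
        std::unordered_map<PageId, std::uint32_t> freeSpace_;
    };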

5 EXPERIMENTAL PERFORMANCE STUDY

As detailed in the preceding section, we now have the hierarchical PS-AA algorithm running in the SHORE system. In order to explore the trade-offs between page-level and object-level cache consistency in an actual system, we conducted an experimental performance study by running SHORE on an IBM SP2 shared-nothing parallel machine. During our study, SHORE’s system-wide locking granularity was first set to the page level, causing SHORE to run as a pure page-server system, using the PS algorithm from [5]. Next, the system-wide locking granularity was set to the object level, but with adaptive locking disabled, thus emulating the PS-OA algorithm. Finally, adaptive locking was enabled in order to run the full PS-AA algorithm.

5.1 Experimental Platform Configuration

TABLE 1
EXPERIMENTAL PLATFORM CONFIGURATION

    Quantity            Description                                            Value
    NumApplications     Number of concurrent applications                      10
    ClientBufSize       Per-client cache size                                   25% of DB size
    ServerBufSize       Server cache size                                       50% of DB size
    PeerServerBufSize   Peer server cache size (for peer-server experiments)    25% of DB size
    PageSize            Size of a page                                          4,096 bytes
    DatabaseSize        Size of database in pages                               11,250 (45 MB)
    ObjectsPerPage      Number of objects per page                              20 objects

Table 1 describes our experimental platform configuration. In all our experiments, we ran 10 concurrent instances of our application program. Application programs create and then execute transactions one after another. Transactions consist of a string of object references, and they are created according to the workload specifications to be described shortly. When a transaction aborts, it is re-executed with the same object reference string.

As mentioned in Section 1.2, our performance study includes both client-server and peer-servers experiments. In the client-server configuration, we used 11 SP2 nodes, each running a SHORE peer-server process. One of the peer servers acted as a real server, managing a database stored on a disk attached to its host node; a second local disk was used to store the log. The rest of the peer servers acted as clients: each was linked with one of the application programs. In the peer-servers configuration, the database was partitioned among 10 peer servers, each running on a different SP2 node and serving one of the applications. The exact partitioning of the database depends on the workload and will be described later. Notice that, given the number of applications, the peer-servers configuration has less aggregate memory and CPU power than the corresponding client-server configuration (as it does not include a dedicated server node), but more disk power (as each participating node has its own database and log disks).
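The abort handling of the application programs can be sketched as follows; the randomized abort decision and the empty processing loop are stand-ins for the real SHORE calls, so this is only a minimal illustration of retrying a transaction over the identical reference string.

    #include <cstdint>
    #include <random>
    #include <vector>

    using ObjectId = std::uint64_t;

    // Stand-in for executing one transaction over a fixed reference string.
    // Returns false if the transaction aborted (e.g., on a deadlock timeout).
    bool runTransaction(const std::vector<ObjectId>& refString, std::mt19937& rng) {
        std::bernoulli_distribution aborts(0.05);   // illustrative abort rate
        for (ObjectId oid : refString) {
            (void)oid;                              // per-object processing would go here
            if (aborts(rng)) return false;
        }
        return true;                                // commit
    }

    // Application driver: create a transaction, then retry it with the *same*
    // object reference string until it commits.
    void driver(const std::vector<std::vector<ObjectId>>& transactions) {
        std::mt19937 rng(42);
        for (const auto& refString : transactions) {
            while (!runTransaction(refString, rng)) {
                // aborted: re-execute the identical reference string
            }
        }
    }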

TABLE 2
WORKLOAD PARAMETER DEFINITIONS AND SETTINGS FOR APPLICATION n

    Parameter      Meaning                                        HOTCOLD                           UNIFORM        HICON
    TransSize      Mean no. of pages accessed per trans.          90 or 30                          90 or 30       90 or 30
    PageLocality   No. of objects accessed per page (min-max)     1-7 or 8-16                       1-7 or 8-16    1-7 or 8-16
    HotBounds      Page bounds of hot range                       p + 1 to p + 450, p = 450(n − 1)  -              1 to 2,250
    ColdBounds     Page bounds of cold range                      rest of DB                        whole DB       rest of DB
    HotAccProb     Prob. of accessing a hot page                  0.8                               -              0.8
    HotWrtProb     Prob. of updating a hot object                 0.02 to 0.5                       -              0.02 to 0.5
    ColdWrtProb    Prob. of updating a cold object                0.02 to 0.5                       0.02 to 0.5    0.02 to 0.5
    PerObjProc     Mean application processing time per object    2 msec                            2 msec         2 msec
                   read (doubled if the object is updated)
5.2 Workload Model

The access pattern for each application is generated using the workload model detailed in Table 2. The size of the transactions for a given application is controlled by two parameters: TransSize, which specifies the average number of pages accessed by a transaction, and PageLocality, which specifies a range of values for the number of objects to be accessed per page by a transaction. To generate different sharing patterns, two ranges of database pages are specified for each application: a hot range and a cold range. The probability of a given page access being directed to a page in the hot range is explicitly specified; other accesses go to cold-range pages. For both ranges, the probability that an object read access within the range leads to an update of the object is also specified. The workload parameters also include the average time that an application spends processing an object once the object is resident in the local cache and locked in the proper mode; this cost is doubled if the object is updated.

The data sharing patterns inherent in the workloads of Table 2 were chosen due to their effectiveness as performance discriminators for client caching alternatives6 [4], [11], [5]. The HOTCOLD workload has a high degree of locality per application and a moderate amount of sharing and data contention among applications. The UNIFORM workload is a low-locality workload with no particular per-application data affinity; its level of interapplication data contention is, therefore, somewhat higher than in the HOTCOLD workload. HICON is a workload with much higher data contention. The HICON workload is similar to the skewed workloads used to study shared-disk transaction processing systems; it is included to expose the performance trade-offs that would arise in the (unlikely) event of very high OODBMS data contention.

6. A PRIVATE workload was also used in our earlier simulation study. We do not include PRIVATE in our current study because PS and PS-AA perform identically in this workload, as it contains no data contention whatsoever.
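To make the workload model concrete, here is a self-contained C++ sketch of how a single transaction's reference string could be generated from the Table 2 parameters. The paper does not specify the random distributions used by the actual generator, so uniform choices within the stated ranges are assumed; for UNIFORM, the hot-access probability would simply be set to zero.

    #include <random>
    #include <vector>

    struct WorkloadParams {                  // a subset of Table 2
        int    transSize;                    // no. of pages per transaction (a mean in the paper;
                                             // treated as an exact count here for simplicity)
        int    pageLocMin, pageLocMax;       // objects accessed per page
        int    hotLo, hotHi;                 // page bounds of the hot range
        int    coldLo, coldHi;               // page bounds of the cold range
        double hotAccProb;                   // prob. of accessing a hot page
        double hotWrtProb, coldWrtProb;      // per-object update probabilities
        int    objectsPerPage;               // 20 in our configuration
    };

    // One access: (page, object slot within the page, read or update).
    struct Access { int page, object; bool update; };

    std::vector<Access> makeTransaction(const WorkloadParams& w, std::mt19937& rng) {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::uniform_int_distribution<int> hotPage(w.hotLo, w.hotHi);
        std::uniform_int_distribution<int> coldPage(w.coldLo, w.coldHi);
        std::uniform_int_distribution<int> perPage(w.pageLocMin, w.pageLocMax);
        std::uniform_int_distribution<int> slot(0, w.objectsPerPage - 1);

        std::vector<Access> refs;
        for (int p = 0; p < w.transSize; ++p) {
            bool hot = coin(rng) < w.hotAccProb;
            int page = hot ? hotPage(rng) : coldPage(rng);
            double wrtProb = hot ? w.hotWrtProb : w.coldWrtProb;
            int nObjects = perPage(rng);     // PageLocality objects on this page
            for (int i = 0; i < nObjects; ++i)
                refs.push_back({page, slot(rng), coin(rng) < wrtProb});
        }
        return refs;
    }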


As indicated in Table 2, our study centered around two different settings for the transaction size and page locality parameters. We used a transaction size setting of 90 pages, together with a page locality range of one to seven (averaging four) objects, and we also used a transaction size setting of 30 pages with a page locality range of 8 to 16 (averaging 12) objects. Both settings provide overall average transaction lengths of 360 objects. By varying these two parameters together in this way for each workload, we were able to explore the impact of a higher or lower page locality while keeping the average transaction length constant in terms of the number of objects accessed. Page locality is important because it is a key factor in determining the appropriate locking granularity [5], [27].

During the presentation and analysis of our performance results, we will focus mainly on the trade-offs between PS and PS-AA. PS-OA will be discussed only in the first two experiments in order to illustrate the additional gains of PS-AA (which is adaptive with respect to both write locks and callbacks) over PS-OA (which is adaptive with respect to callbacks only). PS-OA will then be dropped from the rest of our graphs here, as we never found it to perform better than PS-AA.

5.3 Client-Server Experiments

Fig. 6. HOTCOLD: transSize = 90, pageLocality = 4 (avg).

Fig. 7. HOTCOLD: transSize = 30, pageLocality = 12 (avg).

We begin by looking at the results of the HOTCOLD workload for the client-server case. In this workload, each application has an affinity for its own preferred region of the database, directing 80 percent of its accesses to that specific region. Figs. 6 and 7 show the HOTCOLD throughput results for the low and high page localities, respectively. A first remark (that applies in general) is that throughput goes down as the write probability increases because more updates bring more data contention and more work, e.g., more messages for write lock requests and callbacks, more I/Os for installing updates to disk, and also more application processing (as the PerObjProc parameter in Table 2 is doubled when an object is updated).

We now focus on the results shown in Fig. 6 for the low page locality case. Initially, when the write probability is very low, object updates are few and data contention is low. We see that, at 0.02 write probability, the basic page server (PS) is only slightly worse than the fully adaptive page server (PS-AA) and the adaptive callback page server (PS-OA). As the write probability is increased, however, page-level data contention grows rather rapidly. Thus, PS experiences contention due to false sharing that PS-AA avoids by de-escalating locks when necessary. PS-OA manages to avoid high contention as well by using object-level locks at all times. However, since individual object-level write locks involve messages, the performance of PS-OA worsens slightly at high write probabilities. In this region, PS-AA wins because only about 10 percent of the pages that a given transaction updates turn out to require object locks; the rest are page-locked (using adaptive page locks, as described in Section 4.1.2), saving a number of write lock messages compared to PS-OA. Nevertheless, PS-OA stays very close to PS-AA for the entire range of the write probability, because messages in the SP2 are relatively cheap and the difference in the message count between PS-OA and PS-AA is not very large, due to the low page locality.

Next, we turn to Fig. 7, the case with much higher page locality. The relative ordering of PS-AA and PS-OA is the same as before, but PS does very well here. This is due to the much higher page locality, which enables transactions to process the same number of objects with fewer page accesses, yielding much less data contention for PS than before. With its contention problems largely swept aside, PS benefits significantly from the fact that, due to its page-level nature, it locks and calls back a number of objects at once. Put differently, the alternatives that request locks on a per-object basis suffer more here, as there is now a higher relative message overhead for operating at the object level. This explains why PS-OA performs worse here as the write probability is increased: with respect to object writes, PS-OA always operates at the object level and, as a result, the difference in the message count between PS-OA and PS is much higher here than in the low page locality case.


Fig. 8. UNIFORM: transSize = 90, pageLocality = 4 (avg).

Fig. 9. UNIFORM: transSize = 30, pageLocality = 12 (avg).

Fig. 10. HICON: transSize = 90, pageLocality = 4 (avg).

Fig. 11. HICON: transSize = 30, pageLocality = 12 (avg).

In contrast, because of its adaptive nature, PS-AA is still the best alternative. PS-AA is slightly better than PS at medium write probabilities, but loses its advantage when the write probability is further increased. At first, such behavior seems counter-intuitive—one would expect PS’s performance to degrade at high write probabilities due to contention. The reason for this behavior is that, when both the page locality and the write probability are high, it is very likely that, if a page-level conflict occurs, then an object-level conflict will also occur. As a result, fine-grained locking provides only marginally higher concurrency than page-level locking. Moreover, any concurrency benefits are gained at the cost of extra messages for sometimes operating at the object level. The net effect in this experiment is that PS-AA and PS perform the same at high write probabilities.

The second workload to be considered is the UNIFORM workload, where all applications access data uniformly throughout the entire database. Due to the absence of per-application data skew, interapplication sharing of pages is more likely, therefore giving the alternatives that allow object-level sharing more to gain. This is shown in Fig. 8,

which shows the low page locality results for this workload (as mentioned earlier, we will not discuss PS-OA in the rest of our study here). In this figure, PS-AA does considerably better with respect to PS than in Fig. 6. The high page locality results for the UNIFORM workload are shown in Fig. 9. In contrast to Fig. 7, PS-AA manages to stay ahead of PS here, even at high write probabilities, because messages are cheaper here than in HOTCOLD and, as a result, the extra messages sent by PS-AA have minimal performance impact. Messages are cheaper in the UNIFORM workload because the lack of access locality causes the server disk to be more of a bottleneck; thus, the server’s CPU utilization is kept very low, which, in turn, reduces the cost of messages.

Finally, we turn to the HICON workload. In HICON, all applications have the same data access skew and the degree of data contention is very high as a consequence. Figs. 10 and 11 show the results for the HICON workload. In the low page locality case, the very high contention causes PS to perform much worse than PS-AA. PS-AA outperforms PS in the case of high page locality as well. However, its gains diminish as the write probability increases, and PS-AA


actually becomes slightly worse than PS at 0.5. The explanation is essentially the same as in the HOTCOLD case: As the write probability increases, PS-AA starts sending more messages than PS without providing significantly higher concurrency. This message count difference between PS-AA and PS is higher here than in UNIFORM or HOTCOLD, as PS-AA operates at the object level for a larger percentage of pages.

5.4 Summary and Comparison with Simulation Results

The client-server results presented above are similar to our earlier simulation results [5]. Some differences exist, however, due to differences in the various system costs and in parameter settings. The most notable discrepancies exist in the high-page-locality/high-write-probability cases (for all the workloads). Specifically, in those cases, PS-AA was in general worse than PS in our simulation results, whereas the opposite is true in our SHORE/SP2 results. There are two factors that contribute to this reordering of PS and PS-AA. First, messages in the SP2 are about three times faster than in the simulator. The second contributing factor is the value of the PerObjProc parameter, which has been increased here by a factor of six compared to its setting in [5]. Increasing the amount of application processing performed at the clients off-loads the server’s CPU somewhat, which, in turn, contributes further to making messages cheaper in our SHORE/SP2 system. As a result, PS-AA performs better (compared to PS) in the present study, since the extra messages it sends in the high-page-locality/high-write-probability cases do not harm its performance as much as in the simulator. For the same reason (i.e., cheaper messages), PS-OA was much closer to PS-AA in the current study than in the simulation study. Overall, the analyses of both the simulation and the SHORE/SP2 results are based on the same arguments, and the observed differences in the two studies are consistent with the differences in the workload parameters and the relevant system costs.

In summary, the general trends observed in both studies are the following. First, PS-AA is better than PS-OA. The performance differences between these two hybrid page-server alternatives are due to differences in their message counts. Specifically, PS-AA usually maintains cache consistency at the page level, with the percentage of pages that require object-level locking at any given instance always being relatively small. As a result, PS-AA avoids a number of messages that are required by PS-OA for requesting object-level write locks from the server. These message count differences increase as the write probability and/or the physical page locality is increased.

In general, PS-AA is better than PS as well. Compared to PS, PS-AA trades a higher message overhead for lower data contention. As mentioned above, PS-AA usually operates at the page level, thus keeping its message overhead close to that of PS. However, switching to object-level locking for the relatively small number of pages that require fine-grained sharing at any given instance can lead to a great reduction in data contention and, thus, to significant performance gains for PS-AA. These gains increase as the page locality decreases, due to the high degree of false sharing under low page localities. In contrast, under high page locality and high update probability, the degree of false sharing is low (that is, if a page-level conflict arises, an object-level conflict is very likely to arise as well), and PS-AA cannot offer higher concurrency than PS; in fact, PS-AA can become worse than PS in this case due to its higher message overhead.

5.5 Peer-Servers Experiments

A critical factor for the performance of multiple-server systems (based either on the peer-servers or the client-server model) is the method used for detecting and resolving distributed deadlocks. As mentioned in Section 3.3, SHORE currently uses a simple mechanism based on lock-wait timeouts. The performance of many deadlock-handling strategies, including timeouts, was investigated in [2] in the context of a centralized DBMS. It was shown there that timeouts are, in general, a rather poor choice for dealing with deadlocks, as they may allow a real deadlock to persist for a while (if the timeout interval is too long) or they may detect false deadlocks (if the timeout interval is too short). As a result, the performance of the timeout method is crucially dependent on the correct choice of the timeout interval. Furthermore, it was shown in [2] that the optimal timeout value varies significantly with the workload. A simple adaptive heuristic was suggested there for dynamically selecting a reasonable timeout value: the timeout value was set to the sum of the average waiting time per lock conflict plus the standard deviation of this waiting time. In our implementation, we have used the same heuristic, but we inflated the obtained timeout values by a factor of 1.5. The reason for using longer timeout intervals is that, in our system, deadlocks that involve data owned by a single server are actually detected by the owning server (see Section 4.2.1). Inflating the timeout value thus helps in reducing the number of false deadlock detections.
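The timeout heuristic just described reduces to a one-line computation over the observed lock-wait times. The following self-contained C++ sketch maintains the running statistics and applies the 1.5 inflation factor; the incremental (Welford-style) variance bookkeeping is our own implementation choice, not necessarily SHORE's.

    #include <cmath>

    // Running mean / standard deviation of lock-wait times, used to pick the
    // lock-wait timeout: 1.5 * (mean + standard deviation).
    class AdaptiveTimeout {
    public:
        void recordWait(double waitMsec) {
            ++n_;
            double delta = waitMsec - mean_;
            mean_ += delta / n_;
            m2_   += delta * (waitMsec - mean_);
        }

        double timeoutMsec() const {
            double stddev = (n_ > 1) ? std::sqrt(m2_ / (n_ - 1)) : 0.0;
            return 1.5 * (mean_ + stddev);
        }

    private:
        long   n_ = 0;
        double mean_ = 0.0, m2_ = 0.0;
    };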

Fig. 12. HOTCOLD, Peer-Servers: transSize = 90, pageLocality = 4 (avg).

Fig. 13. HOTCOLD, Peer-Servers: transSize = 30, pageLocality = 12 (avg).

With these remarks about timeouts in mind, we now proceed to present and analyze the peer-servers results. As before, we start with the HOTCOLD workload. In this workload, the database is distributed among the participating peer servers so that each peer server owns the hot area of its associated local application. In addition, the portion of the database that is cold with respect to all of the applications is partitioned into 10 equal-sized pieces, and each peer server is given one piece.

The results of the low page locality case are shown in Fig. 12. The figure shows, in dashed lines, the results of the corresponding client-server experiment as well. First, we observe that PS is much worse (compared to PS-AA) in the peer-servers case than in the client-server case. This is mainly due to timeouts—PS suffers higher data contention than PS-AA, so its performance is influenced much more heavily by the inadequacy of timeouts in dealing with deadlocks. Notice that the problems incurred by timeouts (i.e., late and/or false deadlock detection) can also occur (though to a lesser degree) even with more advanced distributed deadlock detection algorithms, and such algorithms involve additional overhead. Therefore, even if another strategy were used instead of timeouts, PS would still suffer more than PS-AA due to distributed deadlocks.

Finally, despite the poor performance of the timeout method in cases of high contention, timeouts are used in practice by a number of existing distributed systems due to their simplicity.

In addition to the differences between PS and PS-AA, it is also interesting to observe the trade-offs between the two architectural alternatives, i.e., the client-server and peer-servers models. Given the partitioning of the database described above, the peer-servers approach is able to save a large number of disk I/Os and messages. Disk I/Os are eliminated almost completely, as each peer server is able to keep its own portion of the database cached in its buffer pool.7 This is so because the pages owned by a peer server A are accessed more often than other remote pages that happen to appear in A’s cache and are, thus, less likely victims for replacement. This is obviously true for A’s hot pages, as such pages are accessed very often by A’s local application. It is also true for A’s cold pages, as such pages are accessed by both A’s local application and by remote peer servers that act as clients with respect to A. Peer servers also require fewer messages; for example, no remote write-lock requests are needed when a peer server updates its own data.

Despite these savings, we see in Fig. 12 that peer-servers PS-AA performs worse than client-server PS-AA at low write probabilities. In this region, the client-server approach performs well, as the number of client-server interactions is small. As a result, most of the processing takes place at the clients, where each application has the associated client CPU dedicated to it. In contrast, each peer server must time-share its CPU between its local application and requests arriving from remote peer servers. Thus, although fewer messages are sent in the peer-servers case, they are more expensive than client-server messages due to higher CPU utilizations at the peer servers. As the write probability increases, however, clients bring more work to the server (e.g., more messages and more I/Os for resending invalidated pages to clients and for installing client updates). As a result, the client-server system suffers

due to resource contention at the server, and peer servers have more to gain in terms of messages and I/Os. Naturally, then, peer-servers PS-AA performs better than client-server PS-AA at higher write probabilities. The difference between the two configurations is maximized at 0.2 write probability; the difference decreases somewhat beyond this point because peer-servers PS-AA starts suffering due to the poor performance of timeouts. Timeouts are also the reason why peer-servers PS performs worse than client-server PS for the entire range of write probabilities.

We now turn to the case of HOTCOLD with high page locality (Fig. 13). As explained in Section 5.3, high page locality leads to lower data contention, especially for PS. As a result, in Fig. 13, PS performs almost as well as PS-AA, even in the peer-servers case. High page locality also leads to lower overhead (in terms of both messages and I/Os) since fewer pages are accessed per transaction. Consequently, the server in the client-server system is utilized less here than in Fig. 12, and the peer-servers approach enjoys smaller relative benefits here. In fact, as shown in Fig. 13, the peer-servers approach is much worse than the client-server approach under high page locality due to the CPU contention at the peer servers discussed in the last paragraph.

In the UNIFORM workload, the whole database is partitioned into 10 equal-sized pieces, with each piece being assigned to one of the peer servers. Fig. 14 shows the results for the low page locality UNIFORM workload. We see that the peer-servers approach proves very beneficial for PS-AA here, as it manages to eliminate the disk bottleneck that exists in the client-server case even at low write probabilities. As in the case of HOTCOLD, this is mainly due to the partitioning of the database, which enables each peer server to cache much of its own portion of the database in its local buffer pool. Peer-servers PS also performs reasonably well at very low write probabilities. However, it degrades dramatically as the write probability increases due to the poor behavior of timeouts. (UNIFORM is a higher contention workload than HOTCOLD, so PS suffers more here than in Fig. 12.) Due to this performance collapse, we did not try to run peer-servers PS at write probabilities higher than 0.1.

7. As a result, the I/O benefits seen for the peer-servers approach here arise mainly from its superior use of memory, not from the presence of extra disk arms.


Fig. 14. UNIFORM, Peer-Servers: transSize = 90, pageLocality = 4 (avg).


Fig. 15. UNIFORM, Peer-Servers: transSize = 30, pageLocality = 12 (avg).

Finally, Fig. 15 shows the results for the high page locality UNIFORM workload. The explanation of Fig. 15 is essentially the same as that of Fig. 13.

6 CONCLUSIONS

In an earlier paper [5], we presented a family of algorithms to enable fine-grained cache consistency to be supported in the context of a page server OODBMS. In addition, that paper presented simulation results comparing the new algorithms to the basic page server and object server architectures that many existing systems employ. Our simulation results favored an approach called PS-AA that was able to dynamically adjust the granularity of both locking and callbacks. As a result, we decided to implement the proposed adaptive cache consistency algorithm in the context of the SHORE persistent object system. In this paper, we described the outcome of that implementation effort. We explained the ways that we generalized our algorithm for use in SHORE, discussed several of the stickier aspects of our implementation, and presented empirical performance results that were obtained from our SHORE implementation.

In order to implement our algorithm, several extensions were required. One of the required extensions was in the architectural dimension; SHORE is based on a peer-servers architecture, where each SHORE “client” machine hosts a multithreaded server process to manage a shared local cache and any locally stored data. Another required extension was support for the locking model provided by SHORE; in SHORE, in addition to the implicit locks obtained when objects are accessed, applications are permitted to explicitly lock higher-level granules (e.g., volumes, files, or pages). After explaining the impact of these extensions on our algorithm, we presented results from a series of performance experiments. During these experiments, we ran a number of workloads using SHORE in both client-server and peer-servers configurations. Our results confirmed the practicality and efficiency of our adaptive locking scheme. In all of the experiments, PS-AA, which is adaptive with respect to both locking and callbacks, was able to surpass, or at least track, the performance of both the basic page server (PS), which performs locking and callbacks at the page level, and our other algorithm (PS-OA), which employs adaptive callbacks but always locks at the object level.

ACKNOWLEDGMENTS

This research is sponsored by the U.S. Advanced Research Projects Agency, ARPA order number 018 (formerly 8230), monitored by the U.S. Army Research Laboratory under contract DAAB07-91-C-Q518. This work was performed while Markos Zaharioudakis was a graduate student in the Computer Science Department at the University of Wisconsin-Madison.

REFERENCES

[1] A. Adya, R. Gruber, B. Liskov, and U. Maheshwari, “Efficient Optimistic Concurrency Control Using Loosely Synchronized Clocks,” Proc. ACM-SIGMOD Conf., San Jose, Calif., May 1995.
[2] R. Agrawal, M. Carey, and L. McVoy, “The Performance of Alternative Strategies for Dealing with Deadlocks in Database Management Systems,” IEEE Trans. Software Eng., vol. 13, no. 12, Dec. 1987.
[3] M. Astrahan et al., “System R: Relational Approach to Database Management,” ACM Trans. Database Systems, vol. 1, no. 2, 1976.
[4] M. Carey, M. Franklin, M. Livny, and E. Shekita, “Data Caching Tradeoffs in Client-Server DBMS Architectures,” Proc. ACM-SIGMOD Conf., Denver, Colo., June 1991.
[5] M. Carey, M. Franklin, and M. Zaharioudakis, “Fine-Grained Sharing in a Page Server OODBMS,” Proc. ACM-SIGMOD Conf., Minneapolis, Minn., May 1994.
[6] M. Carey et al., “Shoring Up Persistent Applications,” Proc. ACM-SIGMOD Conf., Minneapolis, Minn., May 1994.
[7] R. Cattell, Object Data Management. Reading, Mass.: Addison-Wesley, 1991.
[8] D. DeWitt, P. Futtersack, D. Maier, and F. Velez, “A Study of Three Alternative Workstation-Server Architectures for Object-Oriented Database Systems,” Proc. 16th VLDB Conf., Brisbane, Australia, Aug. 1990.
[9] M. Carey, D. DeWitt, J. Richardson, and E. Shekita, “Storage Management for Objects in EXODUS,” Object-Oriented Concepts, Databases, and Applications. Addison-Wesley, 1989.
[10] EXODUS Project Group, “EXODUS Storage Manager Architectural Overview,” EXODUS Project Document, Computer Sciences Dept., Univ. of Wisconsin-Madison (available by ftp from ftp.cs.wisc.edu), 1993.


[11] M. Franklin and M. Carey, “Client-Server Caching Revisited,” Proc. Int’l Workshop Distributed Object Management, Edmonton, Canada, Aug. 1992.
[12] J. Gray, “Notes on Database Operating Systems,” Operating Systems: An Advanced Course. Springer-Verlag, 1979.
[13] J. Howard et al., “Scale and Performance in a Distributed File System,” ACM Trans. Computer Systems, vol. 6, no. 1, Feb. 1988.
[14] A. Joshi, “Adaptive Locking Strategies in a Multi-Node Data Sharing System,” Proc. 17th VLDB Conf., Barcelona, Spain, Sept. 1991.
[15] A. Kemper and D. Kossmann, “Dual-Buffering Strategies in Object Bases,” Proc. 20th VLDB Conf., Santiago, Chile, 1994.
[16] C. Lamb, G. Landis, J. Orenstein, and D. Weinreb, “The ObjectStore Database System,” Comm. ACM, vol. 34, no. 10, Oct. 1991.
[17] T. Lehman and M. Carey, “A Concurrency Control Algorithm for Memory-Resident Database Systems,” Proc. Third Int’l Conf. Foundations of Data Organization, Paris, June 1989.
[18] K. Li and P. Hudak, “Memory Coherence in Shared Virtual Memory Systems,” ACM Trans. Computer Systems, vol. 7, no. 4, Nov. 1989.
[19] C. Mohan and I. Narang, “Recovery and Coherency-Control Protocols for Fast Intersystem Page Transfer and Fine-Granularity Locking in a Shared Disks Transaction Environment,” Proc. 17th VLDB Conf., Barcelona, Spain, Sept. 1991.
[20] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” ACM Trans. Database Systems, vol. 17, no. 1, Mar. 1992.
[21] J. O’Toole and L. Shrira, “Opportunistic Log: Efficient Installation Reads in a Reliable Storage Server,” Proc. First USENIX Symp. Operating Systems Design and Implementation, Monterey, Calif., 1994.
[22] J. O’Toole and L. Shrira, “Hybrid Caching for Large Scale Object Systems,” Proc. Sixth Workshop Persistent Object Systems, Tarascon, France, Sept. 1994.
[23] Y. Tay, N. Goodman, and R. Suri, “Locking Performance in Centralized Databases,” ACM Trans. Database Systems, vol. 10, no. 4, Dec. 1985.
[24] Y. Wang and L. Rowe, “Cache Consistency and Concurrency Control in a Client/Server DBMS Architecture,” Proc. ACM-SIGMOD Conf., Denver, Colo., June 1991.
[25] S. White and D. DeWitt, “Implementing Crash Recovery in QuickStore: A Performance Study,” Proc. ACM-SIGMOD Conf., San Jose, Calif., May 1995.
[26] W. Wilkinson and M. Neimat, “Maintaining Consistency of Client Cached Data,” Proc. 16th VLDB Conf., Brisbane, Australia, Aug. 1990.
[27] M. Zaharioudakis and M. Carey, “Adaptive, Fine-Grained Sharing in a Client-Server OODBMS: A Callback-Based Approach,” ACM Trans. Database Systems, to appear.


Markos Zaharioudakis received the BS degree in electrical and computer engineering from the National Technical University of Athens in 1991. He received the MS and PhD degrees in computer science from the University of Wisconsin, Madison, in 1993 and 1997, respectively. In 1997, he joined the IBM Almaden Research Center as a research staff member. Dr. Zaharioudakis is interested in issues related to database storage and transaction management, including concurrency control, cache management, and crash recovery for parallel/client-server databases. During his PhD at Wisconsin, he worked on the SHORE scalable heterogeneous object repository project, where he designed, evaluated, and implemented novel algorithms for highly concurrent cache consistency. His current work at IBM focuses on transactional issues for databases of semi-structured data. Dr. Zaharioudakis is a member of the ACM.

Michael J. Carey received the BS degree in electrical engineering and mathematics and the MS degree in electrical engineering (computer engineering) from Carnegie-Mellon University in 1979 and 1981, respectively. He received the PhD degree in computer science from the University of California, Berkeley, in 1983. He spent the next 12 years as a member of the Computer Sciences Department faculty at the University of Wisconsin-Madison. In 1995, he joined the IBM Almaden Research Center as a research staff member, where he currently manages a small group working on objectrelational database technology. Dr. Carey’s interests include different ways of combining object and database technologies, parallel and distributed database systems, and database system performance issues. At Wisconsin, he was a coprincipal investigator of the EXODUS extensible DBMS project and the SHORE scalable heterogeneous object repository project. During sabbatical visits to IBM Almaden in 1989 and 1993-1994, Dr. Carey worked on the Starburst extensible DBMS project and the Garlic heterogeneous multimedia information system project. His current work at IBM focuses on object-relational database system technology; his group is adding various new object features to the DB2 Universal Database (UDB) platform and studying ways of mapping these extensions naturally onto constructs provided by current object languages. Dr. Carey received an IBM Faculty Development Award in 1984, an Incentives for Excellence Award from Digital Equipment Corporation in 1986, and a U.S. National Science Foundation Presidential Young Investigator Award in 1987. He is a member of the IEEE and ACM, an associate editor of ACM Transactions on Database Systems and the VLDB Journal, served as secretary/treasurer for ACM SIGMOD during 1989-1997, and is a trustee of the VLDB Endowment.
