A distributed commit protocol for a multicomputer system





IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 5, MAY 1990

[Figure: timing diagram over cycles 0-4; actions initiated include A+B; ACCESS ROM; 2F(X)-B; and 2F(X)-B-2F(Y).]
Fig. 1. A timing diagram for a table-assisted multiplication approach.


The second caveat is that Y as calculated according to (2) may be negative. We can proceed with the computation by deleting the negative sign of Y; this amounts to evaluating

Y = (B - A)/2.

The product AB is now obtained through the following:

AB = 2F(X) - 2F(Y) - A.

[Figure: includes a table of the number of logic levels required.]
Fig. 2. A timing diagram for a table-assisted multiplication: a parallel approach requiring eight levels of ECL logic gates.

The product expression AB in (10) can be reduced to

AB = 2[F(X1) - F(Y1) + X_F(X1 - Y1)] - B.    (12)

We now evaluate the expression X1 - Y1:

X1 - Y1 = X - X_F - Y + Y_F = X - Y    (13)

since X_F = Y_F; hence X1 - Y1 = B. Using the identities in (11) and (13), we obtain from (12) the following:

AB = 2F(X1) - 2F(Y1).    (16)

Equations (4), (14), and (16) can be combined into one, given below:

AB = 2F(X1) - 2F(Y1) - (1 - f)P_Y B - (1 - f)(1 - P_Y)A    (17)

where f = 0 if Y is an integer (the least significant bits of A and B are identical); otherwise, f = 1. And P_Y = 1 if Y is positive; otherwise, P_Y = 0.
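The combined expression (17) can be checked mechanically. The following Python sketch is ours, not the authors'; in particular, the table function F(Z) = Z(Z+1)/2 is an assumption inferred from the looked-up values F(196) = 19306 and F(36) = 666 quoted in the worked example:

```python
def F(z: int) -> int:
    # Assumed table contents: F(Z) = Z(Z+1)/2, consistent with the
    # example's lookups F(196) = 19306 and F(36) = 666.
    return z * (z + 1) // 2

def table_multiply(a: int, b: int) -> int:
    x = (a + b) / 2               # X may contain the fraction 0.5
    y = (a - b) / 2               # Y may be negative and/or fractional
    f = 0 if y == int(y) else 1   # f = 0 iff Y is an integer
    p_y = 1 if y > 0 else 0       # P_Y = 1 iff Y is positive
    y = abs(y)                    # drop the sign of Y (the second caveat)
    x1, y1 = int(x), int(y)       # X1, Y1: integer parts of X and Y
    # Combined product expression (17)
    return 2 * F(x1) - 2 * F(y1) - (1 - f) * p_y * b - (1 - f) * (1 - p_y) * a

assert table_multiply(233, 160) == 37280   # the worked example below
```

Exhaustively testing small operands confirms the case analysis behind (4), (14), and (16).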

V. CONCLUSION

A novel approach to implementing multiplication with table lookups has been presented. To compute a product, two table lookups and four additions are required; some of these operations can be carried out concurrently. Two implementation schemes, among many, are presented. The size of the multiplication table is reduced from 2^(2b) to 2^b, where b is the length of the operands. Owing to the considerable reduction in required table size, multiplication with table lookups can be cost-effectively extended to wider operands.

REFERENCES

[1] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1979, pp. 201-206.

AB = 2F(X1) - 2F(Y1).    (14)

To illustrate this point, let A = 233 and B = 160. Then,

X = (233 + 160)/2 = 196.5.

From (15), we get

X1 = 196,  X_F = 0.5.

Similarly, we have

Y = (233 - 160)/2 = 36.5,

Y1 = 36,  Y_F = 0.5.

Through table lookups, we obtain

F(X1) = 19306,  F(Y1) = 666.

The product AB is finally calculated:

AB = (2)(19306) - (2)(666) = 37280.

The product expression in (14) indicates that, when X and Y contain fractions, the fractions can be ignored; the product is obtained by evaluating 2F(X1) - 2F(Y1).

A Distributed Commit Protocol for a Multicomputer System

PAOLO ANCILOTTI, BEATRICE LAZZERINI, COSIMO ANTONIO PRETE, AND MAURIZIO SACCHI

Abstract-The aim of this paper is to describe a distributed commit protocol suitable for multicomputer systems. We refer to a general programming environment which provides transactions as a programming tool. This environment is expected to be more dynamic than a database management system; in particular, we do not know how many and which processes will participate in a specific transaction. Therefore, a model of completely distributed transactions, without any hierarchical structure among the participant processes or any centralized locus of control, is proposed.

Index Terms-Atomic transactions, commit protocol, decentralization of control, distributed systems, fault-tolerance, recovery mechanism, stable memory.

Manuscript received August 5, 1987; revised October 14, 1988. This work was supported in part by the Ministero della Pubblica Istruzione, Italy. P. Ancilotti is with the Scuola Superiore di Studi Universitari e di Perfezionamento S. Anna, 56100 Pisa, Italy. B. Lazzerini and C. A. Prete are with the Istituto di Elettronica e Telecomunicazioni, Facoltà di Ingegneria, Università di Pisa, 56126 Pisa, Italy. M. Sacchi is with Ing. C. Olivetti & Co. SpA DISSE, 56100 Pisa, Italy. IEEE Log Number 8932916.

0018-9340/90/0500-0718$01.00 © 1990 IEEE

I. INTRODUCTION

A major problem in a distributed data management system is how to achieve robust execution of distributed programs in spite of failures [4], [10]. A common approach to this problem is to define an atomic transaction [15], [21], [28] as a unit that preserves consistency in spite of failures (e.g., processor crashes) and parallel executions. An atomic transaction is an all-or-nothing computation: either it installs a complete collection of changes in data or it installs no change, even in the presence of failures. Furthermore, an atomic transaction appears to take place indivisibly, in the sense that the partial execution of a transaction is not visible to any concurrently executing computation. This may be obtained by means of a locking mechanism [1], [5], [9], [13], [14]. If a transaction is performed successfully, it is called committed; if a transaction is started but fails, it is called aborted. The all-or-nothing property of an atomic transaction is obtained by using a recovery mechanism [14], [20] in such a way that all the actions of the transaction can be undone if it needs to be aborted following a hardware or software failure. The recovery mechanism makes use of a stable (not volatile) memory [20], whose contents survive processor failures (crashes). In the case of multiprocess transactions, a complex commit protocol is needed in order to guarantee that the participant processes either all commit, or all abort, the transaction. In a multicomputer system, participant processes can either be local to a single computer or spread over several computers. Therefore, a commit protocol must also take partial crashes (a single computer crash) and failures in the communication network into account. Several commit protocols have been proposed in the literature [8], [22]-[24], [27]. Most of them are different versions of the well-known two-phase commit protocol.
In this paper, an efficient and completely distributed protocol for managing multicomputer transactions is presented. In the design of this protocol, the main aim was to obtain distribution of control and exploitation of maximum parallelism, at the cost of minimal overhead in terms of both message exchanges and stable writes. Indeed, decentralization of control [17], [18] supports both modularity and fault treatment in a multicomputer system. In particular, the absence of central coordination enhances system robustness, by eliminating single points of catastrophic failure, and enhances efficiency and parallelism, by eliminating a potential bottleneck in the system. We refer to a general programming environment which provides transactions as a programming tool. We do not know how many and which processes will participate in a specific transaction. The commit protocol is completely independent of the mechanism for process creation, and each participant P communicates to the protocol the identities of the other participants to which it is connected only when P itself is ready for committing. Therefore, we propose a model of completely distributed transactions, without any hierarchical structure among the participant processes. In particular, a participant knows only those participants to which it is directly connected. No specific assumption is made about the method used to achieve serializability of accesses to the same resource by different transactions. Furthermore, no particular recovery mechanism is involved. Finally, the presence of a deadlock detection protocol [6], [11], [12], [16], [25], [30] is assumed, which solves deadlock situations by causing the abortion of the appropriate transaction.

II. MULTICOMPUTER TRANSACTIONS

We consider an environment consisting of one or more virtual sites [2], [3], each encapsulating a specific resource of the system (e.g., a portion of the database), and all connected by a virtual communication network. Requests for operations to be executed on the encapsulated resource can be received by the relevant virtual site, which creates processes to satisfy them. The actual system consists of one or more computers connected by a communication network. Each computer has one or more processors and some memory. Two kinds of memory are provided: stable memory and volatile memory; the contents of volatile memory are lost following a crash, whereas the contents of stable memory survive crashes. Each resource is represented by a combination of volatile and stable storage. The stable storage maintains a consistent state of the resource. Each virtual site is implemented on the computer which actually contains its resource. At any instant, we can have one or more virtual sites per computer and one or more processes per virtual site. Furthermore, we assume a specific virtual site, called an applicative site, which exists on every computer. The applicative site manages all user applications. In this environment, let us define a multicomputer transaction as an atomic activity started by a user application process (applicative participant) and carried out by cooperation [31] between processes (service participants) residing on the same or on different virtual sites. Templates for applicative participants and service participants might be the following:

Process AP;
Begin
  Begin-transaction;
  get-user-request;
  (request service);
  if acceptance-test = OK then Ready-to-commit
  else Abort-transaction;
  if success then put-message("OK")
  else put-message("ABORT")
end.

(a) Applicative participant.

Process SP;
Begin
  Enter-transaction;
  (request service);
  if acceptance-test = OK then Ready-to-commit
  else Abort-transaction
end.

(b) Service participant.

In the following sections, we shall use the term participant to refer to either an applicative participant or a service participant, and we shall simply say site to mean virtual site. The (request service) part of a participant performs a sequence of one or more operations on the local resource and may contain calls to the operations defined on some other remote resources. Each process, when terminating its service, checks both that errors have not been made and that abnormal conditions (e.g., an overflow exception) have not occurred during its execution. If everything is all right, it calls the system primitive Ready-to-commit, which saves, in stable storage, a copy of the updates the process has made to the resource. Otherwise, the process terminates by calling the primitive Abort-transaction, without modifying the resource. A transaction is created when an applicative participant calls the Begin-transaction primitive. A service participant joins the transaction by means of the Enter-transaction primitive. A transaction terminates either when all its participants have executed the Ready-to-commit primitive (commitment phase) or when at least one participant has executed the Abort-transaction primitive (abort

Fig. 1. Tables maintained by each virtual site for managing transactions. [Figure: the VTT and the STT each record the transaction status and the local participants.]

phase). A transaction can involve resources at any site of the system. When a participant needs other resources, it prepares a request message for each of them. Such a message contains the transaction identifier (which is assigned by the site on which the transaction has been created), the sender process name, the names of the sending and receiving sites, and the description of the pertinent resource operation. When a request message has been satisfied, an acknowledgment message, containing the target process identifier and, possibly, the results, is sent to the requesting process. The request is included in a remote procedure call mechanism with exactly-once semantics [29], [32]; in case of an unsuccessful remote call, the requesting process aborts the transaction.

III. TRANSACTION MANAGEMENT PROTOCOL

In order to manage multicomputer transactions, every site maintains a volatile transaction table (VTT) and a stable transaction table (STT) for each transaction in which it is involved (Fig. 1). The VTT is stored in volatile memory, whereas the STT is stored in stable memory. The VTT contains the status of the transaction, and information about the participants which the virtual site knows are involved in the transaction. For each local participant, the VTT contains the triple {name, status, connected participants}, i.e., the name of the participant, its status, and the list of the names of the participants directly connected to this participant. The name of a participant is the pair {creating site identifier, local process identifier}. For each remote participant, the VTT contains the pair {name, status}. This allows a site to know the names of other sites involved in a transaction. Two participants P and Q are directly connected if either is the parent of the other one; a participant P is the parent of a participant Q if P is the sender of the request message that has caused the creation of Q. In this case, Q is the child of P.
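Because every service participant is created by a request message from some other participant, the parent/child links make the participants of a transaction a connected graph, and names can spread along "directly connected" edges. A small illustrative sketch (representation and names are ours):

```python
# Our illustration: participant names spread along parent/child edges,
# so every participant can be reached from any other one.
def reachable(start, connected):
    """All participants reachable from `start` via direct connections."""
    known, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in connected.get(p, []):
            if q not in known:
                known.add(q)
                frontier.append(q)
    return known

# P created Q and R; Q created S: every participant reaches every other.
links = {"P": ["Q", "R"], "Q": ["P", "S"], "R": ["P"], "S": ["Q"]}
assert reachable("S", links) == {"P", "Q", "R", "S"}
```

This connectivity is what the informal correctness proof of Section III-B relies on.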
The STT contains the same information as the VTT, except that the STT considers only the local processes participating in the transaction.
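As an illustration only (the paper specifies the tables' contents, not a representation), the two tables might be modeled as follows; all type and field names are ours:

```python
from dataclasses import dataclass, field

# Hypothetical rendering of the tables of Fig. 1; names are illustrative.
@dataclass
class LocalEntry:
    name: tuple                 # {creating site identifier, local process identifier}
    status: str                 # "active", "ready", ...
    connected: list = field(default_factory=list)  # directly connected participants

@dataclass
class VTT:
    tx_status: str                              # status of the transaction
    local: dict = field(default_factory=dict)   # name -> LocalEntry (triple)
    remote: dict = field(default_factory=dict)  # name -> status (pair)

@dataclass
class STT:
    # Same information as the VTT, restricted to the local participants.
    tx_status: str
    local: dict = field(default_factory=dict)
```

For example, a site that knows one local participant would hold `VTT(tx_status="active", local={name: LocalEntry(name, "active")})` and an STT with the same entry.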

A. Description of the Protocol

Let us now present the commit protocol in the absence-of-failures case by describing the actions performed by a site as a consequence of either the execution of a transaction primitive by a local participant or the reception of a protocol message sent by another site.

Transaction Primitives: The following primitives are provided to manage transactions: Begin-transaction, Enter-transaction, Ready-to-commit, and Abort-transaction. All the primitives, except the second, can be called by applicative participants, whereas service participants can call all the primitives except the first. When an applicative participant calls the Begin-transaction primitive, a new transaction identifier is generated and two tables (the VTT and STT) are allocated. The status of the transaction and the status of the applicative process are set to active in both tables. As a consequence of the reception of a request message by the target site, a participant P is created at this site. This process joins the transaction by executing the Enter-transaction primitive, which

creates a new entry {P, active, -} in both the VTT and STT. Note that the tables are created at the creation of the first local participant. A participant enters the ready state when concluding the execution of the Ready-to-commit primitive. This primitive has, as input parameters, the transaction identifier, the ready process P, and the list of the participants directly connected to P.¹ The Ready-to-commit primitive 1) saves (in stable storage) an image of the updates made by the process to the resource; 2) sets (in the VTT) the status of P to ready and updates (in the VTT) the list of participants connected to P; and 3) creates a VTT entry {Pi, active} for each remote participant Pi connected to P and not previously known. Afterwards, if all the participants contained in the VTT are ready, the Ready-to-commit primitive 1) performs the local commit of the transaction T, that is, a) it sets the transaction status (in both tables) to committed, and b) it commits T, that is, it actually performs the updates to the resource; and 2) sends a READY message to all involved sites. The READY message contains the transaction name, the name of the ready process, and the list of participants directly connected to this process. Otherwise (if at least one participant in the VTT is still active), the primitive 1) copies the VTT entry pertinent to P into the STT; 2) sends a READY message to all sites it knows are involved in T; and 3) sends as many READY messages as the number of local ready participants to each involved site not previously known. Let us note that, if no Ready-to-commit primitive has been executed in the system, each virtual site has only a partial knowledge of the processes involved in a transaction: it knows only its local participants. A participant enters the abort state when executing the Abort-transaction primitive.
When a participant enters the abort state, the relative site 1) sets the status of T to aborted; 2) aborts T, that is, it forgets the updates which should be made to the resource; 3) sends an ABORT message, containing the transaction name, to all virtual sites in the system;² and 4) forgets T, that is, it destroys the VTT and STT associated with T. It should be noted that whenever a site aborts a transaction, it checks if it contains any active participant. In the affirmative case, it kills such processes, that is, it forces them to terminate.

Protocol Messages: Two messages are exchanged among virtual sites to manage multisite transactions: READY and ABORT. Two more messages (COMMIT and TIME-OUT) are provided to treat failures and will be introduced in the next section. When an involved site S receives a READY message, it 1) updates (in the VTT) the state of the ready participant to ready, possibly after creating a new VTT entry for it, and 2) creates a new VTT entry for each participant named in the message and not previously known to S; the state of such participants is set to active. Afterwards, if S realizes that all the participants contained in the VTT of transaction T are ready, it performs the local commit of T. Otherwise, it sends as many READY messages as the number of local ready participants to each involved site not previously known. In this way, complete information about all the processes participating in a specific transaction T is spread throughout the system. This allows each of the sites involved in T to know all the processes involved in T and their states. On receiving an ABORT message, an involved site locally aborts T, that is, it aborts T and forgets T.
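The local commit rule above (commit as soon as every participant recorded in the VTT is ready) can be condensed into a few lines. This is our paraphrase, not the authors' code, with the VTT reduced to a name-to-status map:

```python
def all_ready(vtt):
    """True when every participant recorded in the VTT is ready."""
    return all(status == "ready" for status in vtt.values())

def on_ready(vtt, participant, connected):
    """Handle a local Ready-to-commit or an incoming READY message."""
    vtt[participant] = "ready"
    for p in connected:              # newly learned participants start as active
        vtt.setdefault(p, "active")
    return "commit" if all_ready(vtt) else "wait"

vtt = {}
assert on_ready(vtt, "P", ["Q"]) == "wait"    # Q is now known, still active
assert on_ready(vtt, "Q", ["P"]) == "commit"  # every known participant is ready
```

The second call commits locally because the READY for Q carries no participant the site does not already know.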

¹ Each participant keeps the names of its parent and its children.

² An ABORT message is sent to all sites because the sending site possibly knows only a subset of all involved sites.



B. An Informal Proof of Correctness

Let us show, in an informal way, that if a site commits a transaction T, then all the sites involved commit T. Let us suppose that all the entries of the VTT of a given site S contain ready participants and that there exists a process Pj participating in T and not included in this VTT. This situation could lead to an inconsistency: S commits T anyway, whereas all the other sites involved decide either to commit or to abort T depending on the decision of Pj. But, if Pj participates in T, it is surely connected to at least one other participant Pk (either parent or child). If Pk is present in the VTT of S, then either Pk has called the Ready-to-commit primitive specifying Pj as a connected process, or S has received a READY message, pertinent to Pk, containing Pj as a connected participant. But we have supposed that S does not know Pj; therefore, Pk cannot be ready. On the other hand, Pk cannot be present in the VTT of site S as an active participant, since we supposed that all the participants contained in the VTT are ready. This means that Pk is also not known to S. In the same way, we could conclude that no process participating in T could be contained in the VTT relevant to T on site S. That is, obviously, absurd. Therefore, when all the participants contained in the VTT of a given site are ready, these processes are all of the participants: S can commit because all the other sites will commit, sooner or later. Let us note that, since the names of the involved sites are obtained from the names of the participants, when the VTT of a site S involved in T contains all ready participants, S knows all the sites involved in T.

IV. FAILURE TREATMENT

The protocol guarantees that the involved sites either all commit or all abort a transaction, even in the case of communication media and computer failures. Let us first analyze the case of a site crash.
A site crash may be caused either by a processor failure or by a hardware error detected by a diagnostic process periodically running on the virtual site. When a site crashes, all the intermediate states (not recorded in stable storage) of transactions in execution at that site disappear. When the site restarts after a crash, a restart process runs, which is able to restore, from the STT's in stable storage, the state of all the local participants relevant to unterminated transactions. From this point of view, the active state of a transaction can be distinguished in two substates: in-progress, if there is at least one active local participant, and waiting, when all the local participants are ready. The restart process behaves as follows: 1) for each in-progress transaction, it locally aborts the transaction and sends an ABORT message to all sites in the system; 2) for each waiting transaction, it creates a new VTT: it a) copies the STT into the VTT, and b) creates a new VTT entry {Pj, active} for each remote participant Pj connected to a local participant. Afterwards, the restart process activates the time-out mechanism and sends a TIME-OUT message, containing the transaction name T and the names of the active remote participants (known to the site), to all the known involved sites. For each transaction T, a site keeps a timer, which is started as soon as T enters the waiting state. The timer is restarted whenever a READY message pertinent to T is received. When the value of the timer is equal to a predefined value, a time-out occurs and a TIME-OUT message is sent. On receiving a TIME-OUT message, a site either 1) (re-)sends a COMMIT message, containing the transaction name, if the transaction has been committed; or 2) (re-)sends an ABORT message if the transaction has been forgotten; or 3) (re-)sends a READY message for each of the ready local participants specified in the TIME-OUT message.
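The three-way reply to a TIME-OUT message might be sketched as follows; the Site record and its fields are illustrative names of ours:

```python
# A sketch of the reply logic on receiving a TIME-OUT message.
# The Site class and its fields are our own illustrative names.
class Site:
    def __init__(self):
        self.committed = set()  # committed transactions (status kept stably)
        self.tables = {}        # tx -> names of local ready participants

    def on_time_out(self, tx):
        if tx in self.committed:
            return [("COMMIT", tx)]          # (re-)send COMMIT
        if tx not in self.tables:            # forgotten, hence aborted
            return [("ABORT", tx)]           # (re-)send ABORT
        # otherwise (re-)send a READY per local ready participant
        return [("READY", tx, p) for p in self.tables[tx]]

s = Site()
s.committed.add("T1")
s.tables["T2"] = ["P", "Q"]
assert s.on_time_out("T1") == [("COMMIT", "T1")]
assert s.on_time_out("T3") == [("ABORT", "T3")]
assert s.on_time_out("T2") == [("READY", "T2", "P"), ("READY", "T2", "Q")]
```

The "forgotten implies aborted" branch works because, as stated earlier, a site destroys its tables for a transaction exactly when it aborts it.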
Let us now consider the case in which a message is lost owing to a temporary failure of the communication media. In order to avoid endless waiting by a site which has lost an ABORT or a READY message³ following a communication media failure, the protocol makes use of the time-out mechanism. It should be noted that the loss of any message (a TIME-OUT sent by the restarted site, or a READY, COMMIT, or ABORT sent by another site) during the restart phase of a site does not cause endless waiting, thanks to the fact that the restarted site periodically resends TIME-OUT messages until it commits or aborts the transaction. Furthermore, the information contained in the STT's allows a restarted site to reply correctly to a TIME-OUT message. Finally, we wish to point out that hardware fault-tolerance techniques, such as hardware duplication, must be adopted in order to avoid the abovementioned timer becoming a point of catastrophic failure in the system.

³ As previously stated, a site involved in T forgets T when it aborts T.

V. TOLERATED AND UNTOLERATED FAILURES

Generally, no system can survive all possible failures; it can only tolerate certain sets of failures. The aim of this section is to explain which failures are tolerated by our protocol and which are not. The following failures are untolerated: 1) a faulty processor that writes into stable storage or sends messages; 2) permanent hardware failures relevant to either the computers or the communication media, for example, a computer that stays down or is disconnected from the network forever; and 3) malicious modification of messages, such as a good message that is transformed into another good message. Two further assumptions are made: 1) no good message is spontaneously created by the communication network; and 2) if a message is sent repeatedly, at least one good copy will be received. This latter assumption stems from the fact that a possible communication network failure is never permanent. The following situations are tolerated: 1) temporary failures of the computers; 2) a protocol message which is lost; 3) a protocol message that is duplicated and arrives more than once; and 4) a protocol message which is arbitrarily delayed. These situations are tolerable because 1) upon restart after a crash, the restart process obtains the correct information about all active transactions in which the relative site is involved by using the STT's and by means of the proper exchange of messages among virtual sites, so that, after the restart phase, the site can correctly execute the commit protocol; 2) as shown in Section III-B, the loss of protocol messages cannot cause an inconsistent behavior of a virtual site: the time-out mechanism allows us to manage this situation; 3) a correct message that arrives more than once does not modify the transaction situation; we can say that it is simply discarded; and 4) since no assumption is made about the order in which protocol messages are received, a message can be arbitrarily delayed.
Finally, we wish to point out that it is easy to demonstrate that our protocol ensures that all participants either commit or abort a transaction even in a partitionable network [10]. Such a network is one which may split into isolated groups of nodes that cannot communicate with each other. The problem with many protocols is that a partition that occurs during the execution of the commit protocol can cause some nodes to commit and others to abort the same transaction.

VI. IMPLEMENTATION ISSUES

Some implementation choices can be made in order to minimize 1) the number of stable writes and 2) the lifetime of transaction tables. In the presentation of the protocol we have assumed that when a process executes either the Begin-transaction primitive or the Enter-transaction primitive, a record pertinent to this process is written in stable memory. Actually, such a write can be omitted, and the first stable write is performed only when the transaction locally enters the waiting state. In this way, we spare one stable write for each participant. However, it may happen that an STT (specifying the waiting state) is associated with an in-progress transaction. This occurs when a new local participant belonging to a waiting transaction executes the Enter-transaction primitive: the transaction returns to the in-progress state and its STT turns out to be nonupdated. For this reason, after a crash, an in-progress transaction

may be either unknown, or known as waiting, by an involved site. In the first case, the restarted site correctly aborts the transaction, but it does not send any ABORT message. The other sites will discover the above condition when they require information about the transaction by means of the TIME-OUT message (the restarted site, on realizing that it has no information about the transaction, will reply with an ABORT message to the requiring sites). In the second case, the restarted site sends a TIME-OUT message. Sooner or later, it will receive either a READY message (containing, as a connected participant, the name of a local process not known by this site) or an ABORT message (sent by the site(s) at which an active participant runs which is connected to a participant which was in the active state on the restarted site before the crash).⁴ In both cases the restarted site aborts the transaction. As previously stated, when a transaction T is aborted, a site disposes of the transaction tables pertinent to T. On the contrary, when a transaction is committed, its STT must be kept.⁵ Indeed, a site which knows that the transaction has been committed can inform an involved site having a nonupdated VTT due to loss of protocol messages. To avoid keeping an unbounded number of tables in stable memory, a garbage collection protocol is needed. This protocol can be based on the consideration that a site can destroy a table pertinent to a committed transaction when it realizes that all the other involved sites have committed the transaction. A simple garbage collection protocol could be organized as follows. Each site periodically (and whenever the size of the tables reaches specified bounds) sends, to all the other sites, a message containing the state of all the non-in-progress transactions it is aware of. Each of these transactions may be either waiting or committed.
A site disposes of an STT pertinent to a transaction T when it realizes that all other involved sites have committed T, i.e., they have all notified the commit of T by means of the periodic messages mentioned above. In order for this mechanism to work correctly, we must ensure that periodic messages are received in the same order in which they are sent. This can be achieved by numbering these messages.

VII. PERFORMANCE ISSUES

A. 2P Commit Protocol

Several algorithms for the commitment of distributed transactions have been proposed in the literature. The best known is the two-phase (2P) commit protocol, which we shall describe briefly. Let us suppose that no failure occurs in the system. During the first phase, the coordinator (the process that gave rise to transaction T) sends a prepare message, in parallel, to all subordinates (the processes participating in T) to ask them whether they agree to committing T. Each process which wants to commit T first writes a prepare log record, then sends a yes vote to the coordinator and waits for the final decision (commit/abort) from the coordinator. On the contrary, if a process has aborted T, or wants to abort T, it sends a no vote and forgets T without waiting for an answer from the coordinator. After receiving all the votes from the subordinates, the coordinator starts the second phase. If all the votes are yes votes, it writes a commit log record and informs all the subordinates to commit T by sending them a commit message. If at least one vote is a no vote, the coordinator writes an abort log record and sends an abort message to all the subordinates that had sent a yes vote. After receiving a commit (abort) message, each subordinate locally commits (aborts) the transaction and sends an acknowledgment message to the coordinator. After receiving all the acknowledgment messages from the processes that were sent a message in the second phase, the coordinator forgets T.
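The coordinator's decision in the second phase reduces to a check on the collected votes; a minimal sketch (our code, with logging reduced to a list and message passing and acknowledgments omitted):

```python
# Sketch of the 2P coordinator's second-phase decision (illustrative only).
def two_phase_commit(subordinate_votes):
    coordinator_log = []
    # Phase 1 has completed: the votes have been collected.
    if all(v == "yes" for v in subordinate_votes):
        coordinator_log.append("commit")   # commit log record
        decision = "commit"                # commit message to all subordinates
    else:
        coordinator_log.append("abort")    # abort log record
        decision = "abort"                 # abort message to yes-voters only
    return decision, coordinator_log

assert two_phase_commit(["yes", "yes"]) == ("commit", ["commit"])
assert two_phase_commit(["yes", "no"]) == ("abort", ["abort"])
```

Note the asymmetry the text describes: a single no vote forces an abort, and no-voters need not be answered because they have already forgotten T.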
⁴ A participant which is not able to continue its conversation with a remote participant decides to abort the transaction. We suppose that the remote procedure call mechanism discovers this situation.

⁵ Actually, for each committed transaction it is sufficient to keep only the transaction status field of the pertinent STT.


B. Comparison to 2P Let us compare 2P to our protocol on the basis of the number of protocol messages and the number of stable writes, respectively. In the execution of the 2P protocol, a committed transaction requires two messages by each subordinate (a yes vote and an acknowledgment) and two messages by the coordinator (one prepare and one commit) to each subordinate. Therefore, if the transaction involves n processes, the number M of messages in 2P is =4

* (n- 1).

In case a broadcast communication mechanism [7], [19], [26] is available, the 2 * (n - 1) messages sent by the coordinator are substituted by 2 broadcast messages. It follows that

M;;) = 2

* n.

Our protocol, in the absence of a broadcast facility, requires s - 1 READY messages for each participant, where s is the number of vimal sites involved in the transaction. It results that M=n

* (s-

1)sn

* (n-

1).

The upper bound is reached when there is one participant per site. Fig. 2 shows that the number of messages decreases as the number of participants per site increases, that is, as the parallelism on each site increases. If broadcasting of messages is available, we have M W= n .

The preceding estimates show that if broadcast messages are available, the number of messages in our protocol is reduced by 50 percent with respect to 2P. Furthermore, when a broadcast facility is not available, the cost of our protocol still remains attractive for n <= 4. Let us now analyze the number of stable writes in both protocols. In 2P, for a committed transaction each subordinate and the coordinator write two records (prepare and commit for the subordinate, commit and end for the coordinator). Therefore, if n is the number of involved processes, we have

SW_2P = 2 * n.

In our protocol, as shown in Section VI, each involved site writes 1) a stable record each time the transaction enters the waiting state, and 2) a stable record to assert the commitment of the transaction. It follows that

SW = Σ_{i=1}^{s} w_i + s,

where w_i is the number of times the transaction passes from the in-progress state to the waiting state on site i. It results that

2s <= SW <= 2n.

The upper bound refers to the case in which each time a participant becomes ready the transaction enters the waiting state. For example, this happens when there is one participant per site.

C. Characteristics of Our Protocol

We wish to point out that our protocol, unlike the 2P protocol, has a genuinely distributed approach both to the committing and to the aborting of a multicomputer transaction. In the 2P protocol, when a process decides locally to abort a transaction, it can communicate such a decision to all the other processes involved in the transaction only through the coordinator and only after receiving the prepare message from the coordinator itself (centralized abort). On the contrary, in our protocol, any participant which locally verifies an


Fig. 2. Number of messages in the 2P protocol and our protocol, plotted against the number of processes.

abort condition is able to immediately send an ABORT message to all involved sites. Therefore, any site can be informed to abort a transaction by any other site involved in the transaction itself, simply because the concept of a coordinator does not exist at all (distributed abort). Similarly, in our protocol, the READY messages sent by all the participants involved in the transaction are sufficient to allow a site to commit the transaction. Again, no coordinator exists (distributed commit).

Our protocol requires that the system provide a garbage collection facility. We wish to point out, however, that the portion of stable storage needed for a committed transaction just consists of the transaction identifier. Like the 2P protocol, our commit protocol is also blocking [10], [33], [34]. Thus, a restarted site cannot decide whether to abort or to commit a waiting transaction based only on local information: it needs to communicate with other participants. On the other hand, when designing the protocol, we did not make the following assumptions, generally required for nonblocking protocols: 1) messages are not lost, and 2) all messages are delivered in the order in which they are sent. With assumptions 1) and 2), we could modify the protocol as follows. When all the participants contained in the VTT are ready, the transaction enters the prepare-to-commit state and the site sends a COMMIT message to all involved sites. After receiving a COMMIT message from all involved sites, a site commits the transaction. With this modification, our protocol becomes nonblocking, but we have chosen not to incorporate it, owing to the resulting high overhead in terms of both the number of messages and stable writes, and the lower parallelism in protocol execution in the absence of failures.
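Under assumptions 1) and 2), the nonblocking modification just described can be sketched as follows. The class and field names are our own illustrative assumptions, and message passing is simulated with direct method calls:

```python
# Sketch of the nonblocking variant: each site broadcasts COMMIT once
# all of its participants are ready, and commits only after seeing a
# COMMIT from every involved site (assumes no lost messages, FIFO
# delivery).
class Site:
    def __init__(self, name, sites):
        self.name = name
        self.sites = sites            # all sites involved in T
        self.commits_seen = set()
        self.state = "in-progress"

    def all_participants_ready(self):
        # All participants in the VTT are ready: enter the
        # prepare-to-commit state and announce it to every site.
        self.state = "prepare-to-commit"
        for other in self.sites:
            other.receive_commit(self.name)

    def receive_commit(self, sender):
        self.commits_seen.add(sender)
        if self.commits_seen == {s.name for s in self.sites}:
            self.state = "committed"

sites = [Site(n, None) for n in ("A", "B", "C")]
for s in sites:
    s.sites = sites
for s in sites:
    s.all_participants_ready()
assert all(s.state == "committed" for s in sites)
```

The extra round of COMMIT messages and the prepare-to-commit record it implies are precisely the overhead that led us not to adopt this variant.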

VIII. CONCLUSIONS

An extremely simple and genuinely distributed commit protocol for multicomputer transactions has been presented. It supports a model of completely distributed transactions, with no hierarchical structure among the participant processes and no centralized locus of control. It requires each participant to have only a partial knowledge of the other participants; in particular, each participant knows only those participants to which it is directly connected. The transaction model presented has two major advantages: 1) if a local failure occurs, a virtual site can decide to repeat the actions relevant to a service request without affecting the overall transaction; and 2) portions of a transaction can run concurrently if they involve different resources or different parts of the same resource, thus achieving the maximum degree of parallelism.

Our protocol and the traditional two-phase commit protocol have been compared. We have found that the latter is not able to support the degree of parallelism and distribution exhibited by the proposed transaction model. Moreover, in a system provided with a broadcast facility, 2P requires a higher overhead in terms of the number of message exchanges.

ACKNOWLEDGMENT

We would like to thank the anonymous referees and A. Bertolino for all their comments regarding improvement of the paper.

REFERENCES

[1] P. Ancilotti, A. Bertolino, and M. Fusani, "An approach to efficient distributed transactions," Distributed Computing, no. 2, pp. 201-212, 1988.
[2] P. Ancilotti and M. Fusani, "Support for transactions and recovery in CNET applications," Computer Architecture Technical Committee Newsletter, IEEE Computer Society, pp. 87-96, June 1985.
[3] P. Ancilotti, B. Lazzerini, C. A. Prete, and M. Sacchi, "A proposal for distributed commitment and abort of multi-site transactions in a multimicroprocessor system," in Safety of Computer Control Systems 1986 (Safecomp '86), Trends in Safe Real Time Computer Systems, Proc. Fifth IFAC Workshop, Sarlat, France, Oct. 14-17, 1986, pp. 63-66.
[4] T. Anderson and P. Lee, Fault Tolerance: Principles and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[5] P. A. Bernstein, D. W. Shipman, and W. S. Wong, "Formal aspects of serializability in database concurrency control," IEEE Trans. Software Eng., vol. SE-5, no. 3, pp. 203-216, May 1979.
[6] G. Bracha and S. Toueg, "A distributed algorithm for generalized deadlock detection," Dep. Comput. Sci., Cornell Univ., Tech. Rep. 83-559, June 1983.
[7] D. D. Clark, K. T. Pogran, and D. P. Reed, "An introduction to local area networks," Proc. IEEE, vol. 66, no. 11, Nov. 1978.
[8] E. Cooper, "Analysis of distributed commit protocols," in Proc. SIGMOD Int. Conf. Management Data, June 1982.
[9] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger, "The notions of consistency and predicate locks in a database system," Commun. ACM, vol. 19, no. 11, pp. 624-633, Nov. 1976.
[10] H. Garcia-Molina and R. K. Abbot, "Reliable distributed database management," Proc. IEEE, vol. 75, no. 5, pp. 601-620, May 1987.
[11] V. D. Gligor and S. H. Shattuck, "On deadlock detection in distributed systems," IEEE Trans. Software Eng., vol. SE-6, no. 5, pp. 435-440, Sept. 1980.
[12] B. Goldman, "Deadlock detection in computer networks," S.M. thesis, M.I.T. Dep. Elec. Eng. Comput. Sci.; available as M.I.T. Lab. Comput. Sci. Tech. Rep. 185, Sept. 1977.
[13] J. N. Gray et al., "Granularity of locks and degrees of consistency in a shared database," IBM Res. Rep. RJ1654, Sept. 1975.
[14] J. N. Gray, "Notes on data base operating systems," in Operating Systems-An Advanced Course, Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 1978, pp. 393-481.
[15] -, "The transaction concept: Virtues and limitations," in Proc. 7th Int. Conf. Very Large Data Bases, Cannes, France, Sept. 1981, pp. 144-154.
[16] S. S. Isloor and T. A. Marsland, "The deadlock problem: An overview," IEEE Comput. Mag., vol. 13, no. 9, pp. 58-78, Sept. 1980.
[17] J. M. Jaffe, "Parallel computation: Synchronization, scheduling and schemes," Ph.D. dissertation, Massachusetts Institute of Technology, Aug. 1979.
[18] E. D. Jensen, "Distributed control," in Distributed Systems-Architecture and Implementation-An Advanced Course, Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 1981, pp. 175-190.
[19] K. Kummerle and M. Reiser, "Local area communication networks-An overview," J. Telecommun. Networks, vol. 1, no. 4, Winter 1982.
[20] B. W. Lampson and H. E. Sturgis, "Crash recovery in a distributed data storage system," Xerox PARC, Palo Alto, CA, Apr. 1975.
[21] B. Lampson, "Atomic transactions," in Distributed Systems-Architecture and Implementation-An Advanced Course, Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 1981, pp. 246-264.
[22] B. Lindsay, P. Selinger, C. Galtieri, J. Gray, R. Lorie, F. Putzolu, I. Traiger, and B. Wade, "Single and multi-site recovery facilities," in Distributed Data Bases, I. W. Draffan and F. Poole, Eds. Cambridge, MA: Cambridge University Press, 1980.
[23] B. Lindsay, L. Haas, C. Mohan, P. Wilms, and R. Yost, "Computation and communication in R*: A distributed database manager," in Proc. 9th ACM Symp. Oper. Syst. Principles, Bretton Woods, Oct. 1983, pp. 1-2 (extended abstract only).
[24] B. Lindsay and C. Mohan, "Efficient commit protocols for the tree of processes model of distributed transactions," in Proc. 2nd ACM Symp. Principles Distributed Comput.; also Oper. Syst. Rev., vol. 19, no. 2, pp. 40-52, Apr. 1985.
[25] D. A. Menasce and R. R. Muntz, "Locking and deadlock detection in distributed data bases," IEEE Trans. Software Eng., vol. SE-5, no. 3, pp. 195-202, May 1979.
[26] R. Metcalfe and D. Boggs, "Ethernet: Distributed packet switching for local computer networks," Commun. ACM, vol. 19, no. 7, pp. 395-404, July 1976.
[27] C. Mohan, R. Strong, and S. Finkelstein, "Method for distributed transaction commit and recovery using Byzantine agreement within clusters of processors," in Proc. 2nd ACM SIGACT/SIGOPS Symp. Principles Distributed Comput., Montreal, P.Q., Canada, Aug. 1983; also IBM Res. Rep. RJ3882, June 1983.
[28] J. E. B. Moss, Nested Transactions: An Approach to Reliable Distributed Computing. Cambridge, MA: MIT Press, 1985.
[29] B. J. Nelson, "Remote procedure call," Rep. CMU-CS-81-119, Dep. Comput. Sci., Carnegie-Mellon Univ., May 1981.
[30] R. Obermarck, "Global deadlock detection algorithm," IBM Res. Rep. RJ2845, June 1980.
[31] B. Randell, P. A. Lee, and P. C. Treleaven, "Reliability issues in computing system design," Comput. Surveys, vol. 10, no. 2, pp. 123-165, June 1978.
[32] S. K. Shrivastava and F. Panzieri, "The design of a reliable remote procedure call mechanism," IEEE Trans. Comput., vol. C-31, pp. 692-697, July 1982.
[33] D. Skeen, "Nonblocking commit protocols," in Proc. ACM/SIGMOD Int. Conf. Management Data, Ann Arbor, MI, 1981, pp. 133-142.
[34] D. Skeen and M. Stonebraker, "A formal model of crash recovery in a distributed system," IEEE Trans. Software Eng., vol. SE-9, no. 3, pp. 219-228, May 1983.
