Orca: a portable user-level shared object system


Orca: a Portable User-Level Shared Object System*

HENRI E. BAL, RAOUL BHOEDJANG, RUTGER HOFMAN, CERIEL JACOBS, KOEN LANGENDOEN, TIM RÜHL
Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

M. FRANS KAASHOEK
M.I.T. Laboratory for Computer Science, Cambridge, MA

ABSTRACT

Orca is an object-based distributed shared memory system that is designed for writing portable and efficient parallel programs. Orca hides the communication substrate from the programmer by providing an abstract communication model based on shared objects. Mutual exclusion and condition synchronization are cleanly integrated in the model. Orca has been implemented using a layered system, consisting of a compiler, a runtime system, and a virtual machine (Panda). To implement shared objects efficiently on a distributed-memory machine, the Orca compiler generates regular expressions describing how shared objects are accessed. The runtime system uses this information together with runtime statistics to decide which objects to replicate and where to store nonreplicated objects. The Orca system has been implemented on a range of platforms (including Solaris, Amoeba, Parix, and the CM-5). Measurements of several benchmarks and applications across four platforms show that the new Orca system achieves portability with good performance. In addition, the measurements show that the performance of the new system is as good as that of the previous implementation, which was specialized for Amoeba.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming—Distributed programming, Parallel programming; D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed, and parallel languages; D.3.4 [Programming Languages]: Processors—Compilers; Run-time environments

General Terms: Languages, Design, Performance

Additional Key Words and Phrases: distributed shared memory, parallel processing, portability

* This research was supported in part by the Netherlands organization for scientific research (N.W.O.) under grant 125-30-10.

1. INTRODUCTION The Orca distributed shared memory system allows programmers to write efficient parallel applications for a wide range of parallel architectures. The idea in the Orca system is to hide the underlying communication substrate from the programmer by allowing applications to logically share objects, even when there is no physical shared memory present in the system [1]. The key challenge is to implement this high-level programming model efficiently and in a portable fashion. To satisfy these often conflicting goals of ease of use, efficiency, and portability, we have redesigned and reimplemented the initial Orca compiler and runtime system [2]. One of the unique aspects of the new compiler is that it computes regular expressions that give a high-level description of how shared objects are accessed. The runtime system uses this information as hints about how objects are going to be used; most other systems can only predict memory usage patterns based on information about the past, unless the user adds annotations. Another key aspect of the new Orca system is that the runtime system is logically structured as three layers that can be statically configured to match the underlying platform. In this paper we describe the design and implementation of the new Orca system. Application performance measured on a wide range of architectures indicates that the Orca system achieves this combination of ease of use, efficiency, and portability. Designing a system that is at once easy to use, efficient, and portable is difficult; often one of these goals is sacrificed for the others. The early implementations of Orca are efficient but depend on features offered only by the Amoeba distributed operating system [2]. As another example, PVM is portable but low-level [43]. Other systems rely on virtual memory support [9, 27], which is not uniformly supported on parallel computers and often is expensive to use [29]. In addition, most systems are integrated with low-level sequential languages that provide no direct support at all for parallel programming [22, 34, 41]. Finally, many systems force the programmer to deal with details such as setting up communication channels, constructing, sending and receiving messages, synchronizing access to shared data, associating synchronization objects with shared data [5, 25, 31], and so on. Many of these issues are complicated and distract the programmer from the real problem: writing the application. The programming model behind Orca is easy to use and provides a high degree of portability, because it abstracts from many of the low-level hardware mechanisms. The key idea in the model is to hide message passing from the user, and to support communication based on logically shared data, much as in shared virtual memory (SVM) systems. Unlike most SVM systems, however, the Orca programming model is based on sharing objects rather than virtual memory. An object in our model is an instance of an abstract data type, which is accessed through user-defined operations. A chief advantage of the model is that all operations on shared objects are executed indivisibly (as in monitors), so programmers do not have to worry about mutual exclusion synchronization. Also, condition synchronization is integrated cleanly in the model, by allowing operations to block. The challenge is to make the implementation of shared objects efficient and portable. In this paper, we describe a new Orca system that was designed from scratch with portability as a primary goal.
To achieve portability, we use a layered approach. The system contains several layers, and the machine-specific parts are isolated in the lowest layer. The compiler and the runtime system, for example, are fully machine-independent. The runtime system is implemented on top of a virtual machine called Panda, which provides the communication and multi-tasking primitives needed by the runtime system. So, porting the Orca system to a new machine comes down to porting Panda.

Two ideas make the portable implementation efficient. First, the compiler statically computes for each process type a sharing pattern. This architecture-independent pattern is a regular expression that describes a process's read/write accesses to shared objects. For example, a process that repeatedly accesses a shared object X with read-only operations is described with a pattern: {X$R}. The compiler can compute these patterns because the Orca language was carefully designed to exclude any features that make interprocedural control flow and data flow analysis hard. The compiler passes these access patterns to the runtime system, as hints about how objects are going to be used. Based on the cost for remote accesses, the compiler-generated hints, and runtime statistics, the runtime system decides on a replication strategy that matches the underlying architecture and the application. The second idea is an instantiation of a well-established implementation technique: Panda provides a virtual machine that can be statically configured to match the underlying architecture. If the underlying system, for example, supports reliable communication, the Panda system is statically configured to make use of this. This static elimination of machine layers is similar to the ideas found in the x-kernel system to eliminate communication layers [38]. Typically a port of Panda requires implementing a small bottom layer that only needs a narrow interface to the underlying machine. Thus, with minimal effort the complete system can be made to run. Based on performance measurements, the implementation can then be incrementally improved, for example by eliminating some of the layering overhead. To evaluate the ideas we have implemented the new compiler and the Panda-based runtime system. We have successfully ported this implementation with little effort to a wide range of architectures: the CM-5, the PowerXplorer, the Amoeba processor pool, and a collection of UNIX workstations. These architectures differ greatly in communication performance, operating system, network architecture, and so on. Performance results of several applications show that the Orca system performs well on this wide range of architectures, although for systems with slow communication a larger problem size is sometimes needed. We compare the original implementation on the Amoeba processor pool with the new portable implementation and show that the latter performs as well as the high-performance non-portable implementation on a range of applications. The rest of this paper is organized as follows. In Section 2, we look at related work. In Section 3, we outline the shared data-object model. In Section 4, we discuss a portable implementation of this model. In Section 5, we describe how this system has been implemented on several different platforms. In Section 6, we analyze the performance of the system. In Section 7 we discuss the lessons we have learned and in Section 8 we give a summary of the paper. 2. RELATED WORK In the last decade a substantial number of Distributed Shared Memory (DSM) systems have been built. We discuss how our work differs from these systems. Early DSM systems, such as Li's Shared Virtual Memory [27], simulate physical shared memory by extending the operating system. Traditional SVM has several performance problems; in particular, it suffers from false sharing. More modern DSM systems try to obtain higher performance and often are implemented outside the operating system. TreadMarks [9] is a user-level, page-based DSM system.
It is implemented entirely on top of the operating system. Like SVM, it simulates a shared memory with a linear address space. Also like SVM, TreadMarks partitions the memory into fixed-size pages and uses the virtual memory hardware
to detect page faults. Such page faults are handled by a user-level signal handler, whereas with Li’s SVM they are handled by the kernel. TreadMarks uses a relaxed memory consistency model called “lazy release consistency.” This model gives the same results as the more strict sequential consistency model (which is used in SVM), provided that certain synchronization restrictions are satisfied. In particular, all conflicting accesses to shared memory must be guarded by synchronization primitives (acquire and release) that are supplied by TreadMarks. Another important difference with SVM is that TreadMarks addresses the problem of false sharing by using a multiple-writer protocol. Multiple processors can write to the same page. The implementation uses a data structure (a “diff”) that contains the difference between two versions of a page. These diffs are exchanged between processors at certain synchronization points. Orca and TreadMarks differ in several ways, both in the programming model and in the implementation. Orca provides shared objects, rather than a (virtual) shared memory. Also, it uses sequential consistency. The implementation of Orca deals with user-defined operations on objects, rather than memory reads and writes, so it does not need virtual memory hardware. Also, as we will see, Orca uses an update protocol for replicated objects, whereas TreadMarks and SVM use invalidation. These differences clearly have an impact on performance. Orca eliminates the problem of false sharing, because it provides user-defined objects instead of system-defined fixed-size pages. TreadMarks reduces the overhead of false sharing using a multiple-writer protocol, but experience shows that several applications still suffer from false sharing [29]. Also, the separation of synchronization and data transfer and the overhead of handling page faults and exchanging diffs was found to be significant for several TreadMarks applications [29]. The performance impact of an update protocol versus an invalidation protocol depends very much on the application [3]. For most applications, we found Orca’s update protocol to be quite efficient. For page-based DSM systems, invalidation will in general be more efficient, because the same page may be written many times in succession. In Orca, consecutive writes are often encapsulated in a single user-defined operation, so the copies of an object can be updated just by broadcasting this single operation. Another DSM system related to Orca is CRL (C Region Library) [22]. Like Orca, CRL is implemented entirely in software. In particular, it does not use any hardware support to check references to shared data. CRL applications share regions, which are contiguous areas of memory. As with Orca’s objects, the size of a region is user-defined, to prevent false sharing. CRL supports primitives to map and unmap regions. Accesses to a region’s data must be grouped into operations; CRL provides primitives to begin and end an operation. CRL requires the programmer to distinguish between read and write operations. The Orca system automatically makes this distinction (using compiler analysis), and uses this information together with the access patterns to determine the replication strategy. Yet another related DSM system is SAM [41]. An important goal in SAM is to give the user more control over communication. 
The programmer must indicate how shared data will be accessed, and this information is used to reduce the communication and synchronization overhead, using prefetching and caching techniques. Two kinds of shared data are defined: values with singleassignment semantics and accumulators (which reflect producer-consumer behavior). Our Orca system provides a simpler, more uniform object model, and does not use information provided by the programmer. Instead, our goal is to have the compiler and runtime system together optimize communication.

Besides TreadMarks, CRL, and SAM, many other related distributed shared memory systems exist [4, 5, 15, 27, 32, 34, 41]. The Orca programming model has several characteristics that make it different from other forms of DSM. Foremost, it uses object-based DSM, instead of page-based DSM. Second, it is implemented entirely in user-space, with no changes to the operating system. Finally, our model is supported in a new programming language (although it can also be added to an existing language, such as C). 3. THE SHARED DATA-OBJECT MODEL The Orca programming model is based on shared objects instead of shared memory words. Both in physical shared memory systems and in virtual shared memory systems, processes communicate by reading and writing memory words. They cooperate using synchronization primitives designed for shared memory, such as locks, semaphores, and barriers. Unlike shared (virtual) memory systems, the shared data-object model is based on high-level operations on shared data structures and on implicit synchronization, which are integrated into the model. The starting point in the Orca model is to encapsulate shared data in data-objects, which are manipulated through operations of an abstract data type. An object may contain any number of internal variables and arbitrarily complex data structures. A key idea in the shared data-object model is to make each operation on an object indivisible. Each operation is applied to a single object, but within this object the operation can execute arbitrarily complex code using the object’s data. Conceptually, the object is protected by a lock, so all operations are executed without interfering with each other, much as in a monitor. Once an abstract data type has been defined, instances (objects) of this type can be declared. A process declaring an object can share the object with its children (and possibly further descendants) by passing it as a call-by-reference parameter when forking the child processes. As an example, we will discuss an object type used in a program for the Arc Consistency Problem (ACP) [30]. This program takes a set of N variables as input. Each variable can take values from a certain domain (e.g., all integers between 1 and M). The values are restricted by binary constraints defined on some pairs of variables. The goal of the program is to eliminate impossible values from the domains by repeatedly applying the constraints, until no further elimination is possible. The problem is solved in parallel by dividing the variables among the processors. Each processor works on eliminating values for the variables assigned to it. With this algorithm, each processor must be able to access the current domain values for the variables. In the shared data-object model, this can be achieved by storing these values in an object shared among all processes. The implementation of this object type (abstract data type) is shown in Figure 1. Each process sharing an object D of this type can apply two different operations to it. The invocation D$eliminate(v,S) eliminates all values in set S from the current domain of variable v. The invocation D$values(v) returns the current domain for v. This example illustrates several properties of the shared data-object model. The implementation of the operations does not contain any code to synchronize access to the object’s local data. The model guarantees that all operations are executed indivisibly, so mutual exclusion synchronization is done by the runtime system. 
(The precise semantics of the model are discussed in more detail in [14].)

OBJECT Domain;
  TYPE ValueSet = SET OF integer;
  Domains: ARRAY[1..N] OF ValueSet;        # object's private data

  OPERATION eliminate(v: integer; S: ValueSet);
  BEGIN
    Domains[v] := Domains[v] - S;
  END;

  OPERATION values(v: integer): ValueSet;
  BEGIN
    RETURN Domains[v];
  END;
END;

Figure 1: An example object type.

For example, if two processes concurrently invoke D$eliminate(v,S1) and D$eliminate(v,S2), the result will be consistent, since the operations will be serialized. If two processes execute D$values(v), on the other hand, the two operations may be executed simultaneously, because they do not modify the object. Such operations are called read operations, and they play an important role in efficient implementations of the model. Another aspect illustrated in Figure 1 is the fact that operations on objects can be arbitrarily complex. In the example, each operation reads or writes a set. The runtime system, however, treats each operation as a single action, which results in at most one communication event. SVM systems, on the other hand, access shared data structures through machine instructions that read or write one word at a time. If run on top of SVM, the code above might result in multiple messages (page transfers), depending on the size of the sets and on how they are stored in memory. Besides mutual exclusion synchronization, condition synchronization is important in parallel programming. Most SVM-based systems use low-level primitives for this. In Orca, condition synchronization is integrated cleanly into the model. The key idea is that operations are allowed to block in guard expressions, which are based on Dijkstra's guarded commands [11]. A guard is a Boolean expression that must be satisfied before the operation can begin. Each operation may contain multiple guards and blocks as long as all guards are false. A blocking operation suspends the process that invoked the operation. As soon as one or more guards become true, one true guard is selected nondeterministically and its sequence of statements is executed. We illustrate guards through another example object used in the Arc Consistency program. With ACP, it is difficult to determine when the program should terminate. Multiple processes repeatedly recheck the constraints. If the domain for a variable v has been changed, other variables must be rechecked. So, even if a process has finished rechecking the constraints for all its variables, it cannot terminate, because other processes may generate new work for it. The program should only terminate if two conditions are satisfied: (1) all processes are idle, and (2) no variables need to be rechecked. We can implement this in the shared data-object model using an object type WorkAdministration, shown in Figure 2. This object contains a set recheck containing the variables that must be rechecked. (For brevity, the operations on this set are not shown in the example.) Also, the object keeps track of the number of active processes. If a process has finished rechecking its own variables, it first calls Ready to indicate it is ready to terminate and then it invokes WaitForWork.

OBJECT WorkAdministration;
  TYPE VariableSet = SET OF integer;
  recheck: VariableSet;            # set of variables to recheck
  ActiveProcesses: integer;        # number of active processes
  ...

  OPERATION Ready();
  BEGIN
    ActiveProcesses -:= 1;
  END;

  OPERATION WaitForWork(S: VariableSet): boolean;
  BEGIN
    # Wait until either (1) the intersection of S and recheck
    # is non-empty, or (2) the intersection is empty and
    # all processes are idle.
    GUARD SIZE(S * recheck) > 0 DO
      ActiveProcesses +:= 1;
      RETURN true;
    OD;
    GUARD (ActiveProcesses = 0) AND (SIZE(S * recheck) = 0) DO
      RETURN false;
    OD;
  END;
END;

Figure 2: An example object type using guards.

The WaitForWork operation consists of two guarded alternatives. The first guard becomes true as soon as some variables in S exist that have to be rechecked. The second guard implements the termination condition stated above. The operation blocks if neither of the conditions is true. In this case, there is no work yet for the current process, but other processes may still generate work for it. This example illustrates several interesting aspects of our model. First, our condition synchronization mechanism has a high level of abstraction. It allows complicated conditions to be expressed easily. The implementation takes care of checking the conditions, in the worst case by reevaluating the Boolean expressions each time the object is modified. Also, the synchronization is hidden inside the object's operations. Users of the object do not include any code for synchronization. Finally, as we stated before, each operation is applied to a single object, but each object and operation can be arbitrarily complex. In the example above, the object contains two pieces of information: the variables that have to be rechecked and the number of active processes. By storing them in a single object, it is possible to define a WaitForWork operation that uses both of them. Even though the operation may block, it is still indivisible. If the guards fail, the operation has no side effects, because guards can only appear before the statement list. As soon as a guard becomes true, the entire operation (including the guard) is executed as a single atomic action. The shared data-object model resembles the use of monitors, in that mutual exclusion synchronization is done automatically. However, we use a higher-level condition synchronization mechanism. With monitors, explicit wait and signal primitives must be used. Also, monitors are designed for systems with shared memory. Shared objects can be implemented efficiently without shared memory by replicating them.
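To make these guarantees concrete, the sketch below shows how an object with a guarded operation in the style of WaitForWork could be emulated on a single shared-memory node using POSIX threads: every operation runs under the object's lock, and a blocked guarded operation waits on a condition variable that is signalled after each write operation. This is only an illustration of the semantics; the names are invented and this is not the code produced by the Orca system, which (as Section 4 describes) implements the same semantics without shared memory.

#include <pthread.h>
#include <stdbool.h>

/* Illustrative shared-memory emulation of an Orca-style object: each
 * operation runs with the object's mutex held (mutual exclusion), and a
 * blocked guarded operation waits until the object is modified. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;        /* signalled after every write operation */
    int             active_processes;
    /* the set of variables to recheck would be stored here as well */
} work_admin_t;

void ready(work_admin_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->active_processes -= 1;            /* body of the Ready operation */
    pthread_cond_broadcast(&w->changed); /* let blocked callers re-evaluate their guards */
    pthread_mutex_unlock(&w->lock);
}

/* Returns true when work is available, false when the termination guard holds. */
bool wait_for_work(work_admin_t *w, bool (*has_work)(const work_admin_t *))
{
    pthread_mutex_lock(&w->lock);
    for (;;) {
        if (has_work(w)) {                        /* guard 1: something to recheck */
            w->active_processes += 1;
            pthread_mutex_unlock(&w->lock);
            return true;
        }
        if (w->active_processes == 0) {           /* guard 2: all processes idle */
            pthread_mutex_unlock(&w->lock);
            return false;
        }
        pthread_cond_wait(&w->changed, &w->lock); /* block until the object changes */
    }
}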

The shared data-object model thus is a high-level model for parallel programming. It is similar to programming with monitors on shared memory, but it uses object-based techniques and it is suitable for systems that do not have shared memory. It integrates sharing of data, mutual exclusion synchronization, and condition synchronization in a clean way. In conventional page-based SVM systems, these features are rarely integrated in a simple and efficient way [35]. The model described above is supported in Orca, which is a language for writing coarse-grained parallel applications that run on distributed-memory systems. Orca has several advantages over existing languages. It has been specially designed for distributed-memory systems. For example, it allows any data structure (e.g., a list or a graph) to be passed as a parameter in an operation, and does marshalling of parameters automatically in the runtime system. Also, Orca is a type-secure language [20]; all violations of the language rules are detected by the compiler or runtime system. Examples are out-of-bounds array references and references to memory that has been deallocated. Type-security is a significant advantage, since it eases debugging of parallel programs. 4. A PORTABLE AND EFFICIENT IMPLEMENTATION The combination of efficiency and portability is hard to achieve for any programming system. With the Orca programming model, we have to solve two challenging problems:

1. How to efficiently map shared objects onto the target system's memory hierarchy.

2. How to efficiently map object operations onto the target system's communication substrate.

For both problems, the best solution depends on the primitives provided by the hardware and operating system software of the target machine. Since a wide variety of communication and synchronization primitives exist, with widely different semantics and performance characteristics, it is very difficult to solve the above problems in a portable and efficient way. Our solution to these problems is based on two ideas. First, the new Orca system performs extensive compile-time analysis of Orca programs to give the runtime system the information necessary for making the right decisions about object replication and placement. Second, we use a layered approach that isolates machine-dependencies in a small, well-defined part of the Orca system, while still allowing the system to exploit certain useful functionality (e.g., reliable message passing) provided by the target platform. We have built a new portable implementation of the shared data-object model, using a layered approach. In general, we try to keep the overhead of the layered approach as small as possible, by using simple interfaces (consisting of procedure calls) between the layers. The runtime system can be statically configured to take advantage of the features of the underlying system. Figure 3 depicts the layers of the Orca system. The first layer is the compiler, which compiles Orca programs to ANSI C. The compiler also generates information about objects for the runtime system. In particular, the compiler determines which operations on objects are read-only and it generates information about how processes access shared objects. The runtime system uses all this information to implement objects efficiently. The runtime system (RTS) is the second layer. It is responsible for managing Orca processes and objects. It may decide to replicate or migrate objects using the information from the compiler. The RTS is entirely machine independent: it does not make calls to primitives of the underlying operating system, but uses Panda primitives instead.

Layer 1: Compiler
Layer 2: Language runtime system (object management)
Layer 3: Panda (Group Communication, RPC, and threads)
(below Panda: the operating system and the hardware)

Figure 3: Layers in the implementation of the shared data-object model. Panda is the third layer in the Orca system. It provides the system facilities needed to implement the RTS. This layer must be ported to run Orca programs on a new architecture. The Panda layer provides threads, RPC, and group communication [23]. Group communication in Panda is totally-ordered, which means that all group messages are received by all processors in the same total order. Totally-ordered group communication is useful for maintaining the consistency of copies of objects [2]. To implement RPC and totally-ordered group communication, Panda assumes that the underlying operating system or hardware provides nothing more than unreliable point-to-point communication (unicast). If more powerful primitives are provided, however, Panda can make use of them. 4.1. The Orca compiler The new Orca compiler translates Orca programs to ANSI C augmented with calls to the RTS. It is a full compiler (rather than a source-to-source translator), including a parser, semantic analyzer, optimizer, and code generator. We chose to generate C as output, because it is highly portable and available on virtually all parallel systems. In addition to generating code, the compiler plays a central role in making the Orca system efficient. It performs three important tasks that allow for efficient implementations of the shared data-object model. First, the compiler classifies objects as LOCAL or SHARED (objects that are not shared can be implemented more efficiently). Objects become shared when they are passed as a call-by-reference parameter to a new Orca process. Second, the compiler classifies operations as READ or WRITE. READ operations do not modify an object, whereas WRITE operations potentially do. Third, for each type of process the compiler computes a pattern that describes how such processes use shared objects. The pattern is a regular expression with special symbols denoting sequencing, selection, and repetition. For example, the pattern: #1$W; [ #2$R | {#3$W} ] indicates that the process will perform a WRITE operation on object 1, followed by either a READ operation on object 2 or a repetition of WRITE operations on object 3. The compiler passes a summary of the access pattern to the RTS, describing how each process accesses its shared objects. The summary contains a score for the different objects, which is used by the RTS to decide how to implement the objects. The scores are computed by the compiler using the heuristic that operations inside a repetition (loop) are executed more frequently than operations outside a repetition. Also, operations inside an if-statement are executed less frequently than operations
outside a selection. In future implementations, the compiler can also pass the pattern itself to the RTS, allowing the RTS to do other optimizations as well. All the information described above is passed on to the RTS, which uses it to select a replication strategy for objects. In this way, the RTS has a hint about how objects are going to be used. Most SVM systems do not have this advantage, as they are implemented entirely by a library or RTS, so they can only predict memory usage patterns based on information about the past, unless the user adds annotations. 4.2. The runtime system The runtime system manages objects and processes. Orca processes are implemented as Panda lightweight threads. All Orca processes on the same processor therefore can access the same state information, such as (copies of) shared objects. We will first look at the strategies and mechanisms for implementing shared objects, and discuss processes after that. 4.2.1. Strategies for implementing shared objects The most challenging problem the RTS is faced with is how to implement shared objects efficiently on a distributed-memory machine. For efficiency, the RTS uses several different implementation techniques. First of all, an object can be stored on one processor or it can be replicated. A nonreplicated object is accessed by remote machines through Remote Procedure Calls. If a certain machine accesses an object frequently, it may be useful to migrate the object to that machine. For replicated objects there are many options. An important issue is which processors get a copy (all or a subset). Also, a write operation can be implemented by invalidating the replicas or by sending either the new value of the object or the operation and its parameters. Based on an analysis of several different strategies [1, 3], the current Orca system either stores an object on one machine or on all machines that can access the object. Write operations on replicated objects are implemented by broadcasting the operation and updating all copies. We have chosen to use an update scheme to keep replicated objects consistent rather than an invalidation scheme. Objects sometimes contain large amounts of data, in which case invalidating copies is wasteful. Also, updating a copy will often take roughly the same CPU time and network bandwidth as sending invalidation messages. Many update messages contain only an operation identifier and a small number of parameters for the operation. A vital issue is how to decide whether to replicate an object and where to store nonreplicated objects. These decisions are taken dynamically by the RTS, using a hybrid approach based on compiler and runtime information. In addition to the compiler-generated summary of the access patterns, the RTS dynamically keeps track of the number of actual read and write operations issued by each processor on each object. When the RTS is asked to fork a process, it uses the compiler-generated information to decide whether to migrate an object, replicate an object on all processors that can access the object, or keep it on its current processor. The RTS uses its dynamic statistics to determine if the compiler estimated behavior matches the actual access pattern of an object. If not, the RTS reconsiders its initial decision and switches to the most efficient object distribution strategy based on the runtime statistics. Since switching strategies may involve migrating or replicating the entire state of the object, it may be an expensive operation. 
The advantage of the hybrid approach is that the strategy will only be changed during runtime if the compiler-generated information is
insufficient to make the right decision. The RTS thus maintains two types of usage information for each object, one based on the information generated by the compiler and one based on dynamic statistics (operation counts). Each object is assigned a manager, which is a processor that keeps track of this information. The information maintained by the manager is shown in Figure 4. It contains the dynamic number of read and write operations and the read and write scores generated by the compiler (summed over all processes). The compiler-generated information is updated whenever a new Orca process is created. The dynamic information is updated whenever the manager executes an operation or services a remote invocation request. For replicated objects, each processor keeps track of the number of read operations it executed on its local copy. This information is piggybacked on the next write operation. The dynamic read score of the manager may thus lag behind. In addition to the information in Figure 4, some additional state information is maintained to determine which processor executes the largest number of operations on an object; this information is used to determine the best location for objects that are not replicated.

struct manager {
    double dynamic_writes;    /* dynamic count of write operations */
    double dynamic_reads;     /* dynamic count of read operations */
    double static_writes;     /* sum of compile-time write scores */
    double static_reads;      /* sum of compile-time read scores */
    ...
};

Figure 4: The information maintained by the manager of an object.
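The paper does not give the exact decision rule, but the counters in Figure 4 suggest how such a rule could look. The sketch below is a hypothetical heuristic only: the way static scores and dynamic counts are combined, and the threshold, are invented for illustration.

/* Hypothetical replication heuristic driven by the manager counters of
 * Figure 4 (struct manager as shown there): reads favour replication, since
 * they become local operations, while writes favour a single copy, since
 * they then avoid a broadcast. */
enum strategy { SINGLE_COPY, REPLICATED };

enum strategy choose_strategy(const struct manager *m)
{
    /* Combine compile-time scores with observed operation counts. */
    double reads  = m->static_reads  + m->dynamic_reads;
    double writes = m->static_writes + m->dynamic_writes;

    if (writes == 0.0)
        return REPLICATED;                 /* (nearly) read-only: replicate */

    /* Replicate when reads clearly dominate writes; otherwise keep a single
     * copy, placed on the processor that issues the most operations. */
    return (reads / writes > 2.0) ? REPLICATED : SINGLE_COPY;
}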

4.2.2. Mechanisms for implementing shared objects The RTS thus uses several strategies for implementing objects (full replication, migration, and single-copy). To support these strategies, the RTS implements mechanisms to update copies consistently, to migrate an object from one processor to another processor, and to perform an operation remotely. These mechanisms are relatively easy to implement, as objects are passive and thus do not contain any threads. When an operation on a shared object is invoked, the compiler generates a call to a runtime routine. This routine first checks if the object is replicated. If so and if the operation is a READ operation, it performs the operation locally. If it is a WRITE operation, the invoker broadcasts the operation using totally-ordered group communication and then blocks until it receives the operation in the total ordering. When a broadcast message with an operation is received, the RTS calls a procedure that performs the operation and, if the operation was issued by one of its local threads, unblocks the thread. The algorithm described above is the most general case. Whenever possible, however, the compiler uses special cases and avoids generating runtime tests. If the object is not replicated, it is stored either locally or remotely. In the local case, both the READ and WRITE operation can be performed without any communication. If the object is not replicated and not stored locally, the operation always results in an RPC to the remote processor. One complication with remote object invocations is that operations may block inside a guard (see Section 3). In this case, we create a continuation [12] at the remote processor containing the operation and its parameters. Whenever the object is changed, this data structure is picked up and the operation
is tried again. An alternative solution (which we originally used) is to create a new thread of control for the operation, but this has high context switching and memory overhead on many architectures [6]. 4.2.3. Orca processes Let us finally look at the protocol used to create new Orca processes. Each process is implemented as a thread. Whenever a new process is created, the manager information (see Figure 4) of all objects that are passed as a shared parameter must be updated. We implement this by broadcasting a message containing the type of the new process and the identities of the shared objects. The processor on which the new process will run may not have copies of all relevant replicated objects yet, since we only replicate objects on those processors that have at least one process referencing them. If no local copy is available yet, a copy is fetched (using an RPC) from the processor that requested the remote fork. If any updates to the object arrive during this RPC call, these updates will be queued by both processors and executed as soon as the copy has been delivered. In this way, we guarantee that the copy remains consistent. 4.3. Panda The Orca RTS is implemented on top of an abstract machine called Panda, which is the third layer in our system. Panda provides threads, Remote Procedure Call, and totally-ordered group communication. Panda internally consists of two layers (see Figure 5). The System layer is partly operating system dependent and provides point-to-point communication and multicast (not necessarily reliable). The Interface layer uses these primitives to provide a higher level interface to the RTS. The Interface layer does not access any operating system primitives.

[Figure 5 shows this structure: the runtime system sits on top of Panda's Interface layer (RPC + group communication), which sits on top of Panda's System layer (unicast + multicast + threads), which in turn runs on the operating system.]

Figure 5: The layers of Panda. An important issue is in which layer the reliability and ordering semantics should be implemented. Panda is designed to be flexible and allow both layers to implement any of these semantics. For example, on most multicomputers the hardware communication primitives are already reliable, so it is most efficient to have the System layer provide reliable communication. In this case the Interface layer is simple. If the communication primitives provided by the System layer are unreliable, the Interface layer implements protocols to make communication reliable. The Panda RPC protocol is a 2-way stop-and-wait protocol. The client sends a request message to the server. The server executes the request and sends back a reply message, which also acts as an implicit acknowledgement for the request. The protocol is part of the Panda Interface layer, so it
generally runs on top of unreliable unicast. However, if the Panda System layer provides reliable unicast, the RPC protocol can be configured (at Panda compile time) to make use of this. We have designed several protocols for totally-ordered group communication. One of them is similar to the Amoeba protocol [23]. It uses a sequencer to order all messages. The protocol dynamically switches between two variants: PB and BB. PB (Point-to-point then Broadcast) is used for small messages. To broadcast a message with PB, the sender passes the message to the sequencer, which tags it with the next sequence number and then does the actual broadcast. The receivers use the sequence numbers to determine if they have missed a message, in which case they ask the sequencer to send the message again. For this purpose, the sequencer keeps a history of messages that may not have been delivered at all machines yet. The protocol has several mechanisms to prevent overflow of the history buffer. For large messages, both Amoeba and Panda use a variant of the protocol that uses less bandwidth, called BB (Broadcast - Broadcast). With BB, the senders broadcast the message themselves and the sequencer broadcasts a (small) acknowledgement message containing the sequence number [23]. Another totally-ordered group communication protocol we designed is called GSB (Get Sequence-number and Broadcast). With this protocol, the sender first requests a global sequence number by sending a point-to-point message to the sequencer. The sequencer replies with another point-to-point message. The sender then broadcasts the data message tagged with its sequence number. Although this protocol uses three (instead of two) messages, it requires the sequencer to only send point-to-point control messages. We use the GSB protocol for machines on which short point-to-point messages are inexpensive. The reliability and total-ordering semantics may also be provided by the System layer multicast primitive. If so, the group communication protocol will be configured at compile time to exploit these semantics, just as with the RPC protocol. In this case, the main functionality of the Interface layer group communication protocol is the maintenance of group membership information. Panda thus can take advantage of facilities provided by the operating system or the underlying hardware. For example, it only uses expensive software protocols for making communication reliable if they are really needed. By doing a careful design of the Panda interface, we strive to make Panda (and therefore Orca programs) portable and efficient. Unlike many other Distributed Shared Memory (DSM) systems, Panda does not require virtual memory support, which many parallel machines do not provide. The only primitive that Panda requires from the underlying hardware or operating system is unreliable message passing. This makes the shared data-object model very portable. It can even run on operating systems and machines (e.g., J-machine) that do not support virtual memory. Finally, we note that, although Panda was initially designed to support a portable implementation of Orca, it has also been used successfully for several other programming systems (e.g., PVM and SR), as described in [40]. 5. PORTABILITY OF THE ORCA SYSTEM To illustrate the portability of the Orca system we describe four implementations for different hardware and software platforms. The new system described in this paper was developed initially on a collection of Sun-4 workstations running SunOS 4.3. 
Since SunOS does not provide threads, Panda’s thread interface has been implemented on top of a user-level threads package for SunOS: Pthreads [33]. The communication interface (i.e., RPC and group communication) has been
implemented on top of UDP/IP. This implementation uses SunOS kernels that have been extended with Deering IP-multicast [10], which supports hardware multicasting. In this section, we will look at several other interesting implementations, to study the issues involved in porting the Orca system. The system has been ported to several operating systems (including Amoeba, Solaris, and Parix) and machines (including the CM-5, Parsytec GCel, Parsytec PowerXplorer, CS-2, and a Myrinet cluster); ports to other systems (including the SP2) currently are under way. 5.1. The Amoeba implementation The first system we consider is the Amoeba distributed operating system [44], which runs on a collection of processors connected by a LAN. Like Panda, Amoeba supports threads, RPC, and totally-ordered group communication. In addition, Amoeba offers unreliable communication primitives for sending a message to one process (unicast) or a group of processes (multicast). These primitives are provided by FLIP (Fast Local Internet Protocol), which is a network layer protocol. FLIP differs from the more familiar Internet Protocol by supporting location transparency, broadcasting, and security [24]. There are two radically different ways to implement Panda on Amoeba (see Figure 6). One way is to use the Amoeba RPC and group communication protocols to implement the corresponding Panda primitives. So, Amoeba is treated as a system providing reliable RPC and totally ordered group communication. This implementation wraps the Amoeba primitives to make them directly usable by the Orca RTS.

[Figure 6 contrasts the two approaches. In the kernel-space implementation, the Panda Interface layer consists of wrapper routines on top of the Amoeba microkernel's RPC, group communication, and FLIP protocols. In the user-space implementation, the Panda RPC and group communication protocols run in the Interface layer on top of unreliable FLIP communication provided by the System layer.]

Figure 6: Two implementations of Panda on Amoeba. An entirely different approach is to bypass the Amoeba communication protocols and directly access the low-level FLIP communication primitives (see right part of Figure 6). For this implementation, we have used Panda's RPC and group-communication protocols described in Section 4.3. The first implementation thus uses the Amoeba kernel-space communication protocols, whereas the second implementation runs the Panda protocols in user space. Although running protocols in user space potentially has some performance overhead, the impact of this overhead on Orca applications is quite small, as shown in [36]. The main advantage of user-space protocols is increased flexibility. For example, it is not possible to implement continuations (see Section 4.2.2) efficiently on top of the Amoeba kernel-space protocols, because Amoeba RPC requires the receiving thread to also send the reply. We therefore decided to adopt the system with user-space protocols for our research and development.
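As the next paragraph notes, porting Panda to Amoeba only required implementing Panda's system-level primitives on top of FLIP. The paper does not list that interface, so the header below is a hypothetical sketch of what such a narrow System-layer interface might look like; all names and signatures are invented for illustration.

/* Hypothetical sketch of a Panda System-layer porting interface: unreliable
 * point-to-point and multicast sends plus basic thread support. Reliability
 * and total ordering are added by the Interface layer when the platform does
 * not already provide them. */
typedef void (*sys_receive_handler)(int source, const void *data, int length);

int  sys_unicast(int destination, const void *data, int length);
int  sys_multicast(const void *data, int length);
void sys_register_receive_handler(sys_receive_handler handler);

int  sys_thread_create(void (*entry)(void *arg), void *arg);
void sys_mutex_lock(void *mutex);
void sys_mutex_unlock(void *mutex);

On Amoeba these calls would map onto FLIP and kernel threads; on the CM-5 they would map onto active messages and a user-level threads package, as described in Section 5.4.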

The Panda protocols that were initially developed on SunOS remained unchanged during the port to Amoeba. We only had to implement Panda’s system level primitives on top of FLIP. All modifications to the system thus were localized in the system-dependent part of Panda (see Figure 6). The System layer uses one daemon thread to receive FLIP packets. The daemon also reassembles fragmented messages. Panda threads are implemented as Amoeba threads, which are created and scheduled (preemptively) by the kernel. Mutexes are provided for synchronization between threads of one process. The Panda system uses them to implement mutexes and condition variables, which are provided to the interface user (i.e., the Orca RTS). 5.2. The Solaris implementation The Solaris implementation of Orca has the same structure as the user-space implementation on Amoeba. Panda threads are implemented as unbound threads in Solaris. Also, the Panda synchronization primitives (mutexes and condition variables) can easily be mapped onto corresponding Solaris primitives. The Panda system layer is implemented on top of the socket interface to UDP/IP, using IP unicast and multicast. The system layer uses two receive daemon threads, one for unicast and one for multicast messages. Panda RPC and group communication are implemented on top of this system layer, just as for the user-space implementation on Amoeba. 5.3. The Parix implementation Parix [37] is a parallel operating system developed by Parsytec. We have ported Orca to a Parix system running on a Parsytec PowerXplorer. The PowerXplorer contains PowerPC processors which are interconnected in a grid, using T800 transputers as communication processors. The system has no hardware support at all for multicasting, so it allows us to study the implementation and performance of Orca on machines that only have point-to-point communication. Parix already offers threads, so Panda threads can easily be implemented as Parix threads. Likewise, Parix offers synchronization primitives that can be used to implement the primitives required by Panda. Since the point-to-point communication provided by Parix is reliable, little effort was required to implement Panda RPC. The Panda System layer provides reliable unicast to the Interface layer, which implements a very simple RPC protocol on top of this primitive (see Section 4.3). It was considerably more difficult to implement the totally-ordered group communication part of Panda efficiently, since neither the hardware nor the operating system provides any form of broadcasting. We implemented reliable broadcasting in software, by setting up a spanning tree among the processors. Each processor has several routing daemon threads (for each incoming and each outgoing link). A broadcast starts by sending out the message on all links. When a broadcast message is received on a link, the daemon forwards the message to its neighbors down the spanning tree and delivers the message to the application. This forwarding of messages to multiple links is done in parallel without copying the message. The most difficult part in the implementation was how to make the group communication totally ordered. We implemented the PB, BB, and GSB algorithms described in Section 4.3. The disadvantage of PB and BB is the large message traffic to and from the sequencer. For each group message, the sequencer does one multicast. (For PB it multicasts the data, and for BB it multicasts the
acknowledgement). The sequencer and the communication channels near the sequencer therefore are likely to become a bottleneck when scaling to large numbers of processors. GSB avoids this problem, because it lets the sequencer send and receive only short point-to-point control messages. GSB, however, uses three messages, whereas PB and BB use two messages. Latency and throughput measurements show that, on Parix, the GSB algorithm in general is the best choice. Our implementation of the GSB algorithm uses a simple and efficient flow control mechanism to avoid running out of buffer space [17]. The multicast protocol is part of the Panda System layer, so it can directly access the Parix primitives. The Panda Interface layer is configured to use reliable, totally-ordered multicast, so the group communication protocol is very simple (see Section 4.3). 5.4. The CM-5 implementation The Thinking Machines CM-5 [19] is a multicomputer with a fast network interconnected as a fat tree. The CM-5 runs the CMOST operating system, which is a stripped-down version of SunOS. In contrast with SunOS, the CM-5 provides user-level communication primitives. The basic communication mechanisms on the CM-5 are active messages [13] and block transfers. Like the PowerXplorer, the CM-5 provides reliable unicast but no hardware multicasting. Our Panda implementation therefore uses a similar approach as on Parix, which is to provide reliable unicast and reliable totally-ordered multicast in the System layer. The RPC and group communication protocols, which are identical to the protocols used in the Parix implementation, run on top of these primitives. Reliable unicast is implemented using active messages. The unicast primitive needs two active messages and one block transfer to send a Panda message. The multicast implementation uses a spanning tree broadcast to send multicast messages (as in the Parix implementation). Each hop down the spanning tree involves two active messages (to request buffer space at the receiver) and one block transfer. We did not yet implement a flow-control mechanism for multicast messages on the CM-5 (as we did on Parix). If a receiver does not have a buffer available, the sender will therefore repeatedly ask it for a buffer, until a buffer becomes available. For most applications, this polling behavior is not a problem. Applications usually do not generate a continuous stream of broadcasts and therefore seldom run out of buffer space. Totally-ordered multicast is implemented with the GSB protocol (see Section 4.3). Messages are ordered by requesting a sequence number from the sequencer node. This requires two active messages per multicast. Doing the sequencing in the implementation of the System layer multicast primitive allows the use of active messages. This is more efficient than implementing the sequencing on top of a unicast and an unordered multicast primitive. To implement threads on the CM-5, Panda uses a user-level threads package that cooperates with active messages: whenever all threads are idle, the scheduler polls for pending active messages. The network is only polled when all threads are idle or when a message is sent. Optionally, the Orca compiler can also generate polling calls at the beginning of every loop or procedure entry. The overhead of polling on application performance is substantial, however, (see Section 6.3), so we do not use this option for the CM-5.
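For concreteness, the sketch below outlines the GSB scheme used on both Parix and the CM-5 (Section 4.3). The transport routines are placeholders for the platform's reliable point-to-point send and software spanning-tree broadcast; the names are invented, and details such as flow control and retransmission are omitted.

#include <stddef.h>

struct gsb_msg { unsigned seqno; const void *data; size_t len; };

/* Placeholder transport primitives (not the actual Panda interface). */
extern unsigned ask_sequencer_for_seqno(void);              /* two point-to-point messages */
extern void     spanning_tree_broadcast(const struct gsb_msg *m);
extern void     deliver_to_rts(const struct gsb_msg *m);
extern void     stash_out_of_order(const struct gsb_msg *m);

/* Sender: fetch a global sequence number, then broadcast the tagged message.
 * Three messages in total, but the sequencer itself only handles small
 * point-to-point control messages. */
void gsb_send(const void *data, size_t len)
{
    struct gsb_msg m = { ask_sequencer_for_seqno(), data, len };
    spanning_tree_broadcast(&m);
}

/* Receiver: delivering strictly in sequence-number order gives every node
 * the same total order on group messages. */
void gsb_receive(const struct gsb_msg *m, unsigned *next_expected)
{
    if (m->seqno == *next_expected) {
        deliver_to_rts(m);
        (*next_expected)++;
    } else {
        stash_out_of_order(m);   /* delivered later, once the gap is filled */
    }
}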

6. PERFORMANCE In this section we evaluate the performance of the new Orca system and show that it is both efficient and portable. To be more precise, we state the following two claims:

1. The new portable system performs as well as the original non-portable system.

2. Applications perform well on a range of architectures without any modifications to their source code, although a larger problem size may be needed for architectures with slow communication.

Below, we first analyze the performance of the various layers in the Orca system, by measuring and analyzing the costs of the low-level primitives. We also compare the costs of the Orca primitives of the original and the new Orca implementations. Next, we study the speedups of several realistic applications on Amoeba, again using both the original and new system. Finally, we look at two applications in more detail, and study their performance on several different platforms, using different problem sizes. The measurements for Amoeba were done on a system with 64 Micro-SPARC processors running at 50 MHz. The processors contain 32 MB of local memory and are connected by a 10 Mbit/sec switched Ethernet (using a Kalpana Ethernet switch). For the Solaris measurements we used a collection of 80 MHz Micro-SPARC-II workstations (with 32 MB memory) connected by a 10 Mbit/sec Ethernet, running Solaris 2.4. The measurements for the Parix implementation were done on a Parsytec PowerXplorer. It contains PowerPC601 processors running at 80 MHz, each with 32 MB memory. The processors are interconnected in a grid, using T800 transputers and 20 Mbit/sec links. The CM-5 used for our measurements contains SPARC processors running at 32 MHz. Each processor has 32 MB of memory. The nodes are interconnected by a fat tree network, which provides increasing bandwidth near the root of the tree. The raw bandwidth between the nodes is approximately 20 MByte/sec per direction. 6.1. Low-level performance We first study the cost of several low-level primitives, including Panda's Remote Procedure Call and group communication and Orca's object operations. All measurements we report are from user process to user process. The Panda communication protocols Table 1 shows the latency for empty messages using Panda RPC and group communication on the various systems, expressed in milliseconds as well as number of CPU cycles. RPC null-latency was measured by sending empty request and reply messages between two nodes. Since the latency depends on the network topology and the distance between the nodes, we have measured the latency between all pairs in a configuration with 16 nodes. The average (two-way) latency over all pairs is reported in Table 1. Group latency was measured using a group of 16 members, where two of the members perform a ping-pong test. Each of these two members broadcasts a group message when it receives a message from the other member. The 14 remaining members only receive (and delete) these messages. We report the average one-way latency measured over all pairs in a 16-node configuration.
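As an illustration of how the null-latency figures in Table 1 can be obtained, the fragment below times a simple RPC ping-pong loop; pan_rpc_call and its signature are invented placeholders, not the actual Panda API.

#include <time.h>

/* Placeholder for the Panda RPC primitive (hypothetical name and signature). */
extern void pan_rpc_call(int node, const void *request, int request_len,
                         void *reply, int *reply_len);

/* Returns the average two-way null-latency in milliseconds. */
double rpc_null_latency(int peer, int iterations)
{
    struct timespec start, end;
    int reply_len = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++)
        pan_rpc_call(peer, NULL, 0, NULL, &reply_len);    /* empty request and reply */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    return seconds / iterations * 1e3;
}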

            Amoeba            Solaris           Parix             CM-5
RPC         1.61 (80,500)     2.67 (213,600)    0.86 (68,800)     0.38 (12,160)
Group       1.64 (82,000)     3.51 (280,800)    2.30 (184,000)    0.49 (15,680)

Table 1: Null-latency in milliseconds (and CPU cycles) for Panda RPC and group communication. Figure 7 shows the throughput for Panda's RPC and group communication for various message sizes. The RPC throughput was measured by sending request and reply messages of equal size between a pair of nodes. The average throughput over all pairs in a 16-node configuration is plotted in Figure 7. The group throughput is measured by having one member in a group of 16 send messages as fast as possible to the other members. Since the maximum throughput may depend on the distance to the sequencer, the throughput reported in Figure 7 is the average over all members.

[Figure 7 contains two plots, "Average Panda RPC throughput" and "Average Panda group communication throughput", showing throughput in Kbyte/s (0 to 7000) against message size in bytes (0 to 8192) for the CM-5, Amoeba, Solaris, and the PowerXplorer.]

Figure 7: Panda RPC and group communication throughput.

Obviously, the latencies and throughputs depend very much on the underlying hardware and operating system. Amoeba and Solaris both use Ethernet. On Amoeba, we obtain better latencies than on Solaris, mainly because Amoeba's low-level FLIP primitives are faster (by about 400 microseconds) than the Solaris UDP/IP primitives, and because thread-safe library calls are expensive on Solaris (making the calls thread-safe costs about 300 microseconds). The CM-5 has a much faster network. The RPC performance on the CM-5 somewhat depends on the distance between the sender and receiver. Since the machine uses wormhole routing, the differences in latencies are fairly small; we have measured latencies between 362 and 398 microseconds. The PowerXplorer does not use wormhole routing, so the latency is more sensitive to the distance; we have measured latencies between 710 and 1110 microseconds on this machine. On both the CM-5 and the PowerXplorer, the latencies for group communication also are more sensitive to the distance between the nodes, because forwarding is done in software. We have measured broadcast latencies between 272 and 755 microseconds for the CM-5 and between 1400 and 3500 microseconds for the PowerXplorer.

On Amoeba, Panda achieves a maximum throughput of 914 Kbyte/sec for RPC, which is fairly close to the Ethernet bandwidth. For group communication the throughput is somewhat less (735 Kbyte/sec), because the sender must wait until it receives the message from the sequencer (containing the sequence number) before it can broadcast the next message. The decrease in throughput for 8 Kbyte messages is caused by message fragmentation. To tolerate operating systems that do not support messages of arbitrary lengths, the Panda communication protocols include fragmentation code to send large messages in chunks of 8 Kbytes. Since each chunk must include a Panda header, messages with 8 Kbytes of data are fragmented into two chunks.

The throughput for group communication on the CM-5 is far from optimal, because we did not yet implement flow control for broadcasting on this platform. If a receiver runs out of buffer space, the sender therefore repeatedly polls it until a buffer becomes available. This situation occurs in the group throughput benchmark, because there is only one sender, which broadcasts messages continuously.

We will study the overhead of the Panda user-level protocols by analyzing the latency for RPC and broadcast on Amoeba. The Panda RPC null-latency is 1.61 msec. In comparison, the RPC protocol provided by the Amoeba kernel achieves a latency of 1.27 msec on the same hardware. So, the overhead of the Panda protocol is about 0.3 msec. We have analyzed this difference in detail to determine which fraction of it is due to fundamental limits of a user-space protocol [36]. To summarize, Panda RPC uses two additional context switches compared with Amoeba, caused by running the protocol in user space. This accounts for 140 µsec of overhead. In addition, there are penalties for an increased number of register window traps and address space crossings (50 µsec), increased functionality (fragmentation) (40 µsec), and the low-level FLIP interface (50 µsec), but these costs are not fundamental to user-space protocols. Likewise, the latency for the Panda group communication protocol is about 0.23 msec higher than that for the Amoeba kernel. This difference is mainly caused by an expensive (110 µsec) context switch on the sequencer machine to schedule the user-space sequencer thread after a broadcast request arrives. In contrast, the Amoeba kernel activates the sequencer from within the (software) interrupt handler. Other overhead factors are similar to those for RPC (additional address-space crossings, register window traps, fragmentation, FLIP interface). Running the RPC and group communication protocols in user space thus has an overhead that is not negligible, but it offers much flexibility that can be exploited by higher layers.

The Orca runtime system

Another important low-level primitive is operation invocation in Orca. We use two benchmarks to measure the costs of invocations. ROI (RPC object invocation) is a benchmark using a nonreplicated object containing a single integer. The object is stored on one processor and another processor performs a (remote) increment operation on the object, using Panda RPC. The time needed to invoke a remote operation is reported in Table 2.
For comparison, this table also shows the latency of ROI on the original (nonportable) Orca system described in [1]. To allow a fair comparison, we use the same Orca compiler for the old and the new system. The results thus reflect differences in the runtime systems and communication protocols.
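To illustrate the mechanism being measured, the sketch below shows, in hypothetical C form (not the actual RTS code), how such a remote increment maps onto a single request/reply interaction: the caller marshals an operation descriptor, the owner applies the operation indivisibly to its local copy and returns the marshalled result. All names and the rpc() primitive are placeholders.

/* Hypothetical sketch of the ROI path (remote object invocation). */
#include <stddef.h>

struct op_request {
    int obj_id;      /* which shared object */
    int op_id;       /* which operation, e.g. OP_INCREMENT */
    int arg;         /* marshalled operand */
};

struct op_reply {
    int result;      /* marshalled result value */
};

/* Placeholder for a Panda-style RPC; the real interface differs. */
extern void rpc(int owner_cpu, const void *req, size_t req_len,
                void *rep, size_t rep_len);

/* Caller's side: one RPC per remote operation. */
int remote_increment(int owner_cpu, int obj_id)
{
    struct op_request req = { obj_id, /*OP_INCREMENT*/ 1, 1 };
    struct op_reply rep;
    rpc(owner_cpu, &req, sizeof req, &rep, sizeof rep);
    return rep.result;
}

/* Owner's side, run by the RTS when the request arrives: perform the
 * user-defined operation indivisibly and marshal the result into the reply. */
int serve_increment(int *object_state, const struct op_request *req,
                    struct op_reply *rep)
{
    (void)req;
    *object_state += 1;
    rep->result = *object_state;
    return 0;
}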

To measure the throughput of ROI, the increment operation takes a string parameter (i.e., an array of characters) as input and returns the same string as a result. The throughputs of ROI for various string lengths are given in Figure 8. GOI is a benchmark that measures the invocation time of Orca operations on a replicated integer object. The latency is measured by having two members in a group of 16 in turn perform an increment operation on the replicated object. The latency reported in Table 2 is the average measured over all possible pairs. The throughput for GOI is measured by having one node perform increment operations on the replicated object as fast as possible. Like ROI, the operation takes a string input parameter, so the throughput for various string lengths could be measured. The throughputs reported are the average values over the 16 members.

          original      Amoeba        Solaris        Parix         CM-5
ROI       1.52          1.87          3.23           1.23          0.525
          (76,000)      (93,500)      (258,400)      (98,400)      (16,800)
GOI       1.69          2.03          3.70           2.95          0.758
          (84,500)      (101,500)     (296,000)      (236,000)     (24,256)

Table 2: Null-latency in milliseconds (and CPU cycles) for Orca-level operation invocations on the original RTS (on Amoeba) and on the four Panda-based RTSs.

[Figure 8 here: two panels, "Orca-level remote object invocation throughput" and "Average Orca replicated object invocation throughput"; throughput in Kbyte/s versus message size in bytes (0-8192), with curves for the original system, the CM-5, Amoeba, Solaris, and the PowerXplorer.]

Figure 8: RPC and group object invocation throughput.

Let us first compare the performance of the original and the new Orca system on Amoeba. For the low-level ROI and GOI primitives, the portable system has somewhat higher latencies. The main reason is that the portable system is based on user-space communication protocols, whereas the original system uses Amoeba's kernel-space protocols. ROI uses one RPC, which costs 0.3 msec more with the Panda user-level protocol than with the Amoeba kernel protocol. GOI uses a broadcast message, which has a 0.23 msec higher latency on Panda.

ROI also achieves a higher throughput with the original Orca system. For GOI, the Panda system achieves a higher throughput for small messages, because the Amoeba kernel copies incoming broadcast packets three times to deliver them to the application, whereas the Panda system copies them only twice. In this case, it thus pays off to implement a user-space protocol. For larger messages, Panda suffers from fragmentation overhead, which decreases throughput. We should point out, however, that the original system imposes an upper bound on the size of broadcast messages (because it lacks fragmentation); as a result, applications that need large broadcast messages do not run on the original system (see also Section 6.2).

From Tables 1 and 2 we can also determine the overhead of the Orca RTS compared to Panda RPC and group communication. This overhead is roughly a few hundred microseconds on Amoeba. The overhead costs of the RTS on Amoeba are analyzed in detail in [6]. Most of the overhead (over 40%) is in marshalling and unmarshalling of the parameters of an operation. The increased header sizes in request and reply messages account for another 30% of the overhead. Finally, to maintain sequential consistency, requests and replies are tagged with a timestamp (a sequence number) that is checked by the receiving side [14]. Generating and checking these timestamps accounts for almost 15% of the overhead.

For Amoeba, Solaris, and the PowerXplorer, the throughputs at the Orca level (shown in Figure 8) are close to those at the Panda level (Figure 7). For the CM-5, the Orca-level throughputs are substantially lower. The Orca RTS copies the string parameter of ROI and GOI twice (for marshalling and unmarshalling); the operation itself also copies the parameter once, to assign it to the result parameter. On the CM-5, the relative overhead of this copying is high compared to the basic communication time, so it decreases the throughput.

6.2. Performance for parallel applications on Amoeba

Several people have implemented a large number of parallel applications in Orca [47, 36]. In this section, we study the performance of several of these applications on Amoeba, using both the original and the new Orca system. The goal of this comparison is to show that the portable system performs well compared to the original system. We have selected six applications for our performance study. Together, they are representative of the kinds of applications for which Orca typically is used. They include numerical as well as symbolic applications. Some applications mainly use point-to-point communication, some use broadcasting, and some use both. Below, we briefly describe the applications and the problem sizes we use.

-  ACP (Arc Consistency Problem) [30] takes as input a set of 1,000 variables and constraints on the values of these variables. It determines which values each variable can take, by repeatedly eliminating values that do not satisfy the constraints.

-  ASP (all-pairs shortest paths problem) finds the shortest path between any pair of nodes in a given 1,600-node graph.

-  IDA* is a combinatorial search algorithm that performs repeated depth-first searches. We used it to solve an instance of the 15-puzzle.

-  RA (Retrograde Analysis) computes a 14-piece end-game database for Awari (a 48-piece board game). It starts at the terminal positions with 14 or fewer pieces on the board and reasons backwards to compute the game-theoretical value of all positions in this database.

-  SOR (Successive overrelaxation) is a well-known iterative method for solving discretized Laplace equations on a grid. We use a 482×1,000 grid as input.

-  Water is an application from the SPLASH benchmark suite [42]. We have rewritten this program in Orca and use it to simulate the behavior of a system with 1,024 water molecules.

First, we have used the applications to evaluate our strategy for implementing shared objects. In particular, we are interested in knowing whether the RTS makes the right decisions about object replication and object placement. For this purpose, we have added a special "strategy" call to the RTS that tells the RTS explicitly whether to replicate an object and where to store nonreplicated objects (a hypothetical sketch of such a call is given below). For each application, we manually inserted these strategy calls, using the best strategy for each object. We have compared the performance of each application with and without these calls. In the latter case, the RTS makes the decisions automatically, as described in Section 4.2.1. For each application, the differences in execution time between the two versions are very small: the strategy calls improved the performance by at most 2%. This experiment shows that, at least for the six applications listed above, the RTS is able to make correct decisions about object placement.

Second, we have used the applications to compare the original and new Orca systems. Figure 9 shows the speedups for the six applications on the Amoeba processor pool, using the Panda-based Orca system and the original Orca system. The speedups are computed relative to the parallel program on one CPU. We did not include time for I/O in these measurements, since Amoeba does not support any form of parallel I/O. The execution times on a single CPU are similar on the two Orca implementations. The original RTS fails to make correct decisions about object placement for three applications (IDA*, RA, and Water), resulting in very poor performance. To allow a meaningful comparison, we therefore used the version with explicit strategy calls for all applications.
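The following fragment sketches what such a strategy call might look like. The name, signature, and placement options are hypothetical, since the actual RTS interface is not shown in this paper; the sketch only illustrates that the call overrides the automatic placement decision per object.

/* Hypothetical placement "strategy" call (illustrative only). */
enum placement { PLACE_REPLICATED, PLACE_SINGLE_COPY };

/* Either replicate the object on all processors, or keep a single copy
 * on the given processor.  Placeholder declaration, not the real RTS API. */
extern void rts_strategy(int obj_id, enum placement how, int home_cpu);

void place_objects(void)
{
    rts_strategy(/*obj_id=*/0, PLACE_REPLICATED, -1);   /* read-mostly object  */
    rts_strategy(/*obj_id=*/1, PLACE_SINGLE_COPY, 3);   /* pin on processor 3 */
}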

[Figure 9 here: two panels, "Speedups of Orca applications with the Panda runtime system" and "Speedups of Orca applications with the original runtime system"; speedup versus number of processors (0-64), with curves for perfect speedup, ACP, ASP, IDA*, RA, SOR, and Water.]

Figure 9: Speedups for six applications on Amoeba, using the portable Panda-based RTS (left) and the original Amoeba RTS (right).

With the original Orca system, Water and RA do not run on 64 processors, because they use a broadcast message that exceeds the maximum message size. Panda does not suffer from this problem, because it fragments large broadcast messages (see Section 6.1). As can be seen, the Panda system achieves high speedups for most applications. Even on 64 processors, the relative efficiency usually is around 50%, despite the fact that we use a 10 Mbit/sec Ethernet network. In a few cases (e.g., RA), we had to use large input problems to obtain such efficiency. Several other programs (ASP, ACP, IDA*) use rather small input problems and take about half a minute elapsed time on 64 processors. The only application that does not scale at all is SOR. In this iterative program, each processor exchanges data with its left and right neighbor at the end of every iteration. Since much of this communication takes place at the same time, the Ethernet becomes a bottleneck for this application.

If we compare the speedups of the original and new Orca systems, we see that the new portable system achieves at least the same speedups on all six applications. In conclusion, the new system has somewhat worse performance on low-level benchmarks, but its application performance is as good as that of the original nonportable system.

6.3. Performance for parallel applications on different platforms

In the previous section we have shown that on the same hardware platform the portable Orca system performs as well as the Orca system designed for Amoeba. In this section we compare the performance of the portable Orca system on different hardware platforms. The goal of this experiment is to show that the same applications perform well on a range of architectures, provided that the problem size is large enough for the architecture. For this purpose, we have selected two applications, Water and ASP, that can easily be executed with different problem sizes and that in general give better speedups for larger problems. For Water, the problem size is the number of molecules; for ASP, it is the number of nodes in the graph. Both programs generate a significant amount of communication. If a larger problem is used for these programs, the increase in computation time will be higher than the increase in communication time, so the relative communication overhead decreases.

We have executed both applications on all platforms, using different problem sizes and a fixed number (16) of processors. The efficiencies (defined as speedup divided by the number of processors) for different problem sizes are shown in Figure 10. The time complexities of the algorithms are quadratic for Water and cubic for ASP. The horizontal axis in Figure 10 reflects the time complexity and denotes the number of basic operations rather than the problem size. Our measurement points were chosen taking the algorithmic complexity into account.

To analyze the results, we have measured the total number of operations invoked per second (by all 16 processors together), for a small input problem (see Tables 3 and 4). Here, a remote invocation denotes an invocation on an object stored on a remote processor; such operations are implemented with an RPC. A group invocation denotes a write operation on a replicated object, which is implemented using broadcasting. As can be seen from the tables, the two applications differ in their communication behavior. Water primarily uses remote invocations (i.e., nonreplicated objects). Each processor contains an object with information about the molecules it owns. Data on other processors can be accessed using remote object invocations. ASP mainly uses group invocations. It uses a replicated object for storing pivot rows.
Whenever a processor adds a new row to this object, the row will be broadcast to all machines, which use it to update their part of the distance matrix.
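As an illustration of this communication pattern, the following hedged sketch shows the per-iteration work of one processor in C-like form. The shared row collection is abstracted by the hypothetical operations add_pivot_row() and get_pivot_row(); in the actual Orca program these are operations on a replicated shared object, and a write (add) is broadcast to all machines by the RTS.

/* Sketch of one ASP iteration as seen by one processor (illustrative only). */
#define N 1600                       /* nodes in the input graph */

extern int  dist[N][N];              /* only this processor's rows are used below */
extern void add_pivot_row(int k, const int *row);   /* write on the shared object  */
extern void get_pivot_row(int k, int *row);         /* read; blocks until present  */

void asp_iteration(int k, int my_first_row, int my_last_row)
{
    int pivot[N];

    /* The owner of row k publishes it; everyone receives the broadcast copy. */
    if (k >= my_first_row && k < my_last_row)
        add_pivot_row(k, dist[k]);
    get_pivot_row(k, pivot);

    /* Update only the rows assigned to this processor. */
    for (int i = my_first_row; i < my_last_row; i++)
        for (int j = 0; j < N; j++) {
            long via = (long)dist[i][k] + pivot[j];
            if (via < dist[i][j])
                dist[i][j] = (int)via;
        }
}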

[Figure 10 here: two panels, "Efficiency of Water on 16 processors" and "Efficiency of ASP on 16 processors"; efficiency (0-100%) versus number of basic operations (up to 1600^2 for Water and from 800^3 to 2000^3 for ASP), with curves for the CM-5, Amoeba, Solaris, and the PowerXplorer.]

Figure 10: Efficiency of the Water and ASP programs on 16 CPUs.

                        Amoeba     Solaris    Parix      CM-5
Remote invocations      401.0      757.3      1528.3     503.0
Group invocations       77.6       146.7      296.0      97.4

Table 3: Invocations per second for Water, using 16 CPUs and a problem with 574 molecules.

                        Amoeba     Solaris    Parix      CM-5
Remote invocations      2.9        2.9        4.4        2.4
Group invocations       75.7       78.6       118.5      62.9

Table 4: Invocations per second for ASP, using 16 CPUs and a problem with 800 nodes.

The results in Figure 10 show that the CM-5 can handle the smallest problem sizes. For example, the CM-5 obtains 41% efficiency for Water for an input problem with 100 molecules, whereas the other systems achieve an efficiency below 25% for this problem size. For ASP, the CM-5 also obtains the highest efficiency, even though this program primarily uses broadcast communication, which is implemented entirely in software on the CM-5. The application did not suffer at all from the lack of flow control for broadcasting, because (unlike the lower-level benchmarks) it does not generate broadcast messages continuously.

We also measured the performance of a slightly different implementation of Water and ASP on the CM-5, in which the compiler generates a polling statement at the beginning of every loop and procedure entry (to receive messages more quickly). The overhead of polling turned out to be as high as 30%, so adding polling statements decreases performance.
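To illustrate what this transformation looks like, the fragment below shows a procedure after polling statements have been inserted; poll_network() is a hypothetical name for the primitive that drains the network and runs handlers for pending messages, and the transformation shown is only a sketch of the idea, not the compiler's actual output.

/* Hedged illustration of compiler-inserted polling on the CM-5 variant. */
extern void poll_network(void);      /* placeholder for the polling primitive */

double dot(const double *x, const double *y, int n)
{
    poll_network();                  /* inserted at procedure entry */
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        poll_network();              /* inserted at the top of every loop body */
        sum += x[i] * y[i];
    }
    return sum;
}

In tight inner loops such as this one, the poll is executed on every iteration, which is consistent with the 30% overhead reported above.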


For Solaris, we need a large problem size to obtain a high efficiency. There are two reasons for this. First, the Panda communication protocols for Solaris are slower than on the other platforms (because of the overhead of UDP/IP and thread-safe library calls). Second, the processors used for the Solaris measurements are faster than those of most other platforms, so the relative communication overhead is higher. On the PowerXplorer, we also need a relatively large problem size, because this system also combines relatively fast processors with slow communication. The performance of ASP lags behind that of the other machines, because the software spanning-tree broadcast protocol is relatively expensive. (The Water program on the PowerXplorer obtains an efficiency over 100% for 1,500 molecules, which is due to caching effects, so this data point is omitted from Figure 10.)

In conclusion, both applications achieve high efficiencies on all platforms, without requiring any modification to the Orca source programs. For Water, the efficiency for large input problems ranges from 74% (for Solaris) to 92% (for the CM-5). For ASP, the maximum efficiency ranges from 60% (for the PowerXplorer) to 88% (for the CM-5).

7. DISCUSSION

We have been working on implementations of shared objects on various platforms for several years now. Also, various people have used the resulting system for writing realistic parallel applications. In particular, numerous students and visitors at the Vrije Universiteit have written parallel applications in Orca. In this section, we summarize the lessons we have learned from implementing and using the Orca system. The lessons are organized along the three main goals of Orca: ease of use, portability, and efficiency.

7.1. Ease of use

The users generally agree that Orca is very easy to learn and use. Many users had little or no prior experience in parallel programming, but were familiar with one or more imperative sequential languages. The main new concept they had to understand was shared objects. The users agreed that implicit communication through shared objects is easy to program with. Type-security of the language turned out to be a very useful property, since it eases debugging and testing of programs.

The main feature missing in our programming model is the ability to partition objects. The RTS will replicate or migrate entire objects, but it will not partition a single object among multiple machines. For some applications, however, this would be desirable. We are working on a data-parallel extension of our model that will support partitioned objects [16]. Another restriction in our model is that operations are always applied to single objects. In our experience, many applications do not need indivisible operations on multiple objects, but occasionally such functionality would be useful. We are currently working on another extension of our model, the atomic function, that provides this functionality in an efficient way, using compiler-generated information to optimize the communication overhead [39].

The most difficult problem application programmers are faced with is performance tuning. Since the language has a high level of abstraction, programmers do not always understand the performance implications of the choices they make. We addressed this issue by providing a performance visualization tool. We have modified the Upshot tool from Argonne [18] to make it support shared objects [21]. The modified tool supports visualization at two different levels of abstraction.
Most application writers will use the highest level, which visualizes processes and operations on objects.

The second level provides more detail and also shows the messages generated for operations on shared objects. Using this level requires more understanding of what is going on behind the scenes, but it does not require knowledge of all the implementation details of the system.

7.2. Portability

An important goal in our research is to make both Orca application programs and the programming system portable. Most Orca applications were developed initially on Amoeba and subsequently ported to other platforms. As we have shown in Section 6, many programs achieve good performance on a range of architectures, without any changes to the source code. One possible disadvantage of our approach is that the decomposition strategy is built into the application program. For some applications, the best decomposition strategy may depend on the underlying hardware [7]. ASP, for example, uses a row-wise partitioning of the matrix, since that is most efficient on an Ethernet-based system. On MPPs, however, a block-wise partitioning can be more efficient. Languages like HPF [28] separate the decomposition from the main algorithm, which makes it easier to change the strategy while porting the application to a different target platform. Such languages, however, often are only suitable for data parallelism. We address this problem with the partitioned objects mentioned above. Also, this extension allows us to support mixed task and data parallelism within a single program [16].

With regard to the portability of the Orca programming system, it should be clear that the compiler and runtime system are very easy to port. Virtually all changes made to these parts of the system were due to bugs in either our own software or in the target C compiler. In contrast, porting the Panda layer requires some effort, since it is partly platform-dependent. In all cases, the initial system was running quickly, but in some cases substantial time was spent tuning the system.

7.3. Efficiency

Dynamic replication of shared data is essential to obtain good performance in DSM systems. We believe that a user-defined object is a good unit for replication (and synchronization), because it avoids problems with false sharing and it makes update protocols effective. An important issue is how to decide where to store objects and which objects to replicate. Our goal is to have the system make these decisions. As we have shown in Section 6.2, the system is able to make a correct decision for all applications discussed in this paper, using both compiler-generated information and runtime statistics of object invocations.

Another important issue is the performance overhead of the layered approach. On traditional distributed systems, where communication overhead is high, the relative overhead of our approach is moderate. On an Ethernet-based Amoeba system, for example, the latency for a remote object invocation is 1.87 msec, using the user-space Panda protocols (see Section 6.1). In comparison, an Amoeba RPC (using the kernel-space Amoeba protocol) takes 1.27 msec. Our layered system, however, is not suited for (or even designed for) exploiting fine-grained parallelism. An approach that allows more fine-grained parallelism has been investigated in [46]. The idea is to augment the Panda approach by giving the compiler access to user-level communication primitives. The compiler generates message-handling code for each operation. This code is executed as an active message [13] handler whenever a request for the operation arrives. In contrast, the Panda system will deliver each incoming message to a server thread, which unmarshalls the request and then executes the operation. Even though the Panda system can use active messages at the lowest level, it still has a high overhead of context switching and marshalling and unmarshalling. The main problem with using active messages is that the active message handler is not allowed to block, since that would result in deadlock. This problem can be solved using a new technique called optimistic active messages [46], which detects dynamically whether an operation tries to block, and then creates a new thread to execute it. With this approach, nonblocking operations are executed highly efficiently, and blocking operations have a similar overhead as in the original Panda system. This approach allows much more fine-grained parallelism in Orca on architectures such as the CM-5 [46]. The lesson learned from this is that making the low-level hardware interface invisible to the higher layers of the software is acceptable for systems where communication is slow, but it is not a good idea when the goal is to support fine-grained parallelism on systems with fast, user-level communication. For the latter systems, the extension described above greatly improves performance, at the cost of making the compiler harder to port.

The final issue we want to discuss is broadcasting. Much of our work depends on the performance of a reliable broadcast primitive. In our experience, broadcasting can be implemented efficiently on a wide range of architectures. Some networks (e.g., Ethernet) provide broadcasting in hardware. Even if the hardware primitive is not reliable, software protocols can be used to make broadcasting reliable with little overhead [23]. Many modern networks support reliable point-to-point communication but not broadcast communication. On such systems, we implement a broadcast primitive using a spanning-tree forwarding scheme. We have implemented such schemes on the CM-5, on a transputer system [17], and on a Myrinet network [45]. The performance generally is quite good. On an 8-node Myrinet cluster, for example, we are able to do 20,000 (small) broadcasts per second, which is much better than what can be obtained using Ethernet's hardware broadcast. The most important technical issue we had to deal with is flow control, to prevent overflow of receive buffers [45].

As explained in Section 4, we use totally-ordered broadcasting to obtain coherency of replicated objects. There are many different ways of making broadcasting totally ordered. A general idea, common to all our protocols, is to use a sequencer to order the messages. The protocols differ in the role the sequencer plays. In the simplest case (the GSB protocol), the sequencer merely hands out sequence numbers. For the PB and BB protocols, the sequencer broadcasts the data or an acknowledgement message. In theory, the sequencer may become a central bottleneck. In practice, this is seldom a problem, because the sequencer has very little work to do. Totally-ordered broadcasting thus can be implemented efficiently, at least on medium-scale machines. For parallel machines with a very large number of nodes, a directory-based scheme [8, 22, 26] may be an attractive alternative.
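As an illustration of the sequencer-based ordering in its simplest form (the GSB case), the sketch below shows a sender that first fetches a sequence number and then broadcasts the tagged message, and a receiver that delivers messages strictly in sequence-number order. The primitives are hypothetical placeholders, not the actual Panda protocol interface, and buffering or retransmission of out-of-order and lost messages is omitted.

/* Hedged sketch of GSB-style totally-ordered broadcasting. */
#include <stddef.h>

extern int  sequencer_next(void);                    /* ask the sequencer for the next number */
extern void raw_broadcast(int seqno, const void *msg, size_t len);
extern int  raw_receive(int *seqno, void *buf, size_t len);
extern void deliver(const void *msg, size_t len);    /* hand the message to the RTS */

void ordered_broadcast(const void *msg, size_t len)
{
    int seqno = sequencer_next();    /* the sequencer only hands out numbers */
    raw_broadcast(seqno, msg, len);
}

void receive_loop(void)
{
    static int expected = 0;         /* next sequence number to deliver */
    char buf[8192];
    int seqno;

    for (;;) {
        int len = raw_receive(&seqno, buf, sizeof buf);
        if (seqno == expected) {     /* in order: deliver immediately */
            deliver(buf, len);
            expected++;
        }
        /* Out-of-order or missing messages would be buffered or re-requested
         * here; that part of the protocol is omitted from this sketch. */
    }
}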
8. SUMMARY

We have designed and implemented an object-based Distributed Shared Memory system called Orca, which can be used to write parallel applications that run on a range of parallel machines. Many applications have been written in Orca by various people. The experience of these users is that Orca is easy to learn and use. In addition, Orca programs are portable and can run unmodified on different platforms. The Orca system itself also is designed to be portable. To achieve portability, we use a virtual machine that (for efficiency reasons) can be configured statically. The entire Orca system has been implemented on several different parallel machines and on collections of workstations.

A key idea in the implementation is to have the compiler compute patterns that describe how shared objects will be used. The runtime system uses this information (together with runtime statistics) to determine which objects to replicate and where to store nonreplicated objects. We have evaluated the performance of the system on several platforms, using benchmarks as well as applications. The results show that the portable Orca system performs as well as the original (nonportable) system and that high efficiencies can be obtained for a range of parallel applications.

Note

A distribution of the Orca and Panda software (for Solaris 2) is available from the World Wide Web, URL http://www.cs.vu.nl/orca/.

References

1. H.E. Bal and M.F. Kaashoek, "Object Distribution in Orca using Compile-Time and Run-Time Techniques," Proc. Conf. on Object-Oriented Programming Systems, Languages, and Applications 1993 (OOPSLA), Washington, D.C., pp. 162-177 (26 Sept. 1993 - 1 Oct. 1993).

2. H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum, "Orca: A Language for Parallel Programming of Distributed Systems," IEEE Trans. on Software Engineering 18(3), pp. 190-205 (March 1992).

3. H.E. Bal, M.F. Kaashoek, A.S. Tanenbaum, and J. Jansen, "Replication Techniques for Speeding up Parallel Applications on Distributed Systems," Concurrency: Practice & Experience 4(5), pp. 337-355 (Aug. 1992).

4. J.K. Bennett, J.B. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence," Proc. Second Symposium on Principles and Practice of Parallel Programming, Seattle, WA, pp. 168-176 (March 1990).

5. B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon, "The Midway Distributed Shared Memory System," Proc. COMPCON 1993, pp. 528-537 (1993).

6. R.A.F. Bhoedjang and K.G. Langendoen, "Friendly and Efficient Message Handling," Proceedings of the 29th Hawaii International Conference on System Sciences, Hawaii, pp. 121-130 (Jan. 1996).

7. E.A. Brewer, "High-Level Optimization via Automated Statistical Modelling," Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, CA, pp. 80-91 (July 1995).

8. D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal, "Directory-Based Cache Coherence in Large-Scale Multiprocessors," IEEE Computer, pp. 49-58 (June 1990).

9. A.L. Cox, S. Dwarkadas, P. Keleher, and W. Zwaenepoel, "TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems," Proceedings of the Winter 94 Usenix Conference, pp. 115-131 (Jan. 1994).

10. S.E. Deering and D.R. Cheriton, "Multicast Routing in Datagram Internetworks and Extended LANs," ACM Transactions on Computer Systems 8(2), pp. 85-110 (May 1990).

11. E.W. Dijkstra, "Guarded Commands, Nondeterminacy, and Formal Derivation of Programs," Commun. ACM 18(8), pp. 453-457 (Aug. 1975).

12. R.P. Draves, B.N. Bershad, R.F. Rashid, and R.W. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," Proc. Thirteenth Symposium on Operating System Principles, Pacific Grove, CA, pp. 122-136 (Oct. 1991).

13. T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser, "Active Messages: a Mechanism for Integrated Communication and Computation," Proc. 19th Annual International Symposium on Computer Architecture, Gold Coast, Australia, pp. 256-267 (May 1992).

14. A. Fekete, F. Kaashoek, and N. Lynch, "Implementing Shared Objects Using Multicast Communication," 15th International Conference on Distributed Computing Systems, Vancouver, B.C., Canada, pp. 439-449 (May 1995).

15. B.D. Fleisch and G.J. Popek, "Mirage: A Coherent Distributed Shared Memory Design," Proc. of the 12th ACM Symp. on Operating System Principles, Litchfield Park, AZ, pp. 211-223 (Dec. 1989).

16. S. Ben Hassen and H.E. Bal, "Integrating Task and Data Parallelism Using Shared Objects," 10th ACM International Conference on Supercomputing, Philadelphia, PA, pp. 317-324 (May 1996).

17. H-P. Heinzle, H.E. Bal, and K. Langendoen, "Implementing Object-Based Distributed Shared Memory on Transputers," pp. 390-405 in World Transputer Congress 1994, ed. A. De Gloria, M.R. Jane, and D. Marini, IOS Press, Lake Como, Italy (Sept. 1994).

18. V. Herrarte and E. Lusk, "Studying parallel program behavior with Upshot," Technical Report ANL-91/15, Argonne National Laboratory, Argonne, IL (1991).

19. W.D. Hillis and L.W. Tucker, "The CM-5 Connection Machine: A Scalable Supercomputer," Commun. ACM 36(11), pp. 30-40 (Nov. 1993).

20. C.A.R. Hoare, "The Emperor's Old Clothes," Commun. ACM 24(2), pp. 75-83 (Feb. 1981).

21. R. Hofman, K. Langendoen, and H.E. Bal, "Visualizing High-Level Communication and Synchronization," IEEE Int. Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), Singapore, pp. 37-43 (June 1996).

22. K.L. Johnson, M.F. Kaashoek, and D.A. Wallach, "CRL: High-performance All-software Distributed Shared Memory," Proc. 15th Symposium on Operating System Principles, Copper Mountain Resort, CO, pp. 213-228 (Dec. 1995).

23. M.F. Kaashoek, "Group Communication in Distributed Computer Systems," Ph.D. Thesis, Vrije Universiteit, Amsterdam (1992).

24. M.F. Kaashoek, R. van Renesse, H. van Staveren, and A.S. Tanenbaum, "FLIP: an Internet Protocol for Supporting Distributed Systems," ACM Trans. Comp. Syst. 11(1), pp. 73-106 (Feb. 1993).

25. J.W. Lee, "Concord: Re-Thinking the Division of Labor in a Distributed Shared Memory System," Report TR-93-12-05, University of Washington, Seattle, WA (1993).

26. D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M.S. Lam, "The Stanford DASH Multiprocessor," IEEE Computer 25(3), pp. 63-79 (March 1992).

27. K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM Trans. Comp. Syst. 7(4), pp. 321-359 (Nov. 1989).

28. D.B. Loveman, "High Performance Fortran," IEEE Parallel and Distributed Technology 1(1), pp. 25-41 (Feb. 1993).

29. H. Lu, S. Dwarkadas, A.L. Cox, and W. Zwaenepoel, "Message Passing Versus Distributed Shared Memory on Networks of Workstations," Proceedings of Supercomputing '95, San Diego, CA (Dec. 1995).

30. A.K. Mackworth, "Consistency in Networks of Relations," Artificial Intelligence 8, pp. 99-118 (1977).

31. M. Castro, P. Guedes, M. Sequeira, and M. Costa, "Efficient and Flexible Object Sharing," Report RT 18-95, IST-INESC, Lisboa, Portugal (July 1995).

32. R.G. Minnich and D.J. Farber, "Reducing Host Load, Network Load, and Latency in a Distributed Shared Memory," Proceedings 10th International Conference on Distributed Computing Systems, Paris, pp. 468-475 (May 1990).

33. F. Mueller, "A Library Implementation of POSIX Threads under UNIX," Proc. USENIX Conference Winter '93, San Diego, CA, pp. 29-41 (Jan. 1993).

34. R. Nikhil, "Cid: A Parallel, Shared-memory C for Distributed-memory Machines," Proc. 7th Ann. Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY (Aug. 1994).

35. B. Nitzberg and V. Lo, "Distributed Shared Memory: a Survey of Issues and Algorithms," IEEE Computer 24(8), pp. 52-60 (Aug. 1991).

36. M. Oey, K. Langendoen, and H.E. Bal, "Comparing Kernel-space and User-space Communication Protocols on Amoeba," 15th International Conference on Distributed Computing Systems, Vancouver, B.C., Canada, pp. 238-245 (May 1995).

37. Parsytec, Parix Release 1.2 Reference Manual, Parsytec Computer GmbH, Aachen, Germany (1993).

38. L.L. Peterson, N. Hutchinson, S. O'Malley, and H. Rao, "The x-kernel: A Platform for Accessing Internet Resources," IEEE Computer 23(5), pp. 23-33 (May 1990).

39. T. Rühl and H.E. Bal, "Optimizing Atomic Functions using Compile-Time Information," Working Conference on Massively Parallel Programming Models (MPPM-95), Berlin, pp. 68-75 (Oct. 1995).

40. T. Rühl, H.E. Bal, R. Bhoedjang, K. Langendoen, and G. Benson, "Experience with a Portability Layer for Implementing Parallel Programming Systems," Int. Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), Sunnyvale, CA (Aug. 1996).

41. D.J. Scales and M.S. Lam, "The Design and Evaluation of a Shared Object System for Distributed Memory Machines," Proc. First USENIX Symp. on Operating System Design and Implementation, pp. 101-114 (Nov. 1994).

42. J.P. Singh, W-D. Weber, and A. Gupta, "SPLASH: Stanford Parallel Applications for Shared Memory," ACM Comp. Arch. News 20(1), pp. 5-44 (March 1992).

43. V.S. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice & Experience 2(4), pp. 315-339 (Dec. 1990).

44. A.S. Tanenbaum, R. van Renesse, H. van Staveren, G. Sharp, S.J. Mullender, A. Jansen, and G. van Rossum, "Experiences with the Amoeba Distributed Operating System," Commun. ACM 33(12), pp. 46-63 (Dec. 1990).

45. K. Verstoep, K. Langendoen, and H.E. Bal, "Efficient Reliable Multicast on Myrinet," 1996 Int. Conference on Parallel Processing, Bloomingdale, IL (Aug. 1996).

46. D.A. Wallach, W.C. Hsieh, K.L. Johnson, M.F. Kaashoek, and W.E. Weihl, "Optimistic Active Messages: A Mechanism for Scheduling Communication with Computation," Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, CA, pp. 217-226 (July 1995).

47. G.V. Wilson and H.E. Bal, "An Empirical Assessment of the Usability of Orca Using the Cowichan Problems," IEEE Parallel and Distributed Technology 4(3) (Fall 1996).
