A Parallel Disk Storage System for Realtime Multimedia Applications (To appear in International Journal of Intelligent Systems, special issue on Multimedia Computing Systems, 1998)

Richard Muntz, Jose Renato Santos, Steven Bersony UCLA Computer Science Department Abstract We describe the design and implementation of the RIO (Randomized I/O) multimedia object server which manages a set of parallel disks and supports real-time throughput and statistical delay guarantees. This storage subsystem was implemented as part of a multimedia information server which supports multiple concurrent applications such as video on demand and 3D interactive virtual worlds. We discuss the principal issues and innovations involved in the design and implementation of the RIO storage system, and present experimental performance results measured on a prototype implementation. A multimedia data server must be ready to handle a variety of realtime object types (video, audio, 3D interactive virtual worlds, etc.) as well as a non realtime workload component. Achieving simultaneously (1) high utilization and (2) low latency with a high degree of certainty is the challenge. Our prototype system provides a statistical guarantee of quality of service. Our experimental results show that it is possible to achieve a very small probability of exceeding a deadline (less than 10?6), with a relatively high disk utilization (70% to 99%, in terms of fraction of the maximum disk capacity) , together with a relatively small delay bound (on the order of 0.5 sec. to 1.5 sec ), using contemporary disks. These results were achieved using random allocation of disk blocks and replication in conjunction with online load balancing.

Richard Muntz, Jose Renato Santos, Steven Berson
UCLA Computer Science Department

(This research was supported in part by grants from Sun Microsystems, Intel Corp., and NSF grant IRI-9527178. The second author is also with the University of Sao Paulo; his research is partially supported by a fellowship from CNPq (Brazil). Steven Berson is now at USC Information Sciences Institute.)

Abstract

We describe the design and implementation of the RIO (Randomized I/O) multimedia object server, which manages a set of parallel disks and supports realtime throughput and statistical delay guarantees. This storage subsystem was implemented as part of a multimedia information server which supports multiple concurrent applications such as video on demand and 3D interactive virtual worlds. We discuss the principal issues and innovations involved in the design and implementation of the RIO storage system, and present experimental performance results measured on a prototype implementation. A multimedia data server must be ready to handle a variety of realtime object types (video, audio, 3D interactive virtual worlds, etc.) as well as a non-realtime workload component. The challenge is to achieve simultaneously (1) high utilization and (2) low latency with a high degree of certainty. Our prototype system provides a statistical guarantee of quality of service. Our experimental results show that it is possible to achieve a very small probability of exceeding a deadline (less than 10^-6), with a relatively high disk utilization (70% to 99%, in terms of the fraction of the maximum disk capacity), together with a relatively small delay bound (on the order of 0.5 sec to 1.5 sec), using contemporary disks. These results were achieved using random allocation of disk blocks and replication in conjunction with online load balancing.

1 Introduction

1.1 Motivation

Advances in technology have enabled the increasing use of information systems for storing and retrieving multimedia data, such as images, video, audio, 3D graphics, etc. These new data types have different characteristics and require new approaches when designing systems. High bandwidth and large storage space requirements are among these characteristics. Moreover, continuous media data impose deadlines on the retrieval and delivery of information; namely, data objects must be present at the client platform prior to the time they are needed for display. Failure to provide realtime service can result in disruptions and delays which compromise the quality of the presentation observed by the end user. To achieve high quality continuous media playout, such disruptions must be reduced to very low levels.

The realtime characteristics of continuous media affect the design of multimedia systems at different levels, such as data storage, information management, processor and memory management, data communication, data display, etc. In this paper we concentrate only on storage system issues for multimedia data. The other aspects of multimedia systems are equally important, but are not discussed here. Our assumption is that multimedia objects are too large to fit entirely in main memory and need to be retrieved from disk on demand. Further, parallel disks are required for the bandwidth and storage capacities anticipated for multiple clients and for high performance applications.

Although video and audio require realtime service, their workloads have a certain degree of predictability since a typical playout stream accesses data sequentially. This predictability has been exploited in many multimedia data server designs in which, based on the sequential access pattern, data is carefully laid out on disk drives such that contention for the drives is avoided and realtime guarantees can be made. This approach can work well when the workload is highly predictable. However, in practice several factors reduce the workload predictability (even for video) and make the problem of optimal data layout very difficult, if possible at all. One of these factors is that video and audio are generally compressed by encoding techniques such as MPEG1 and MPEG2. In order to achieve a constant display quality, these encoding techniques may generate variable bit rate (VBR) media streams, which introduces temporal variability into the I/O pattern. In addition, providing VCR features such as Pause, Fast Forward and Rewind also reduces the predictability of a stream's I/O access pattern. Finally, multi-resolution encoding schemes such as are found in the MPEG standards complicate data layout and I/O scheduling [9].

New multimedia applications, such as 3D interactive virtual worlds, have I/O patterns much less predictable than video or audio. In a 3D interactive virtual world application the user navigates through large 3D graphic models at variable speed and along user-controlled paths. In order for the display engine to show the current world view, the graphical models of nearby 3D objects need to be continuously retrieved from disk as the user moves. The access pattern to storage objects thus depends on the speeds and paths selected by the user, which makes prediction imperfect at best. 3D virtual world models have been used for different applications such as architectural building design [10], urban city models [18], scientific visualization [19][12],
etc., and will be increasingly common in the future.

Because of the difficulties in predicting the I/O pattern of multimedia data access, we believe that multimedia data servers will move towards solutions that do not rely on a careful data layout designed to match a predicted pattern of access. Our approach to the problem is to use a random allocation scheme for laying out data on disks, which results in a uniformly random access pattern at the physical level regardless of any reference pattern at the logical level [1]. This approach has the advantage of mapping all workloads and all access patterns of different multimedia applications into the same workload at the physical disk access level. Thus we have a single problem to solve: for a stream of independent I/O requests which are uniformly distributed over the parallel disks, design the scheduling algorithm to support the largest possible throughput for a given maximum delay and a given probability of exceeding the delay bound. If we can satisfactorily solve this problem, the same storage system can be used for any multimedia application. The RIO multimedia object server is our solution, and it is offered as a generic multimedia storage system capable of efficient, concurrent retrieval of many types of media objects.

RIO is the storage subsystem for a multimedia information system under development at UCLA called the Virtual World Data Server (VWDS). The VWDS addresses many different issues including data storage, memory management, network communication, admission control, traffic policing and shaping, and also specific issues of 3D model virtual worlds, such as spatial indexing, adaptive quality of service, user motion and perception modeling, etc. However, in this paper we focus only on the storage subsystem of the data server, and describe just enough of the other system components to put the storage system in context. The prototype is implemented on a Sun E4000 machine, having 10 UltraSPARC processors, 1.25 Gbytes of shared memory, fourteen 4 Gbyte disks dedicated to multimedia data storage, and ATM and Ethernet connections to client machines. The first prototype is operational and has been used to simultaneously support delivery of MPEG encoded videos and 3D urban simulation city models. Other applications for realtime scientific visualization and medical VR are also under development. A port of the VWDS to a cluster of workstations is also underway.

[1] There are a number of proposals for layout of video data on parallel disks. The most common method proposed is to stripe each object across the parallel disks using a fixed size stripe granule (i.e., disk block). While the allocation of a disk block on a disk is often random, logically consecutive blocks are typically allocated in strictly round-robin order to disks. In RIO we randomly select the disk to hold each data block as well as randomly select the location of the block on the disk.

1.2 Paper organization

This paper is organized as follows. In Section 2 we briefly describe the overall architecture of the VWDS. In Section 3 we discuss the novel aspects of the RIO system and compare its design with that of other approaches. In Section 4 we present and discuss the principal experimental performance results obtained from the prototype implementation. In Section 5 we extend the performance study introduced in Section 4, using a validated simulation to analyze the system performance when using a larger number of disks than were available in our prototype. Section 5 also presents some analysis of fault tolerance issues. In Section 6 we compare the performance of RIO to a striping system optimized for CBR (Constant Bit Rate) video with one playout rate for all videos. The comparison is, in a strong sense, a worst case for RIO, which is not optimized for this one type of traffic and can handle more general workloads. Our results indicate that RIO is at least competitive and can sometimes outperform the striping systems even in this case. Finally, in Section 7 we offer our conclusions.

2 The VWDS Data Server

Figure 1: VWDS Data Server Architecture.

In this section we give a brief description of the principal components of the VWDS. Figure 1 illustrates a typical configuration in which client machines access the data server through a high speed network (ATM in the prototype). Users run applications such as video players, 3D virtual world simulations, etc., on client machines that access data stored on the VWDS. Clients periodically send telemetry information to the server, which is used to determine the data objects that need to be sent to the client in the immediate future. For 3D virtual worlds, telemetry information consists of the current eye position, direction of movement, speed, etc. The server uses this information, combined with a model of user mobility, to generate a list of scene objects (logical subdivisions of data, representing 3D graphical objects' geometry and texture) that need to be sent to the client and the earliest time each is needed. In the case of video, telemetry information consists basically of user commands for VCR functionality such as Pause, Fast Forward, and Rewind.

Figure 2: RIO Data Server Software Architecture.

Figure 2 gives a slightly expanded description of the VWDS architecture on the server side. Three major components of the system are identified: the RIO storage server, the mediator, and the session managers. A session manager is specific to a type of application. Within a session manager, new threads are dynamically created as users start new sessions with the VWDS. The current prototype has session managers for two types of applications: video and 3D virtual worlds. The 3D virtual world session manager is the more complex of the two; it is responsible for the logical organization of 3D models in terms of scene objects and it utilizes a spatial index for accessing the virtual world data objects. The 3D session manager predicts the logical scene objects that will be needed on the client in the near future, based on the most recent telemetry information, a motion model of the user, and a description of the viewing frustum. A list of the needed scene objects, with deadlines, is then submitted to the mediator. The video session manager is much simpler and basically generates periodic requests to logically sequential blocks of data of a video object, according to the current temporal position of the data stream.

The mediator is responsible for managing the available system resources, such as disk bandwidth, network bandwidth and buffer space, across all active sessions. The resources are managed by the mediator using a combination of techniques at two time scales. The first of these is the admission control scheme, which determines whether there are sufficient resources to accommodate a new session. The decision is based on the resources required for the current sessions and a description of the workload of the proposed new session. A more detailed description of admission control for the storage system is given later. Although not implemented in our current prototype, in future versions the mediator will use alternative versions of data objects to adapt a session's current demand for system resources based on the current competition for resources from all sessions. The mediator is also responsible for managing the data buffer space and for mapping an access to a logical object to a set of physical disk block requests which are then submitted to the RIO storage server.

The RIO storage subsystem is responsible for the allocation/deallocation of space for multimedia data objects on the multiple disks of the system and for scheduling disk block accesses to meet the delay bound. This system component is the focus of this paper and is described and analyzed in the following sections.

3 Realtime disk scheduling

3.1 Realtime scheduling of a single disk

The general problem of scheduling realtime requests to a single server is well understood and has been widely studied. Liu & Layland [21] showed that for a set of preemptable tasks generating periodic requests, an optimal scheduling policy is the EDF (Earliest Deadline First) algorithm, which schedules requests in the order of their deadlines, possibly preempting requests being serviced when a request with an earlier deadline arrives. The EDF algorithm is optimal in the sense that if there exists a feasible schedule where no request misses its deadline, then scheduling the requests according to the EDF policy will guarantee that no request misses its deadline. Later, Jeffay et al. [17] extended this result and showed that the EDF policy is also optimal for periodic and sporadic tasks which are not preemptable. However, these results assume a work conserving server, which is clearly not the case for disk I/O. Serving disk requests ordered by their deadlines (request time + delay bound) can lead to poor performance since this ignores disk seek overhead.

The problem of optimizing disk I/O has been extensively studied in the case of non-realtime I/O and several scheduling algorithms have been proposed. The most important and well known algorithms are variations of the SCAN algorithm, in which requests are processed in the order of their cylinder position on the disk as the disk head moves from the outermost cylinder to the innermost or vice versa. A realtime scheduling algorithm must compromise between optimizing the I/O cost and avoiding requests exceeding the delay bound. Thus, a combination of disk I/O optimization techniques with realtime scheduling algorithms is required to achieve the best results. We are not aware of any theoretical work that results in an optimal disk scheduling policy for realtime requests. However, it is reasonable to assume that combining traditional disk I/O optimization techniques with realtime scheduling algorithms can lead to better disk utilization than simple realtime scheduling based only on the request deadlines. Combining disk SCAN algorithms with realtime scheduling has also been proposed in other related work [26][35]. Although minimizing rotational latency is also possible, it is more complex and difficult to implement outside the disk controller.

We therefore adopted a modified bidirectional SCAN or elevator algorithm, called RTSCAN (realtime SCAN), in which the number of requests that can be served in any scan is limited to a maximum value n_max. Limiting the number of requests allows the duration of a disk cycle (i.e., one scan of the disk arm) to be bounded. The maximum cycle time is important when a delay bound must be guaranteed. We assume that a disk cycle is non-preemptable, and requests arriving during one disk cycle are deferred to the next disk cycle. At the beginning of each cycle the queued requests, up to n_max, are scheduled in the next cycle. The requests are ordered according to their cylinder position on the disk surface and then served in increasing (or decreasing) order of cylinder position, depending on the direction of the next scan.

Figure 3: Successful request worst delay.

We call a request successful if it is served in the cycle following the one that is active when it arrives, and we call it a delayed request otherwise. Since requests can be served in any order within a given cycle, the worst case delay of a successful request happens when the request arrives just after a cycle starts and it is the last request to be served in the following cycle, as illustrated in Figure 3. Thus, the maximum delay of a successful request is twice the maximum duration of a cycle, T_max. The delay bound guaranteed by the RIO subsystem will be in the neighborhood of 2T_max but will ultimately have to be determined empirically.
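
To make the scheduling policy concrete, the sketch below shows the RTSCAN idea in roughly 20 lines of Python. It is an illustration written for this description, not code from the RIO prototype; the names (Request, RTScanDisk, serve_io) are made up, and details such as deadline bookkeeping and zone-aware service times are omitted.

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Request:
        cylinder: int
        deadline: float          # arrival time + delay bound (not used for ordering)

    class RTScanDisk:
        def __init__(self, n_max):
            self.n_max = n_max
            self.pending = deque()       # arrivals during a cycle wait for the next one
            self.direction_up = True     # scan direction alternates every cycle

        def submit(self, req):
            self.pending.append(req)

        def run_cycle(self, serve_io):
            # Take at most n_max queued requests; any excess is deferred again.
            batch = [self.pending.popleft()
                     for _ in range(min(self.n_max, len(self.pending)))]
            # One non-preemptable elevator sweep: serve in cylinder order.
            batch.sort(key=lambda r: r.cylinder, reverse=not self.direction_up)
            for req in batch:
                serve_io(req)
            self.direction_up = not self.direction_up
            return batch

Because at most n_max requests are served per sweep, the cycle duration is bounded, which is what makes the 2T_max delay argument above possible.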

3.2 Multiple Disks

Continuous media data typically requires large amounts of space and high bandwidth. Current commercial disks, having an average data rate of 30 Mbps to 100 Mbps, can support a number of, say, MPEG1 videos (typically 1.5 Mbps) or MPEG2 videos (typically 6-10 Mbps). However, to support a large number of concurrent users, as well as to support media objects with higher bandwidth requirements such as High Definition TV, 3D virtual worlds, etc., parallel disk systems are required. In order to effectively use the available disk bandwidth in a parallel disk system, we must ensure that the data requests are evenly distributed among all disks. We should avoid the situation where a significant fraction of requests are concentrated on a few disks even for a short time; ideally we would like to have the load evenly distributed among all disks at all times, so that we could use the full bandwidth of all disks and any arriving request will have a relatively short waiting time. This problem of load balancing is addressed in different ways in different systems. (Even within the same system, load balancing is often approached with different methods on different time scales; e.g., in storage systems both storage allocation and request scheduling are components of the load balancing solution but are applicable on widely different time scales.)

Much of the work on continuous media servers has concentrated on video playback, which has the convenient property of being highly predictable. Once a video playback stream is started, the video object is sequentially read from the storage system. This high predictability is exploited in some video servers to carefully lay out data on disks to achieve a good load balance of the data requests to the system's disks. A common data layout scheme is striping, which is round-robin allocation of blocks to disks: logically consecutive blocks of an object are allocated on consecutive disks, cycling repeatedly over all disks. Then accessing the object sequentially will generate successive accesses which cycle, in round-robin fashion, through all disks. The great advantage of striping is that it decouples storage capacity from disk bandwidth, since each object uses the bandwidth of all disks and thus disk overloading due to skew in object popularity is not an issue [8][15][31][33].

A common approach to scheduling requests when objects are striped over all disks is to process requests in cycles of constant duration synchronized across all disks. In each cycle, each active stream accesses a single block on a particular disk. On each cycle the data read during the previous cycle is played out to the client while the next logical block is read (which is located on the next disk in round-robin order). Thus, the load on each disk (i.e., the number of requests in the current cycle) moves to the next disk in round-robin order on the next cycle. Note that as long as the streams move in "lock step" there is no danger of some disk having an overload in any cycle. Also, if there is bandwidth available in the system for another stream, this means that there is some disk that has less than the maximum number of requests that can be served in a cycle. Thus "available slots" migrate from disk to disk as cycles are processed. At some point, an available slot appears at the disk that holds the first data block of a waiting stream. The waiting stream can then start and will be guaranteed to have one request per cycle for the duration of the video.

However, there are some drawbacks to this approach. First, this scheme, as described, requires that all streams have the same constant bandwidth. This limits the use of this type of system in supporting different types of media objects such as MPEG1 and MPEG2 video, sound, etc. Also, a single video may have a variable bit rate (VBR) when encoded using MPEG or other similar compression techniques. Schemes using variable size blocks [6] or a variable number of blocks per cycle per stream [32] have been proposed but, due to space limitations, we are not able to discuss them in any detail. How to provide VCR functionality on this type of system is also an issue. Finally, the same number of block accesses can have different durations due to disk seek time and rotational latency variations as well as transmission speed variations from zone to zone. Therefore, in order to have a tight upper bound on buffer use, the length of the disk cycle needs to be prespecified and cycles are synchronized on all disks, with the cycle length set to the worst case cycle time for some number n_max of disk accesses. However, most disks will be idle toward the end of a cycle, which reduces the effective bandwidth of the system. This scheme results in a system which saturates at some load that is less than the capacity of a system which does not impose such idle times. The implication is that, at high loads, the system that does not have the synchronization-imposed idle times will perform better. A detailed study of this phenomenon is given in Section 6.

3.2.1 Random Data Allocation

In the last section we discussed current realtime data servers using disk striping over multiple disks and we pointed out some drawbacks of these approaches. We now propose a new data layout scheme. In this approach a storage object is an array of fixed size blocks. The data blocks of an object are randomly mapped to disk blocks. Random block allocation naturally provides long term load balancing; after a sufficiently large number of requests uniformly distributed over the disks, by the Law of Large Numbers the number of requests submitted to each disk will be nearly equal. However, short term statistical variations can still cause short term load imbalance, which can lead to a relatively heavy tailed response time distribution with high variance. This forces us to reduce the maximum average load that can be submitted to the system, to provide slack time that can absorb the short term fluctuations. Our solution to control these short term variations is to use partial replication and an online, shortest-queue scheduling algorithm, as discussed later in this section.

Random data allocation has been proposed for non-realtime systems such as the RAMA parallel file system proposed by Miller and Katz [22]. However, random data allocation is not considered in almost any realtime system for multimedia applications, with the exception of the work of Tewari et al. [30]. They propose a video server with random placement of data blocks on disks. However, they do not consider data block replication and do not address the problem of short range load imbalance [2].

[2] Yitzhak Birk informed us through personal communication that his group is also working on video servers based on random data allocation at the Technion in Israel, but their work is still unpublished and unavailable at this time.

At this point we enumerate several claimed advantages of the random data allocation used in the RIO storage subsystem, which will be substantiated in the following sections.

1. Support for unpredictable access patterns. For applications with unpredictable access patterns such as 3D interactive virtual worlds, careful data layout is impracticable and random allocation seems the appropriate solution. All access patterns at the logical level are mapped to random accesses at the physical level. So from the point of view of realtime performance of the disk subsystem, there is only one workload to deal with.

2. Robustness with respect to data hot spots. It is known that in most storage systems a small fraction of the data is the target of a large fraction of the I/O requests. The more popular data, usually called hot spots, are present in most systems with various degrees of intensity. In general this can lead to some load imbalance if the hot spots are concentrated on a few disks of the system. For example, if we distribute the data on disks based on logical objects and most of the I/O requests access a few objects located on a few disks of the system, these disks will have a higher load than the others. In our system, however, the hot spot data will be
mapped to a set of hot physical blocks randomly distributed over the disks. The location of each hot physical block is independent of the location of any other hot physical block, even if they belong to the same popular object. Therefore, if the number of hot data blocks is sufficiently large (more than a few blocks per disk), which is probably the case for most applications, they will be approximately equally distributed among the disks. In the improbable case that the number of hot blocks is very small, that could lead to a skewed load. However, this skewed load could be effectively removed using a simple disk cache with a size of only a few blocks per disk. We note also that the data that is "hot" frequently changes with time. Adequately dealing with time dependent access frequency is another level of difficulty which the random allocation approach eliminates (assuming only that there are many more hot blocks than there are disks).

3. Asynchronous cycles. Since requests are not carefully laid out on disks, there is no need to have synchronized disk cycles on the system disks. This improves the performance of the system, since disks do not need to be idle, waiting for other disks to finish their cycles.

4. Multiple playout rates and variable bit rate. Since no careful layout or time synchronization among the disks is assumed, support for streams with different playout rates is naturally provided. Also, changing the access rate of any stream, which is required by VBR video encoding for example, depends only on the total available bandwidth of the system and not on the temporal pattern of available bandwidth of individual disks, as is the case when, for example, staggered striping is used. For the same reason, providing VCR functionality is much easier.

5. Support for dynamic quality of service. When the current (transient) load exceeds the storage system's capacity, applications have to adapt their resource requirements. The result is some loss of quality of service during the transient overload. Generally this is accomplished by use of alternative representations of objects which have less resolution and are smaller. This also implies a different access pattern to the data depending on whether there is a transient overload condition or not. This can be complex to deal with effectively [9] in a striping system, but it poses no special problem to RIO since the pattern of accesses continues to be random.

Random data allocation, on the other hand, cannot provide absolute realtime guarantees. Although random allocation will distribute data requests equally among all disks in the long term, there will also be short term skew in the traffic that will unequally load the disks. In a non-realtime application such short term fluctuations in load are generally not a problem. We have adopted the approach of statistical guarantees, and guarantee that the probability of missing the delay bound is smaller than ε, where ε can be made very small, e.g., 10^-6. Another limitation of random allocation is that the system cannot be loaded at full disk utilization, i.e., at the average effective data rate that could be supported by the system disks, since some room for statistical variation should be provided. However, our experimental results, presented later, show that relatively high disk utilization with small probabilities of missing the delay bound is possible.

3.2.2 Statistical Delay Guarantee with Random Block Allocation

We start with a simplified description, which is useful for seeing what parameters and algorithms are involved. Then we extend the description to include a few additional details.

Simplified system model

First we assume an admission control module which limits the initiation of sessions to guarantee that certain constraints on the disk traffic are met. In particular, admission control will guarantee that the total number of disk accesses in any interval of length T is less than N, where T and N are system parameters that will be defined shortly. Let D be the number of disks. Then n = N/D is the number of accesses to each disk, assuming that N is an integer multiple of D and assuming that the total number of accesses is divided equally among the disks, e.g., if they could be routed round-robin to the disks. Of course round-robin scheduling is not possible (unless all data is replicated on all disks), but as an infeasible ideal it is interesting to consider. In this ideal case the number of requests to each disk in any interval of length T will be at most n. Now assume each disk is scheduled using the RTSCAN algorithm with at most n accesses served per cycle. Let T_max be the maximum time for a cycle, i.e., the maximum time to serve n disk accesses. Then, as discussed previously, the maximum delay any request will experience is 2T_max.

Realistic system model

When we consider the actual system there are additional details that have to be considered. First, with respect to admission control, we have to consider the manner in which session traffic is characterized and how that characterization can be used to make the guarantee that no more than N disk accesses will be required in any interval of length T. The H-BIND model [20], for example, leads to bounds on the mean and variance of the number of accesses for a given interval length. Then, using the Central Limit Theorem or Cramér's bound [5], one can estimate the probability that the actual number of requests in an interval of length T will exceed a given number N. We have adopted this method for admission control.

A second modification to the system model concerns the distribution of disk accesses to the individual disks. The simple model made the unrealistic assumption that the requests are perfectly distributed to the disks. In fact, if disk blocks are not replicated, there will be no choice in routing requests to disks. Given the random allocation policy, the correct model is that each request is equally likely to go to any disk; thus, in fact, there is very low probability of such a perfect division of the requests to disks. In order to be able to provide a delay bound with high probability, there are several choices that come to mind immediately: either increase the delay bound or reduce the maximum traffic allowed by admission control, i.e., reduce N. However, reducing N implies fewer sessions can be served concurrently. There is a third interesting option: replicate some data blocks so that for this fraction of the requests there is a choice as to which disk to send the request. One simple online algorithm is to send the request (for a replicated block) to the disk with the smaller number of unfinished requests. (More sophisticated schemes which estimate the actual service times, etc., were studied and found to yield little improvement over this simple scheme.) The use of replication is partially motivated by the technological trends in magnetic disks, which suggest that storage capacity is increasing faster than access rates or transmission rates [14]. Thus bandwidth, rather than storage capacity, will increasingly be the more critical resource.

The routing of requests for replicated blocks to the least loaded disks is easily mapped to a problem that has been studied in the computer science literature for load balancing in distributed systems [11]. There it is often referred to as "random probing". Rather than disk accesses and disks, the problem is normally expressed in terms of tasks and processing nodes. A task may arrive to the system and, due to the expense of communication with processing nodes, several processing nodes are randomly chosen to probe for their current load. The task is then routed to the least loaded of the probed nodes. (There are many variations which are not relevant here, e.g., including the cost of moving the task to a remote processing node.) The main result is that random probing is very effective and does not require probing of many locations to be effective; most of the benefit can be realized by probing only two sites and, in fact, if only a fraction of the requests are routed this way (the remainder go to randomly chosen sites), most of the gain is realized. The mapping of this problem to our parallel disk subsystem is clear. Disk requests are the
tasks, and random disk block allocation with random replication is equivalent to randomly probing for the least loaded server. The most recent work related to this problem is that of Mitzenmacher [23], which has a detailed analysis of several variations of a multi-server queuing model in which customers can randomly probe d servers and join the one with the shortest queue. The results show that an exponential improvement in the mean waiting time of customers is obtained when multiple choices are provided, as compared to the case where customers have only one choice. In particular, most of the improvement is obtained when customers have only two choices. Our experimental results confirm that replicating data on multiple disks can significantly reduce the variation of individual disk loads, providing a more balanced and efficient system. However, there are significant differences between our system and the models analyzed in [23] with respect to service time distributions, arrival processes, etc. Since we require accuracy on the tail of the response time distribution, we resorted to simulation and measurement. Models, however, are useful for configuration planning and a first order approximation of the delay bounds that can be guaranteed.

Before leaving this topic of load balancing it is important to make several points. The first is that most of the work on random probing for load balancing has not been concerned with delay bounds and so concentrated on average latency. We are interested in the tail of the response time distribution because we need to make (statistical) delay bound guarantees. It is intuitive that these bounds will be much better due to the load balancing, but to quantify this we have to go to simulation; the analytic results are not available except for limited service time distributions and arrival processes. The second point is a bit subtle. In the simple system model we assumed that disk requests could be perfectly divided among the disks in each time period. If we route disk requests to the shorter queue we can actually do better than this "perfect" distribution of requests! The reason is that routing to the shorter queue tends to balance the aggregate service time assigned to each disk, not just the number of accesses. This has the following effect (which will be illustrated with our measurement and simulation results shortly). The constraint on the aggregate number N of disk requests in any interval of length T is set such that the average time for a scan cycle with N/D requests is close to T, and not the maximum cycle time. Thus the constraints that admission control sets on concurrent sessions are closer to the capacity of the system and not based on worst case service times.
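
As a concrete illustration of this "two choices" routing rule, the following sketch dispatches each request for a replicated block to whichever of its two candidate disks currently has fewer unfinished requests; unreplicated blocks have no choice. This is an illustrative toy (the function names and the use of a 25% replication fraction in the example are just for the demonstration), not the RIO scheduler itself.

    import random

    def place_block(num_disks, replicated):
        # One random disk, or two distinct random disks if the block is replicated.
        if replicated:
            return random.sample(range(num_disks), 2)
        return [random.randrange(num_disks)]

    def dispatch(candidate_disks, queue_lengths):
        # Shortest-queue routing: among the disks holding the block,
        # pick the one with the fewest unfinished requests.
        return min(candidate_disks, key=lambda d: queue_lengths[d])

    # Toy usage: 14 disks as in the prototype, 25% of blocks replicated.
    queues = [0] * 14
    for _ in range(10000):
        disks = place_block(14, replicated=(random.random() < 0.25))
        target = dispatch(disks, queues)
        queues[target] += 1   # request joins the chosen queue (service not modeled here)

With only one copy the request goes to a uniformly random disk; with two copies the shorter of the two queues is chosen, which is the d = 2 case analyzed by Mitzenmacher.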

Selecting system parameters

The values of the important system parameters are chosen as follows. First, the disk block size is chosen as a tradeoff between disk efficiency (larger is better) and the ability to avoid fetching superfluous data (smaller is better). Let n_max be the maximum number of disk accesses served in one cycle. n_max determines T_max, which is chosen such that there is less than a 10^-6 probability of two consecutive cycles of n_max requests taking longer than T_max to serve. The value for n_max is a tradeoff between reasonable disk efficiency (the larger n_max the better, but with diminishing returns) and response time bounds (the smaller the better). Note that n_max/T_max might appear to be the rate at which requests can be served by a single disk with the delay bound still met with high probability. Actually, due to the online load balancing, the maximum rate can be much closer to n_max/T_avg, where T_avg is the average cycle time to serve n_max requests. Admission control constrains session admission such that the maximum number of requests in any interval of length T_avg is at most N, where N is some value close to D*n_max. The value for N is the most difficult to obtain. It can be roughly estimated from analytic models, estimated (conservatively) within a few percent from simulation models, and accurately measured on the prototype system. The analytic model ball-park figure is D*n_max. The simulation and measurement results are presented in the remaining sections of the paper.
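
As a rough sketch of how an admission decision along these lines might be computed (this is our own normal-approximation illustration of the Central Limit Theorem argument mentioned earlier, not the H-BIND formulation or the RIO code, and all names are hypothetical), assume each admitted session contributes a known mean and variance to the number of disk accesses generated in one interval, and that the sessions are independent:

    import math

    def prob_exceeds(n_limit, session_means, session_vars):
        # Normal-approximation estimate of P(total accesses in an interval > n_limit).
        mu = sum(session_means)
        sigma = math.sqrt(sum(session_vars))
        if sigma == 0.0:
            return 0.0 if mu <= n_limit else 1.0
        z = (n_limit - mu) / sigma
        return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of the normal distribution

    def admit(new_mean, new_var, means, variances, n_limit, p_miss=1e-6):
        # Admit the new session only if the estimated probability of exceeding
        # the per-interval budget N stays below the target miss probability.
        return prob_exceeds(n_limit, means + [new_mean], variances + [new_var]) < p_miss

A heavier-tailed estimate such as Cramér's bound could be substituted for the normal tail where the Gaussian approximation is too optimistic.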

Cost of replication

Data replication is not new and has been used in many data storage systems, both for providing dynamic load balancing and for providing reliability under disk failure. However, we believe that the use of partial data replication for load balancing in a system for realtime I/O with random data allocation has not been explored in any other work. In RIO a fraction α (0 ≤ α ≤ 1) of the total number of stored blocks is replicated while the others are not. In the remainder of the paper we study the system performance for different values of α, varying from 0 (no replication) to 1 (full replication).

The cost for having better load balancing and increased disk bandwidth utilization is the use of extra storage space. Current disk technology trends show that storage capacity is growing faster than bandwidth [14]. We thus believe that disk bandwidth will be the bottleneck for multimedia applications, especially if a large number of streams (users) needs to be supported. If we replicate a fraction of the data, one may argue that by instead buying that fraction more disks, the disk utilization (with the same load) is decreased and the delay bound will be met with higher probability due to the lower utilization. One response is that the replication may not require buying more disks; it depends on the relative requirement for storage versus bandwidth. We might simply have excess storage capacity, i.e., the disks were bought for their bandwidth. If, however, we do have to buy more disks to provide storage for replication, our scheme allows us to also use the additional disk bandwidth to support a higher number of active streams (users). For example, suppose we have 20 disks with no replication. To achieve a reasonable delay bound, say we can only use 75% of the aggregate disk bandwidth, and there is no space available for replication. It can happen that with 25% replication (5 more disks) we can utilize 90% of the aggregate disk bandwidth and still meet delay bounds. Thus there would be a 50% increase in available bandwidth and 50% more concurrent streams allowed on average.
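
A quick check of the arithmetic in this example, using only the numbers quoted above:

    \[
      20 \times 0.75 = 15 \quad\text{usable disk-equivalents of bandwidth without replication},
    \]
    \[
      25 \times 0.90 = 22.5 \quad\text{usable disk-equivalents with 25\% replication},
    \]
    \[
      \frac{22.5}{15} = 1.5,
    \]

i.e., a 50% increase in usable aggregate bandwidth, and hence in the average number of concurrent streams.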

3.3 Pseudo random block mapping

In RIO a data object is a sequence of constant size data blocks. The size of a block is a system parameter and is specified at the time the storage system is created. Typical values for block sizes used in our performance studies were in the range from 16 Kbytes to 256 Kbytes. The storage map for each object holds a system wide unique identifier which maps to the storage for that data block. To approximate random allocation the following scheme is used. A pseudo-random pattern for N logical blocks is constructed which provides a map from logical block number to one or two (if the logical block is replicated) physical disk block numbers (disk number, block number). This map has the following properties: (a) the same number of blocks is assigned to each disk, (b) the desired percentage of blocks is assigned two physical locations, and (c) the pattern of replicated blocks is such that each disk has the same number of replicated blocks. Let L be the number of disk blocks occupied on each disk in the mapping described above. In effect, the pattern described by this table is assumed to be repeated until all the space on the disks is occupied. The full mapping from a system logical block number to a physical location is as follows. Let X be the logical block number, x = X mod N, and let Base = L * floor(X/N). The table is used to map x to a [disk number, block number] pair (or a pair of such). The disk number identifies the disk; the actual physical block on the disk is at Base plus the block number from the mapping table. When a disk block is to be allocated, a random data block index in the range [0, M] is chosen, where M is the total number of logical blocks available; e.g., if we have 100% replication, M is half of the number of physical blocks. If this block is free it is allocated to the object. If not, consecutive blocks after it are checked until a free block is found and allocated to the object. This block allocation scheme guarantees that the physical locations of any two blocks of an object are independent and uniformly distributed.
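
The sketch below illustrates the mapping just described: a pattern table for N logical block slots is built once, and a logical block number X is mapped to its physical location(s) by indexing the table with X mod N and offsetting by L * floor(X/N). It is an illustration with made-up names, not the RIO implementation, and it enforces properties (a) and (b) exactly while property (c) holds only approximately because of the plain random shuffle.

    import random

    def build_pattern(num_disks, blocks_per_disk, frac_replicated, seed=0):
        # Build the pseudo-random pattern table. Every physical slot on every disk
        # is used exactly once, and a fraction of the logical blocks get two copies.
        rng = random.Random(seed)
        slots = [(d, b) for d in range(num_disks) for b in range(blocks_per_disk)]
        rng.shuffle(slots)
        total = len(slots)
        n_logical = int(total / (1.0 + frac_replicated))   # N logical blocks fill all slots
        n_replicated = total - n_logical                   # these logical blocks get 2 slots
        table, i = [], 0
        for logical in range(n_logical):
            if logical < n_replicated:
                table.append((slots[i], slots[i + 1]))
                i += 2
            else:
                table.append((slots[i],))
                i += 1
        return table, blocks_per_disk        # blocks_per_disk plays the role of L

    def physical_locations(table, L, X):
        # Map logical block X to its (disk, physical block) location(s) by repeating
        # the pattern: x = X mod N, Base = L * floor(X / N).
        n = len(table)
        base = L * (X // n)
        return [(disk, base + blk) for disk, blk in table[X % n]]

    table, L = build_pattern(num_disks=14, blocks_per_disk=1000, frac_replicated=0.25)
    print(physical_locations(table, L, X=123456))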

4 Performance experiments

The major performance issues are to determine the values for the block size, n_max, T_max, the delay bound, the maximum load allowed by admission control, the fraction of replication, and the probability of exceeding the delay bound. These issues are studied in this and the following sections. Here we report on measurements made on the RIO system prototype with 14 parallel disks. In the following sections we use simulation to study larger systems, approaches to fault tolerance, and other issues.

4.1 Cycle Time Distribution

The duration of a disk cycle is the sum of the I/O times of the requests served in that cycle. The time to read a data block is a function of the block size and also of other random components such as rotational delay, the disk transfer rate (which, on current commercial disks that implement disk zoning, decreases from the outermost tracks to the innermost ones), as well as other minor factors. We analyze the cycle time as a random variable which is the sum of the random variables associated with the individual requests in the cycle, using parameters for the disks used in our prototype (Seagate Barracuda ST15150W). The time to read one block is the sum of the following random variables: seek time, rotational latency, disk transfer time, track switch time, SCSI data transfer time, system call and SCSI command overhead, and SCSI bus contention among multiple disks. These parameters were obtained experimentally as in [27][34]. The disk layout, including the number of zones, number of heads, sector size, number of cylinders per zone, track size per zone, and track skew factor per zone, was obtained from the disk(s) using SCSI commands.

For our purposes we want a stochastic upper bound on the cycle time distribution. So, for simplicity, the seek time is modeled by assuming the worst case cycle, in which the number of cylinders between the first block and the last block requested in the cycle is maximum and the other requested blocks in the cycle are equally spaced over the disk cylinders. The seek time is then computed using the seek time function of the disk (which was obtained experimentally). The rotational latency is computed as a uniformly distributed random variable in the range of 0 to the disk revolution time. The disk transfer time is computed using the disk transfer rate for each zone, the block size, and the size of each zone (which determines the probability of a block being located in a specific zone). The track switch time can be obtained from the distribution of the number of track boundaries crossed by a data block, the track skew factor, and the disk rotation speed. System call and SCSI command overheads are assumed to be constant and are obtained by measuring the I/O time to read data stored in the disk cache memory for different sizes of data and eliminating the linear component that is due to SCSI data transfer time. The SCSI data transfer time is difficult to compute when data is read from the disk without knowing the details of the disk controller operation, since SCSI data transfer overlaps disk transfers (when large blocks are accessed). We computed the additional overhead due to SCSI data transfer time by measuring the I/O time for requests of a given block size and discounting the other components which had been previously computed. Up to this point all parameters were obtained from experiments with only one disk on the SCSI bus being accessed. Since in our prototype there are two disks on each SCSI bus, there is an additional overhead due to contention on the SCSI bus. We estimated the distribution of the overhead due to SCSI bus contention by measuring the request I/O time using just one and then using the two disks on the SCSI bus, and computing the difference of the observed means and variances in the two cases. After computing the request time distribution, the cycle time distribution was computed numerically as the sum of i.i.d. (independent identically distributed) random variables.
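
Numerically, "the sum of i.i.d. random variables" amounts to repeated convolution of the per-request service time distribution with itself. The sketch below shows the idea on a discretized distribution; the uniform per-request distribution used in the example is a made-up placeholder, not the measured one from the prototype.

    import numpy as np

    def cycle_time_pmf(request_pmf, n_requests):
        # Distribution of the duration of a cycle of n i.i.d. requests: the n-fold
        # convolution of the per-request service time PMF (on a fixed time step).
        pmf = np.array([1.0])                # point mass at zero duration
        for _ in range(n_requests):
            pmf = np.convolve(pmf, request_pmf)
        return pmf

    # Placeholder per-request distribution: 5..20 ms, equally likely, 1 ms steps.
    step_ms = 1.0
    request_pmf = np.zeros(21)
    request_pmf[5:21] = 1.0 / 16.0
    cycle_pmf = cycle_time_pmf(request_pmf, n_requests=20)
    mean_cycle_ms = step_ms * np.dot(np.arange(len(cycle_pmf)), cycle_pmf)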

Figure 4: Cycle time distribution

Figure 4 shows the computed and measured cycle time distributions for 128 Kbyte requests and cycles with 20 and 5 requests, for the disks used in our prototype. The computed distribution is shifted slightly to the right and represents a random variable which is stochastically greater than the actual cycle time. This occurs mostly because our model assumes the worst case seek time. This error is relatively larger for smaller cycles, since in that case the difference between the expected seek distance and the maximum is larger. However, the error due to this assumption is small and conservative, which is what we need.

4.2 Calculating the delay bound, T_max

The duration of two consecutive cycles T_i and T_{i+1} depends on the number of requests served in these cycles, but the sum is stochastically less than the sum when the two consecutive cycles have the maximum number of requests n_max:

    T_i + T_{i+1}  <=_st  { T_i + T_{i+1} | n_i = n_max, n_{i+1} = n_max }

where:

    x <=_st y : x is stochastically smaller than y, i.e., Prob(x > a) <= Prob(y > a) for all a.
    n_i : number of requests in cycle i.

Because of the random allocation scheme, the locations of blocks requested in cycle i+1 are independent of the locations of blocks requested in cycle i, and thus the durations of two consecutive cycles, given that the two cycles have the maximum number of requests, are independent random variables. Therefore the distribution of the total time of two consecutive cycles with the maximum number of requests, {T_i + T_{i+1} | n_i = n_max, n_{i+1} = n_max}, can be obtained by convolving the distribution of the cycle time {T | n = n_max} with itself, where the cycle time distribution is computed as described in Section 4.1 and illustrated in Figure 4. We then define the delay bound "guaranteed" by the system as

    T_max = { t | Prob(T_i + T_{i+1} > t | n_i = n_max, n_{i+1} = n_max) = P_miss }

T_max is a stochastic upper bound on the maximum delay of a request that will be served in the cycle following its arrival. We used the above methodology to compute the delay bound for the disks used in our prototype. Unless otherwise stated, we consistently use P_miss = 10^-6.

When computing the cycle time distribution thus far, we have ignored thermal re-calibration. Disk heads need to be re-calibrated periodically due to thermal variations, and during thermal re-calibration data access is delayed. The smallest interval between two consecutive thermal calibrations for our disks is approximately 90 seconds, and since request delays are on the order of a hundred milliseconds, each request will be delayed by at most one thermal calibration. The fraction of requests affected by re-calibration is much larger than 10^-6 and cannot be ignored with respect to the delay bound; however, it does not have a significant effect on throughput. To account for possible thermal calibrations we add to T_max an additional worst case thermal calibration time T_calmax as specified in the disk manual (for our disks T_calmax = 50 msec).
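
Continuing the numerical sketch from Section 4.1, the delay bound can be read off the two-cycle distribution as the smallest t whose tail probability has dropped to P_miss, with the worst case thermal calibration time added on top. Again, this is an illustration with hypothetical names, not the prototype's code.

    import numpy as np

    def delay_bound_ms(cycle_pmf, step_ms, p_miss=1e-6, t_cal_max_ms=50.0):
        # T_max: smallest t with Prob(T_i + T_{i+1} > t) <= p_miss for two independent
        # full cycles, plus the worst case thermal re-calibration time.
        two_cycle_pmf = np.convolve(cycle_pmf, cycle_pmf)   # independent consecutive cycles
        tail = 1.0 - np.cumsum(two_cycle_pmf)               # Prob(sum > t) on the time grid
        # First grid point where the tail falls to p_miss (assumed to exist on this grid).
        t_index = int(np.argmax(tail <= p_miss))
        return t_index * step_ms + t_cal_max_ms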

Figure 5: Maximum delay guarantee

Figure 5 shows the delay bound, for different values of the block and cycle sizes, for the disks used in our prototype. We observe that the delay bound increases approximately linearly with the cycle and block size. This behavior is explained as follows.

a) Delay bound as a function of block size

The average I/O time for a single request increases linearly with the block size, since it is given by the amount of data accessed divided by the average disk transfer rate, plus an overhead time which is approximately constant. The variance of the request I/O time has two major components: the variation of the disk transfer rate due to different disk zones and the variation of the I/O overhead time (rotational latency, seek time, etc.). Since these components are independent random variables, the variance of the service time of a single request is given by the sum of the two variances (one for each component). The variance of the data transfer time is proportional to the square of the block size (its standard deviation is a linear function of the data transfer time), while the variance of the I/O overhead time is approximately constant for different block sizes. Therefore the standard deviation of the request I/O time increases linearly with the block size, and thus so does the cycle time standard deviation. Therefore the delay bound, which we expect to be closely related to an offset from the mean cycle time proportional to the standard deviation, should increase linearly with the block size.

b) Delay bound as a function of cycle size

The average cycle time is proportional to the number of requests in a cycle. The standard deviation, however, increases proportionally to the square root of the number of requests in a cycle, since the service time of each request is an independent random variable. Moreover, the standard deviation is small compared to the mean cycle time, so the linear behavior of the mean cycle time dominates the delay bound; this explains the almost linear results in Figure 5 (e.g., for 128 Kbyte blocks the standard deviation varies from 5.2% to 2.7% of the mean as the cycle size varies from 5 to 20 requests).

4.3 Effective disk bandwidth

Figure 6: Disk effective bandwidth

We compare the effective bandwidth per disk for different block sizes and maximum numbers of requests per cycle (cycle sizes). Clearly the effective disk bandwidth increases with the block size and the maximum number of requests per cycle. A larger block size amortizes the rotational latency and overhead times over a larger effective data transfer time, and a larger number of requests per cycle amortizes the seek time over more requests. However, increasing the block size increases the size of the buffers required for each stream, which has an associated cost. Also, larger block sizes increase the probability of reading superfluous data for applications such as 3D interactive models, where data accesses are not sequential and the model objects will not generally be a multiple of the block size. Another effect is that increasing the number of requests per cycle or the block size increases the upper bound on delay that can be guaranteed with a fixed probability. Therefore, there are multiple tradeoffs to keep in mind when selecting the block and cycle sizes. However, in this section we concentrate only on how the bandwidth of the system varies as we vary these parameters.

The effective disk bandwidth for a particular block and cycle size is easily computed as the ratio of the total amount of data read to the average duration of a cycle. Figure 6 shows the effective disk bandwidth achieved by different combinations of block and cycle sizes as measured on our system prototype. As expected, the performance improves when either the block size or the cycle size increases. We observe in the figure that bandwidth is more sensitive to block size than to cycle size. This is due to the fact that the seek time is a small fraction of the total I/O time. We note that these results apply more widely than our random data allocation approach. Any system that schedules data requests using disk cycles will be bound by the same maximum performance as a function of the block and cycle sizes. These curves are the same for random data allocation or for data striping in which blocks assigned to a disk are randomly located on that disk. The maximum allowed bandwidth per disk (enforced by admission control) will actually be a fraction of the maximum effective disk bandwidth shown in Figure 6, since we have an additional constraint on the delay experienced by requests. To meet this requirement we have to constrain the traffic presented to the storage system to something less than the maximum values that appear in Figure 6. Data striping systems also have to operate at bandwidths smaller than those shown in Figure 6, but we will address this in detail in Section 6, in which we compare the performance of the two approaches.
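
In code, the bandwidth computation in the previous paragraph is a one-liner; the mean cycle time plugged into the example below is a hypothetical value chosen so that the result is close to the 3.72 Mbytes/sec quoted for 128 Kbyte blocks and 20-request cycles in Section 4.4.

    def effective_bandwidth_mbytes_per_s(block_kbytes, requests_per_cycle, mean_cycle_s):
        # Effective disk bandwidth = data read per cycle / average cycle duration.
        return (block_kbytes * requests_per_cycle) / 1024.0 / mean_cycle_s

    print(effective_bandwidth_mbytes_per_s(128, 20, 0.67))   # roughly 3.7 Mbytes/sec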

4.4 Probability of response time exceeding the delay bound The delay bound is chosen so that the probability of any request response time exceeding this value is smaller than the probability Pmiss when there are no more than nmax requests per cycle for a disk. However, there is still the the possibility that more than nmax requests are routed to a particular disk in some interval of length Tmax . The goal is for the probability of this happening to any request also be less than 10?6. We call a request that arrives to a disk queue and nds nmax requests (or more) already there, a \delayed" request since it will not be served in the next scan cycle. Of course, as we increase the total load on the system, we expect the probability of having delayed requests 22

to increase, since we are increasing the average load on each disk, and thus increasing the average disk queue size. Another factor that determines the probability of having over ow cycles (i.e. cycles initiated when there are more than nmax requests waiting) is the level of load balance among the system disks. Due to the randomness of the location of data blocks, the instantaneous load on each disk will uctuate around its mean value. The higher this uctuation the higher the probability of having a transient high load on at least one disk and thus the higher the probability of having an over ow cycle, and therefore a delayed request, for the same average workload. In order to analyze the performance of our system in detail we conducted a series of experiments on the prototype system. During each experiment the system was driven by a trac simulator, generating a sequence of random read requests of constant size blocks for a period of 30 minutes. The simulator generates a xed number of requests in successive intervals of constant duration Tmax with the requests arrival times randomly distributed over the interval. Although not explicitly shown here, we also experimented with di erent intervals lengths and with other arrival processes (Poisson and constant rate) and the results obtained were indistinguishable from each other so we concluded that the results are relatively insensitive to the exact nature of the arrival process. In all experiments we assumed that the delay bound was the same for all requests. (This value is di erent for each block size and cycle size selected). The experiments were carried out on a con guration with 14 disks. Figure 7 shows the measured frequency of requests exceeding the delay bound as function of the relative load. (The relative load is de ned as the fraction of the maximum e ective disk bandwidth that can be achieved for the selected block and cycle sizes.) Results for the 100% replication case are not shown in the gures because in this case the fraction of requests exceeding the delay bound is smaller than 10?6 for the entire range of loads from 0% to 99% of the e ective disk bandwidth and could not be measured. (The duration of experiments did not allow measurements of frequencies smaller than 10?6). This demonstrates the startling e ectiveness of short term load balancing via replication. In all experiments, no successful request missed the delay bound and of the requests exceeding the delay bound, all were due to delayed requests, con rming our prediction that the probability of a successful request response time exceeding the delay bound is less than 10?6. The results show that increasing the level of replication reduces the probability of exceeding the delay bound, due to enhanced short term balancing of disk queues. We also observe that the relative performance for 32 Kbytes and 128 Kbytes are very similar, although the absolute performance is di erent due to di erent 23

Figure 7: Probability of a request response time exceeding the delay bound.

We also observe that the relative performance for 32 Kbyte and 128 Kbyte blocks is very similar, although the absolute performance differs due to the different effective disk bandwidths for the two block sizes. We also observe that increasing the cycle size reduces the probability of exceeding the delay bound. The main reason is that increasing the cycle time causes a significant increase in the delay bound used in the experiments. Regardless of nmax, the number of requests actually served per cycle tends to be determined by the traffic rate (see the discussion on the distribution of cycle sizes below). Therefore the distribution of response time for a request is not very sensitive to nmax, but the delay bound does increase with nmax.

The results show that probabilities of exceeding the delay bound of less than 10^-6 can be achieved with loads in the range of 70% to 95% of the maximum load supported by the disk with only 25% of the data replicated, or with loads on the order of 99% when all data is replicated. Even with no replication, loads in the range of 60% to 80% can be achieved (for delay bounds in the range of 0.3 sec to 1.5 sec).

The effect of replication can be observed in Figure 8, which shows the density of cycle sizes for 128 Kbyte blocks, maximum cycle sizes of 5 and 20, and relative loads of 80% and 96%. (Note that the scale in the graphs is logarithmic, which explains why the areas under the curves are not the same.) We observe that increasing the level of replication significantly reduces the probability of having cycles of the maximum size (i.e., overflow cycles) and thus reduces the probability of exceeding the delay bound, as observed in our previous results.

Figure 8 also shows that the mean cycle size is often much smaller than the maximum cycle size. This occurs because of the asynchronous nature of cycles. The extra disk bandwidth available, which would otherwise generate idle time, is absorbed by having more frequent but smaller cycles. (The disk tends to be as inefficient as it can be, by having shorter cycles and more seek overhead for a given load, in order to absorb the extra bandwidth.) This effect is more pronounced for larger cycles; for example, in the case of nmax = 20 the largest observed cycles are smaller than the nominal maximum cycle size. Even a load slightly smaller than the maximum, for example 96% utilization, reduces the mean cycle size by approximately a factor of three. To emphasize this behavior, consider the case of 128 Kbyte blocks, a maximum cycle of 20 requests, and 96% load. From Figure 6, the effective disk bandwidth in this case is 3.72 Mbytes/sec, and thus a 96% load is equal to 3.57 Mbytes/sec/disk. Again from Figure 6, we observe that an effective disk bandwidth of approximately 3.57 Mbytes/sec is achieved by a cycle with 7 requests, which, not surprisingly, is approximately the mean cycle size observed in Figure 8. Thus the disk is operating with less efficient cycles and very little idle time in this case. This absorption of excess bandwidth has a limit, however, since we cannot decrease the cycle size below 1. When we reach this limit, most cycles will have size 1, and the disk will start having more frequent idle periods, as suggested by Figure 8.
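The self-adjustment of the cycle size can be sketched with a simple model of our own (the seek, settle, rotation, and transfer parameters below are illustrative assumptions, not the measured values behind Figures 4 and 6): in steady state the requests arriving during a cycle equal the requests served in it, so the mean cycle size is roughly the smallest n whose effective bandwidth matches the offered load.

    BLOCK = 128 * 1024            # bytes per request (128 Kbyte blocks)
    XFER = BLOCK / 8.0e6          # transfer time, assuming ~8 Mbyte/s media rate

    def cycle_time(n, stroke=0.014, settle=0.004, rotation=0.0055):
        # CSCAN over n requests: average seek shrinks roughly as stroke/(n+1).
        return n * (stroke / (n + 1) + settle + rotation + XFER)

    def effective_bw(n):
        return n * BLOCK / cycle_time(n)          # bytes/sec at cycle size n

    def mean_cycle_size(load_frac, n_max=20):
        target = load_frac * effective_bw(n_max)  # offered load, bytes/sec/disk
        for n in range(1, n_max + 1):
            if effective_bw(n) >= target:         # arrivals during a cycle ~ n
                return n
        return n_max

    print(round(effective_bw(20) / 1e6, 2))       # max effective bandwidth at n_max = 20
    print(mean_cycle_size(0.96))                  # cycle size that absorbs a 96% load

With these assumed parameters a 96% load already settles at cycles of only a handful of requests; the qualitative behavior, shorter and less efficient cycles soaking up the spare bandwidth, does not depend on the exact numbers.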

Figure 8: Cycle size distribution

4.5 Delay bound versus allowed load

The last section showed how the probability of exceeding the delay bound varies as a function of the relative load, level of replication, block size and cycle size. Here we measure the distribution of request delays and determine, for each block size and cycle size, what delay bound guarantee can be made with a desired probability (which we again set to 10^-6). Figure 9 shows the absolute load that the system can support as a function of the delay bound. We observe that larger block sizes provide better disk bandwidth utilization, simply because the disk is used more efficiently. Of course, higher levels of replication improve the performance of the system due to better load balancing.

Figure 9: Maximum load as a function of delay bound

We also observe that large cycles allow the system to achieve slightly higher loads when longer delay bounds are used, due to the slightly increased effective disk bandwidth.

5 Simulation Results

In the last section we discussed measurement results obtained from the prototype system. In order to evaluate the performance of the storage system for a larger number of disks than are currently available in the prototype, a simulator was developed and validated. In this section we describe performance results obtained with the simulator. To validate the simulator, we compare the results generated by simulation with the experimental results of the last section. The simulation results in Figure 10 track the measured results very closely, confirming that our simulation models the system performance and disk behavior accurately.

5.1 Effect of the number of disks

Figure 11 shows the fraction of requests that miss the delay bound for different numbers of disks, with a block size of 128 Kbytes and various fractions of replication and cycle sizes. We observe that, over a fairly wide range, the system performance is not very sensitive to the number of disks, given the same relative load and replication level.

Figure 10: Simulation validation experiments.

However, when the number of disks is relatively small, the probability of missing the delay bound is slightly smaller than for larger numbers of disks. This can be explained as follows. The traffic generator generates a fixed number of requests in a specified time interval T that is proportional to the number of disks in the system, thus generating more requests, for the same relative load, in systems with more disks. The total number of requests in this interval, for a system with D disks, is denoted by αD, where α is a constant proportional to the relative load, the effective disk bandwidth, and the interval duration. Consider a system with no replication. Then the number of requests N_n in an interval that are directed to a particular disk is a random variable with a binomial distribution with n = αD trials, success probability p = 1/D and failure probability q = (D - 1)/D. The mean number of successes in n Bernoulli trials is E[N_n] = np and the variance is Var(N_n) = npq. Therefore the mean and variance of the number of requests directed to a particular disk in an interval are E[N_n] = α and Var(N_n) = α(D - 1)/D, respectively. We conclude that the variance of the number of requests in a disk queue is smaller when the number of disks is smaller, reducing the probability of overflow cycles and thus the probability of missing the delay bound, which explains the results of Figure 11. This also explains the observation that, as the number of disks becomes larger, the system performance becomes less sensitive to the number of disks, since the variance of the number of requests directed to a disk approaches α. These comments are also valid when replication is used, except that in all cases the variance is reduced due to load balancing.
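This is easy to check numerically; the short sketch below (an independent check, not part of the paper's simulator) routes αD uniformly random requests to D disks and compares the empirical per-disk variance with α(D - 1)/D.

    import random

    def per_disk_variance(D, alpha, trials=5000, seed=2):
        """Empirical variance of the number of requests landing on disk 0
        when alpha*D requests are routed uniformly at random to D disks."""
        rng = random.Random(seed)
        n = int(alpha * D)
        counts = [sum(1 for _ in range(n) if rng.randrange(D) == 0)
                  for _ in range(trials)]
        mean = sum(counts) / trials
        return sum((c - mean) ** 2 for c in counts) / trials

    alpha = 10
    for D in (4, 14, 64):
        print(D, round(per_disk_variance(D, alpha), 2), alpha * (D - 1) / D)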

Figure 11: Effect of the number of disks

5.2 Fault tolerance

The use of partial replication as described in the last sections is an effective method for providing short term load balance among the parallel disks. However, in order to provide tolerance to disk failures we need to consider alternative schemes. There are basically two approaches: one is to use full replication, and the other is to use parity groups as in RAID schemes [25].

One possibility is to use RAID subsystems to provide fault tolerance and partial replication of data blocks to provide load balance. In this approach, the total number D of disks in the system is divided into NA disk arrays with DA = D/NA disks each. Each disk array is a RAID level 5 [25] with parity group declustering [24] that stores data in parity groups of size G distributed over G different disks (G - 1 data blocks and 1 parity block; G ≤ DA). A fraction of the data blocks is then randomly replicated across all disks of the system. Replicating data blocks across all disks, instead of only within the same disk array, allows balancing the load among the disks in a disk array as well as across disk arrays. Parity blocks are not replicated, since they are only used (for reading data) during a disk failure, while data blocks are used in all situations. The ratio of useful data space to total storage space for this scheme is

    U = 1 / (1 + r + 1/(G - 1)),

where r is the fraction of data blocks that are replicated.
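As a quick check of this expression (a two-line sketch of ours; r is the symbol introduced above for the replicated fraction), the configuration used later in this section, G = 5 with 25% of the data replicated, gives a useful fraction of 2/3, i.e., the 50% storage overhead quoted in the discussion of Figure 14.

    def useful_fraction(G, r):
        # Useful data / total storage for RAID-5 parity groups of size G
        # with a fraction r of the data blocks replicated (parity not replicated).
        return 1.0 / (1.0 + r + 1.0 / (G - 1))

    print(useful_fraction(5, 0.25))   # 0.666... -> 50% storage overhead
    print(useful_fraction(5, 0.0))    # parity only: 0.8 -> 25% overhead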


Figure 12: Replication using overlapping clusters

Observe that this scheme has performance identical to the case that does not use RAID when the system is operating under normal conditions (without failure) and with only read operations, since parity groups are only used if a disk fails. Later, after we introduce the alternative schemes, we analyze performance when a disk fails.

Another scheme is to use 100% replication for both load balancing and fault tolerance. Since all data blocks are replicated, any single disk failure can be tolerated by this scheme. However, using random replication across all disks does not support the simultaneous failure of any two disks, since every pair of disks contains some data that is stored only on that pair. Imposing constraints on where data can be replicated can improve fault tolerance by allowing the system to survive multiple disk failures; however, these same constraints can reduce the capacity for short term load balancing. One such replication scheme is chained declustering [16], where data whose primary copy is stored on a given disk i is replicated only on the next logically consecutive disk i + 1. This scheme tolerates multiple disk failures, provided only that no two failures occur on logically consecutive disks. Chained declustering pays for its increased fault tolerance with less flexible short term load balancing, as our simulation results will show.

A generalization of chained declustering, which we call overlapping clusters, allows the system designer to trade off fault tolerance and realtime performance. In the overlapping clusters scheme, disks are divided into overlapping clusters of size C, offset by C/2 with respect to the neighboring clusters, as illustrated in Figure 12. The two copies of a replicated data block are then constrained to be located in the one cluster that is associated with the data block (note that each disk belongs to two clusters, and data stored on it can be associated with either of the two clusters). This scheme tolerates multiple disk failures provided the failed disks do not belong to the same cluster. It lies between random replication and chained declustering: it supports more disk failures than random replication and fewer than chained declustering, while providing better short term load balancing than chained declustering but not as good as random replication.
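A minimal sketch of this placement rule (the function names and the block-to-cluster mapping are ours, not RIO's actual allocator): each replicated block is associated with one of the 2D/C clusters, and its two copies are stored on two distinct disks of that cluster.

    import random

    def cluster_disks(k, D, C):
        # Overlapping cluster k: C consecutive disks (mod D), clusters offset by C/2.
        start = k * (C // 2)
        return [(start + i) % D for i in range(C)]

    def place_replicated_block(block_id, D, C, rng=random):
        num_clusters = 2 * D // C
        k = block_id % num_clusters                 # cluster associated with this block
        return rng.sample(cluster_disks(k, D, C), 2)

    # Example: 64 disks, clusters of size 8 -> 16 clusters {0..7}, {4..11}, ...
    print(place_replicated_block(12345, D=64, C=8))

Because both copies always lie within a single cluster of C circularly consecutive disks, a second failure makes data inaccessible only if it hits one of the 3C/2 - 1 disks that share a cluster with the already failed disk (5 disks for C = 4, 11 for C = 8, the counts that appear in the overlapping cluster rows of Table 1 below).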

Figure 13: Performance without failure

Simulation results for a sixty-four disk system are shown in Figure 13. The fraction of requests that miss the delay bound is shown for different replication schemes: RAID with partial random replication (25%), chained declustering, and overlapping clusters. Curves not drawn in the figure correspond to cases where the fraction of requests missing the delay bound is less than 10^-6 for all loads up to 99%. Clearly, full replication provides better results than partial replication (25%), even when full replication is subject to a constrained data layout such as chained declustering or overlapping clusters. We also observe that overlapping clusters provides better performance than chained declustering, and that increasing the size of the overlapping clusters improves performance, as expected, due to the additional flexibility in load balancing.

It is also interesting to study the effectiveness of these fault tolerant schemes when there is a disk failure. Performance results for a sixty-four disk system with one failed disk are shown in Figure 14. These results assume that, when disk arrays are used, data is requested in terms of individual data blocks and not in terms of entire parity groups. (For video applications it is more reasonable to read entire parity groups, or at least the data blocks in a parity group, since access is highly predictable and sequential; for other types of applications, e.g. 3D interactive virtual worlds, a larger read granularity can imply large amounts of superfluous data being read.) In this situation, a disk failure causes an increase in the system load, because unavailable data blocks must be reconstructed by reading the other G - 1 blocks of their parity group, which in general would not be read if the disk had not failed.
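To get a feel for the size of this extra load, here is a simplified estimate of our own (it deliberately ignores the load that partial replication redirects to other disk arrays): every request for a block on the failed disk turns into G - 1 reads spread over the surviving disks of the same array.

    def degraded_load_factor(DA, G):
        """Relative increase in per-disk read load on the surviving disks of a
        DA-disk array with one failed disk and parity groups of size G,
        ignoring any load redirected to other arrays through replication."""
        # Before the failure each disk carries 1/DA of the array's load; afterwards
        # the DA-1 survivors share (DA-1)/DA of normal reads plus (G-1)/DA
        # reconstruction reads.
        return (DA + G - 2) / (DA - 1)

    for DA in (8, 16, 64):
        print(DA, round(degraded_load_factor(DA, G=5), 2))

With G = 5 this gives roughly a 57% per-disk increase for an 8-disk array but only about 6% for a 64-disk array, consistent with the observation below that larger arrays spread the excess load better (at the cost of reliability).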


Figure 14: Performance with one disk failure

For all results in Figure 14 we assumed a parity group of size G = 5 (4 data blocks and 1 parity block), and thus a storage capacity overhead of 50% (25% due to parity and 25% due to replication). The additional load causes the system to saturate at a lower nominal load for the RAID-based schemes than for the others. The excess load is directed primarily to the disk array containing the failed disk, although a fraction of this load is redirected to other disk arrays by exploiting the partial replication. Thus, increasing the size of the disk array improves system performance, since the excess load is distributed among a larger number of disks; however, it also reduces the reliability of the system, since multiple disk failures within a single disk array cannot be tolerated. The case of disk arrays with 8 disks shows a lower probability of missing the delay bound at saturation than the cases with larger disk arrays, simply because the excess load saturates only the disk array that contains the failed disk, while the load redirected to the other disk arrays is not enough to saturate them; thus a fraction of the requests sent to the disk arrays without a failure still complete before the delay bound. When the disk array size is larger, a higher load is redirected to the disk arrays without a failure once the disk array with the failure saturates, eventually saturating all the disks. The region of the curve for disk arrays of size 8 that is constant at approximately 10^-1 corresponds to the situation where only the disk array with the failed disk is saturated.

The performance of chained declustering under failure is better than partial replication with parity groups only when the disk array is small (8 disks in this example). When the size of the disk array is 16 disks, the performance of chained declustering and the RAID scheme are equivalent for small delay bounds (the case with a maximum of 5 requests per cycle), but for larger delay bounds (the case with a maximum of 20 requests per cycle) chained declustering is worse. Disk arrays larger than 16 disks provide better performance than chained declustering under failure, with both small and large delay bounds and at a lower storage overhead, but with less tolerance to multiple disk failures. In our experiments, overlapping clusters is the scheme that provided the best performance under disk failure (as well as without a failure), and it performed better for larger cluster sizes. However, it has a 100% storage capacity overhead and less tolerance to multiple disk failures than chained declustering (though better than or equal to that of the RAID schemes).

Scheme                                            Mean time between "data inaccessible" failures
Chained declustering                              K/2
Random replication (100%), system with D disks    K/(D-1)
Disk array (size 8)                               K/7
Disk array (size 16)                              K/15
Disk array (size 64)                              K/63
Overlapping clusters (size 4)                     K/5
Overlapping clusters (size 8)                     K/11

Table 1: Relative mean time between "data inaccessible" failures

Table 1 shows the relative mean time between "data inaccessible" failures (failures in which some fraction of the data becomes inaccessible) for all of these schemes. We assume that the mean repair time of a disk is much smaller than the mean time between disk failures in the system. In that case the time between "data inaccessible" failures is dominated by double failures, whose rate is proportional to the number of disks in which a second failure (after the failure of a specific disk) would cause unavailability of data. Note that if the number of disks is very large this assumption may not hold, and we may have to consider failures of more than two disks, but in general it is a reasonable approximation. Table 1 shows the approximate mean time between "data inaccessible" failures relative to a constant K that can be computed as a function of the mean time to failure and the mean time to repair of a disk, and that is the same for all of the systems; this mean time is inversely proportional to the number of disks for which a second failure would cause some portion of the data to become inaccessible.

We conclude that if there is enough storage available to support 100% replication, the best scheme is overlapping clusters, which has very good performance both with and without disk failures and also relatively good tolerance to disk failures. If, however, storage is scarce, then the use of disk arrays with parity groups and partial replication (possibly with no replication if storage is too scarce) is the best choice. The selection of the disk array size requires trading off the relative importance of tolerance to "data inaccessible" failures against system performance under a single disk failure, with larger disk arrays providing higher performance and smaller disk arrays providing better fault tolerance. Likewise, the selection of the level of replication and the size of the parity groups is a tradeoff between the desired percentage of additional storage capacity and the benefits of increased performance with higher levels of replication and of better performance under disk failure with smaller parity groups.
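Under the assumptions above (repairs much faster than failures, so only double failures matter), the denominators in Table 1 are simply counts of the disks whose failure, after one disk has already failed, would make some data inaccessible. A short sketch that reproduces them:

    def critical_disks(scheme, D=64, DA=8, C=8):
        """Disks whose failure, given one already failed disk, loses data;
        the mean time between "data inaccessible" failures is ~ K / count."""
        if scheme == "chained_declustering":
            return 2                # only the two logically adjacent disks
        if scheme == "random_full_replication":
            return D - 1            # some block pair lives only on the failed disk and any other
        if scheme == "raid_array":
            return DA - 1           # any second failure inside the same array
        if scheme == "overlapping_clusters":
            return 3 * C // 2 - 1   # disks sharing at least one cluster with the failed disk
        raise ValueError(scheme)

    print(critical_disks("chained_declustering"))            # 2  -> K/2
    print(critical_disks("random_full_replication", D=64))   # 63 -> K/(D-1)
    print(critical_disks("raid_array", DA=8))                 # 7  -> K/7
    print(critical_disks("raid_array", DA=16))                # 15 -> K/15
    print(critical_disks("raid_array", DA=64))                # 63 -> K/63
    print(critical_disks("overlapping_clusters", C=4))        # 5  -> K/5
    print(critical_disks("overlapping_clusters", C=8))        # 11 -> K/11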

6 Performance Comparison of Random Allocation and Disk Striping with Synchronized Cycles

In this section we compare the performance of random allocation with replication against that of systems using striping. In particular, many VOD systems for CBR video have been proposed in which videos are striped over all disks, time is divided into fixed-size intervals (called cycles), and each disk reads a set of (up to) n blocks per cycle. We refer to this as the classic system design, simply as a shorthand for this section. RIO is applicable to much more general workloads, so the comparison we do here strongly favors the classic striping systems, which can be tailored to a CBR workload.

We assume that each media stream requires a constant bit rate of 1.5 Mbits/sec (approximately the rate of an MPEG1 video) and that a buffer of 256 Kbytes is allocated to each data stream, and we compute the maximum number of streams that can be supported by both schemes. After a given block is read, it is "played out" at a constant rate, and the playout duration should match the cycle length.

The block size is set to half the size of the stream buffer, since while the last block read is being consumed from one half of the buffer, the next block is being read into the other half. Since we are assuming a buffer of 256 Kbytes per stream, the block size used is 128 Kbytes. The duration of a cycle should be equal to the time necessary for the user to consume one data block. For a required rate of 1.5 Mbits/sec and a block of 128 Kbytes, the cycle time is

    T = (128 × 8) / (1.5 × 1024) sec = 32/48 sec ≈ 0.667 sec.

To compute the number of disk block read requests that can be served by a disk in each cycle, we have to consider the cycle time distributions for different numbers of requests, as given in Figure 4: the number of requests per cycle is the largest number that can be guaranteed to be read in the T = 0.667 second playout time. To have a fair comparison with random allocation, we select the number of requests in a cycle such that the probability of having a cycle longer than the required value T is less than or equal to the desired probability of exceeding the delay bound. To avoid being too conservative for the classic system, we ignored possible extra delays due to thermal calibration in this case, although they were included for our scheme. (So all assumptions are made to favor the classic striping system.)
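Writing the arithmetic out (using the 1024-based units implied by the calculation above, i.e., 1 Kbyte = 1024 bytes and 1 Mbit = 1024 Kbits):

    BLOCK_BYTES = 128 * 1024          # 128 Kbyte block, half of the 256 Kbyte buffer
    RATE_BITS = 1.5 * 1024 * 1024     # 1.5 Mbit/s playout rate

    T = BLOCK_BYTES * 8 / RATE_BITS   # time for a stream to consume one block
    print(T)                          # -> 0.666... seconds, the cycle length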

Figure 15: Maximum cycle time

We assume a probability of exceeding the delay bound equal to 10^-6 and then compute the maximum cycle time Tmax as the time such that the probability of having a cycle longer than Tmax is equal to 10^-6.

Scheme                                      Maximum number of streams   Relative performance (w.r.t. classic striping)
Classic striping                            238                         100%
Random allocation with 100% replication     270                         113%
Random allocation with 25% replication      255                         107%
Random allocation with 0% replication       215                         90%

Table 2: Comparison of random replication with classic striping

Figure 15 shows this maximum cycle time Tmax as a function of the number of requests in a cycle. We observe that for a maximum cycle time equal to 0.667 sec, the maximum number of requests per disk in a cycle is 17. For RIO, we use the same block size of 128 Kbytes and a delay bound of 0.667 sec. For these values we use the graphs of Figure 9 to compute the maximum bandwidth per disk that the system can support for a probability of exceeding the delay bound of 10^-6, and then divide this value by the stream data rate. From Figure 9, the maximum supported bandwidth per disk is 2.88 Mbytes/sec, 3.43 Mbytes/sec and 3.62 Mbytes/sec for 0%, 25% and 100% replication, respectively. Using these values, Table 2 shows the maximum number of streams that can be supported by both schemes, assuming a system with 14 disks. Although the example is based on a 14-disk system, the relative performance is approximately the same for any number of disks, since we have shown that the performance of random allocation is relatively insensitive to the number of disks, and the same is true for the striping system.

The results of this example show that the performance of RIO's random allocation scheme is competitive with that of classic striping for a CBR video workload. In this example the performance of random allocation is on the order of 10% better than the striping system when replication is used (even with partial replication) and 10% worse when no replication is used. Although random replication has to operate at a lower throughput than the aggregate disk capacity, so does the striping design: for striping with synchronized cycles, many disks sit idle at the tail end of each cycle waiting for the beginning of the next one, so throughput capacity is lost. Since our results in Sections 4 and 5 show that the performance of random allocation is not far from the maximum effective disk bandwidth, we can conclude that random allocation will be competitive with classic striping for all buffer sizes, playout rates, disk characteristics, etc. However, we expect the performance

of random allocation to actually be better than striping in most cases. Moreover, random replication has many other advantages over striping, such as the flexibility to support streams with different playout rates, VCR functionality, the flexibility to switch among multiple data representations for dynamic quality of service, etc.
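The stream counts in Table 2 can be reproduced from the figures quoted above (a back-of-the-envelope check of ours; the 17 requests per disk per cycle for classic striping comes from Figure 15, and the per-disk bandwidths from Figure 9):

    STREAM_RATE = 1.5 * 1024 * 1024 / 8        # bytes/sec consumed by a 1.5 Mbit/s stream
    DISKS = 14

    # Classic striping: each stream needs one 128 Kbyte block per 0.667 s cycle,
    # and each disk can serve at most 17 such requests per cycle.
    print("classic striping:", DISKS * 17)     # 238 streams

    # RIO: per-disk bandwidth sustainable at the 10^-6 target, from Figure 9.
    for repl, mbytes_per_sec in [("0%", 2.88), ("25%", 3.43), ("100%", 3.62)]:
        streams = int(DISKS * mbytes_per_sec * 1024 * 1024 / STREAM_RATE)
        print("random allocation,", repl, "replication:", streams)
        # -> 215, 256, 270; Table 2 lists 215, 255 and 270 (the difference
        #    is rounding of the quoted per-disk bandwidth)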

7 Conclusion

We have presented the principal issues involved in the design of a parallel storage system for realtime multimedia applications. The realtime storage system described in this paper was designed and implemented as part of the VWDS project at UCLA. The principal characteristics of the proposed storage system are random allocation of blocks to disks and the use of partial replication to improve load balancing, and thus to achieve smaller probabilities of exceeding the delay bound for realtime requests. Our experimental results showed the validity of our approach, and demonstrated that it is possible to achieve small probabilities of exceeding the delay bound (less than 10^-6) with high disk utilization (70% to 95% of the effective disk bandwidth) for relatively small request delay bounds (0.5 to 1.5 seconds) using partial replication, and 60% to 80% disk utilization when no replication is used. We also showed that full replication allows almost full disk utilization (up to 99% of the effective disk bandwidth).

We also discussed different approaches to fault tolerance and analyzed their performance both when a disk fails and under normal operating conditions. We proposed a fault tolerance scheme, which we call overlapping clusters, that has both good performance characteristics and good reliability with one failed disk. Overlapping clusters is the preferred scheme when storage space is not scarce and 100% replication is acceptable. In systems where storage space is scarce relative to bandwidth, the use of RAID schemes combined with partial replication would be preferred.

Finally, we compared the performance of our scheme with a classic striping system under the ideal workload for striping: videos in which each playout stream requires the same constant bit rate. Even in this ideal situation we showed that our scheme has superior performance to classic striping when full or partial replication is used, while its performance is only 10% below striping when no replication is used.

References

[1] Y. Azar, A.Z. Broder, A.R. Karlin, E. Upfal, "Balanced Allocations", Proc. 26th Annual ACM Symposium on the Theory of Computing (STOC 94), pp. 593-602, 1994.

[2] S. Berson, L. Golubchik and R.R. Muntz, "Fault Tolerant Design of Video-on-Demand Storage Servers", SIGMOD 95, pp. 364-375, 1995.
[3] S. Berson, R. Muntz, W. Wong, "Randomized Data Allocation for Real-time Disk I/O", Compcon 96, pp. 286-290, 1996.
[4] Y. Birk, "Track-Pairing: A Novel Data Layout for VOD Servers with Multi-Zone-Recording Disks", Proceedings of the IEEE International Conference on Multimedia Computing and Systems, pp. 248-255, May 1995.
[5] J. Bucklew, Large Deviation Techniques in Decision, Simulation, and Estimation, New York, Wiley, 1990.
[6] E. Chang, A. Zakhor, "Variable Bit Rate MPEG Video Storage on Parallel Disk Arrays", 1st International Workshop on Community Networking: Integrated Multimedia Services to the Home, IEEE Press, pp. 127-137, 1994.
[7] L.T. Chen, D. Rotem, "Declustering Objects for Visualization", VLDB 1993, Dublin, Ireland, pp. 85-96, 1993.
[8] A.L. Chervenak, D.A. Patterson, R.H. Katz, "Choosing the Best Storage System for Video Service", ACM Multimedia 95, pp. 109-119, 1995.
[9] T. Chiueh, R.H. Katz, "Multi-Resolution Video Representation for Parallel Disk Arrays", ACM Multimedia 93, pp. 401-409, 1993.
[10] T. Funkhouser, C. Sequin, S. Teller, "Management of Large Amounts of Data in Interactive Building Walkthroughs", ACM SIGGRAPH Proc. of the 1992 Symposium on Interactive 3D Graphics, 1992.
[11] D.L. Eager, E.D. Lazowska, J. Zahorjan, "Adaptive Load Sharing in Homogeneous Distributed Systems", IEEE Trans. on Software Engineering, pp. 662-675, 1986.
[12] W. Gekelman, D. Leneman, J. Maggs, "Experimental Observation of Alfven Wave Cones", Physics of Plasmas, 1, pp. 3775-3783, 1994.
[13] D. Gemmell, H. Vin, D. Kandlur, V. Rangan, L. Rowe, "Multimedia Storage Servers: A Tutorial", IEEE Computer, pp. 40-49, May 1995.

[14] E. Grochowski, R.F. Hoyt, "Future Trends in Hard Disk Drives", IEEE Transactions on Magnetics, Vol. 32, No. 3, May 1996.
[15] J. Hsieh, M. Lin, J.C.L. Liu, D.H.C. Du, "Performance of a Mass Storage System for Video-On-Demand", Journal of Parallel and Distributed Computing, Vol. 30, No. 2, pp. 147-167, Nov. 1995.
[16] H. Hsiao, D.J. DeWitt, "Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines", Proc. of Data Engineering, pp. 456-465, 1990.
[17] K. Jeffay, D.F. Stanat, C.U. Martel, "On Non-preemptive Scheduling of Periodic and Sporadic Tasks", Proc. of Real-time Systems Symp., pp. 129-139, Dec. 1991.
[18] W. Jepson, R. Liggett, S. Friedman, "Virtual Modeling of Urban Environments", Presence: Teleoperators and Virtual Environments, Vol. 5, No. 1, MIT Press, 1996.
[19] W. Karplus, M.R. Harreld, "The Role of Virtual Environments in Clinical Medicine: Scientific Visualization", Proc. First Joint Conference of International Simulation Societies (CISS), Zurich, Switzerland, pp. 13-17, September 1994.
[20] E.W. Knightly, "H-BIND: A New Approach to Providing Statistical Performance Guarantees to VBR Traffic", Proc. of IEEE INFOCOM, 1996.
[21] C.L. Liu, J.W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-time Environment", Journal of the ACM, pp. 46-61, 1973.
[22] E.L. Miller, R.H. Katz, "RAMA: Easy Access to a High-Bandwidth Massively Parallel File System", USENIX 95, pp. 59-70, 1995.
[23] M.D. Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing", PhD Dissertation, University of California at Berkeley, Computer Science Department, 1996.
[24] R. Muntz, J. C-S. Lui, "Performance Analysis of Disk Arrays Under Failure", Proc. VLDB, Brisbane, Australia, pp. 162-173, 1990.
[25] D.A. Patterson, G. Gibson, R.H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)", SIGMOD 88, pp. 109-116, 1988.
[26] A.L.N. Reddy, J. Wyllie, "Disk Scheduling in a Multimedia I/O System", ACM Multimedia 93, pp. 225-233, 1993.

[27] C. Ruemmler, J. Wilkes, "An Introduction to Disk Drive Modeling", IEEE Computer, pp. 17-28, March 1994.
[28] J.R. Santos, R. Muntz, "Design of the RIO (Randomized I/O) Storage Server", UCLA CSD Technical Report, June 1997.
[29] M. Seltzer, P. Chen, J. Ousterhout, "Disk Scheduling Revisited", USENIX Winter 90, pp. 313-324, 1990.
[30] R. Tewari, R. Mukherjee, D.M. Dias, H.M. Vin, "Design and Performance Tradeoffs in Clustered Video Servers", Proceedings of the International Conference on Multimedia Computing and Systems, Los Alamitos, CA, IEEE Computer Society Press, pp. 144-150, 1996.
[31] F.A. Tobagi, J. Pang, R. Baird, M. Gang, "Streaming RAID - A Disk Management System For Video Files", ACM Multimedia 93, pp. 393-400, 1993.
[32] H. Vin, S. Rao, P. Goyal, "Optimizing the Placement of Multimedia Objects on Disk Arrays", Proc. International Conf. on Multimedia Computing and Systems, IEEE Press, pp. 158-165, 1995.
[33] J.L. Wolf, P.S. Yu, H. Shachnai, "DASD Dancing: A Disk Load Balancing Optimization Scheme for Video-on-Demand Computer Systems", SIGMETRICS 1995, pp. 157-166, 1995.
[34] B.L. Worthington, G.R. Ganger, Y.N. Patt, J. Wilkes, "On-Line Extraction of SCSI Disk Drive Parameters", SIGMETRICS 95, pp. 146-156, 1995.
[35] P. Yu, M. Chen and D. Kandlur, "Design and Analysis of a Grouped Sweeping Scheme for Multimedia Storage Management", NOSSDAV'92, pp. 44-55, 1992.
