Sensitivity Analysis of Server Virtualized System Availability


Rubens de S. Matos, Paulo R. M. Maciel, Members, IEEE, Fumio Machida, Dong Seong Kim, Members, IEEE, and Kishor S. Trivedi, Fellow, IEEE

Abstract—Server virtualization is a technology used in many enterprise systems to reduce operation and acquisition costs, and to increase the availability of critical services. Virtualized systems can be even more complex than traditional non-virtualized systems, so the quantitative assessment of their availability is correspondingly more difficult. In this paper, we propose a sensitivity analysis approach to find the parameters that deserve the most attention for improving system availability. Our analysis, based on Markov reward models, suggests that the host failure rate is the most important parameter when the measure of interest is the system mean time to failure. For capacity oriented availability, the failure rate of applications is another major concern. The results of both analyses were cross-validated by varying each parameter in isolation and checking the corresponding change in the measure of interest. A cost-based optimization method helps to highlight the parameter that should have the highest priority in system enhancement.

Index Terms—Availability modeling, cost analysis, continuous time Markov chain, mean time to failure, sensitivity analysis, server virtualization.

R. S. Matos Júnior and P. R. M. Maciel are with the Center of Informatics, Federal University of Pernambuco, Recife, Brazil (e-mails: [email protected], [email protected]). K. S. Trivedi is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA (e-mail: [email protected]). Dong Seong Kim is with the Department of Computer Science and Software Engineering, University of Canterbury, Christchurch, New Zealand, and also with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA (e-mail: [email protected]). Fumio Machida is with Service Platforms Research Laboratories, NEC Corporation, Japan, and also with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA (e-mail: [email protected]).

ACRONYMS

COA     Capacity Oriented Availability
CTMC    Continuous Time Markov Chain
GSPN    Generalized Stochastic Petri Net
MTTF    Mean Time to Failure
RBD     Reliability Block Diagram
SAN     Storage Area Network
SLA     Service Level Agreement
SRN     Stochastic Reward Net
TCO     Total Cost of Ownership
VM      Virtual Machine
VMM     Virtual Machine Monitor

NOTATION

π           Steady-state probability vector
S_θ(Y)      Sensitivity of measure Y with respect to parameter θ
SS_θ(Y)     Scaled sensitivity of measure Y with respect to parameter θ
Q           CTMC generator matrix
E[X]        Expectation of random variable X
r_i         Reward rate assigned to state i

I. INTRODUCTION

In data centers, server virtualization has become an important technology to reduce costs, ease administration, and increase service availability, among other benefits. Server virtualization is also used in cloud computing, because virtual machines (VMs) provide high flexibility and low maintenance effort: all hardware maintenance work and most software maintenance tasks are delegated to the administrators of the cloud computing service. Some server virtualization products support high availability and fault tolerance mechanisms [6], [36], but the quantitative analysis of the impact of virtualization on overall system availability has not been intensively studied. Administrators have to deal with new failure possibilities, caused by the introduction of additional layers, such as the Virtual Machine Monitor (VMM), and with new dependencies among virtual machines, software, and hardware components. Modeling the complex relations between software and hardware failures is needed to achieve an accurate assessment of the availability of virtualized systems.

Some analytic techniques have been used to model these systems. A continuous time Markov chain (CTMC) was used to evaluate the dependability of a virtualized single-server system [40]. Papers [19], [26] used stochastic reward nets (SRNs) to represent virtualized two-server systems, capturing failures, reactive recovery, and proactive recovery by software rejuvenation. Another interesting approach is based on two-level hierarchical modeling, which uses fault trees at the upper level and CTMCs at the lower level [16].

Support for sensitivity analysis in these analytic models is important for detecting bottlenecks in system availability. Sensitivity analysis is traditionally performed as a discrete analysis, which simply varies the input parameters over their value ranges and graphs the effects on the output measures. Differential parametric sensitivity analysis, in contrast, is a local method, and is therefore suited to computing the sensitivity of measures to small parameter variations. One of its advantages over discrete methods is the reduced computation time.

In this paper, we propose a method based on parametric sensitivity analysis of CTMCs for evaluating the sensitivity of availability-related measures. The method was applied to analyze availability measures of VMs subject to failure, recovery, and migration [36], as well as the failure and repair behavior of their applications. We implemented the parametric sensitivity analysis module in the software package SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) [34], and use this technique to determine bottlenecks in virtualized system availability.

The rest of the paper is organized as follows. In Section II, we describe the background of server virtualization and its modeling for availability analysis. In Section III, a brief theory of sensitivity analysis of Markov chains is presented. In Section IV, we give details of the model used for the case study, and the numerical results of its sensitivity analysis are presented in Section V. Finally, our conclusions are given in Section VI.


Fig. 1. Layers of a classic virtualized system.

II. BACKGROUND

A. Server virtualization

Virtualization of systems emerged during the 1960s and early 1970s. This technology provided a way for different user groups to run different operating systems on shared mainframe computers [27]. VMMs (also known as hypervisors) were developed to provide a software-abstraction layer that partitions a hardware platform into one or more VMs [13]. A typical system configuration using a VMM is shown in Fig. 1. The VMM manages all hardware resources as an operating system does, and accesses the resources according to the operations requested by the guest operating system (OS) and application processes.

The 1980s saw a decrease in hardware costs that drove many organizations to migrate from large centralized mainframes to collections of departmental minicomputers [20]. This downsizing trend reduced the motivation for using system virtualization during the 1980s and 1990s. In the last decade, virtualization technology has received renewed attention in response to the growth of underutilized computer resources. This renewed interest is also encouraged by requirements for reducing the costs of system administration, scarcity of floor space, power consumption, and thermal dissipation. Virtualization enables the consolidation of different services to improve resource utilization, and reduces the Total Cost of Ownership (TCO).

Recently, VMs have been used for mission-critical applications, and advanced features such as live migration (e.g., Citrix XenMotion [6], Microsoft Hyper-V [32]) and VM high availability services (e.g., VMware H.A. [36]) have been adopted to automatically restart clean VMs on other hosts in the face of host failures. The VM can move back to the original host when the host is recovered. Fig. 2 sketches how this mechanism works. In the upper left corner, two hosts are active, each running an application (APP1 and APP2) on top of a VM (VM1 and VM2, respectively), which in turn is managed by a VMM (VMM1 and VMM2). A Storage Area Network (SAN) is used to guarantee that data are available to both applications regardless of which machine they are running on, as well as to store the bootable virtual machine images. When a failure is detected (by an external monitoring mechanism) on the host that runs APP1, VM1 (and consequently the application) is restarted on the other host (using a mechanism such as VMware H.A.). The second host then runs VM1 and VM2 until the first host is repaired. When the repair is completed, and the first host is active again, VM1 is migrated back to that host without perceptible service interruption. The continuous development of such features shows that virtualization is not only a software platform for multitasking, but also an attractive means of improving system availability and reliability.

Fig. 2. Availability improvement using VM services.

B. Availability modeling of virtualized systems

Several approaches have been proposed to assess the availability of virtualized systems. In [25], the simple paradigm of the reliability block diagram (RBD) is adopted to analyze the effects of virtualization on system dependability. The results highlight the importance of the hypervisor's reliability and of the number of VMs to system reliability, and show that system reliability may decrease with the introduction of virtualization. That paper only considered host-level failures, and did not incorporate functionally interdependent hardware and software failures and the respective recovery mechanisms. In [31], a CTMC is used to capture the behavior of a virtualized system, but the proposed model captures only VM-level behavior. A more detailed analysis is presented in [16], which adopts a hierarchical modeling strategy based on fault trees and CTMCs. That analysis takes into account hardware device failures (CPU, memory, power, etc.) and software failures (VMs, VMM, applications). The effect of VM high-availability services on availability measures such as downtime is also studied. The results of a discrete sensitivity analysis show the changes in system availability as some parameter values are varied.

Besides the analytic approaches, an experimental analysis was proposed in [38]; the objective was to quantify the slowdown and downtime experienced by a specific web application when VM migrations are performed at run time. The results showed that the migration overhead is acceptable, but cannot be disregarded for systems regulated by strict Service Level Agreements (SLAs) based on availability and responsiveness requirements. In this paper, we also study the consequences of migration on the availability of virtualized systems, but using differential parametric sensitivity analysis.

III. PARAMETRIC SENSITIVITY ANALYSIS

Sensitivity analysis is a method to determine the factors that are most influential on model output measures [14]. To be precise, there are two kinds of sensitivity analysis. The first kind is nonparametric sensitivity analysis, which may study output variations caused by modifications in the structure of a model (e.g., the addition or removal of a given component) [3]. The other kind is parametric sensitivity analysis, which studies variations of the output due to changes in the input parameter values [11]. Among other aspects, a system may be characterized by how its parameters are represented in a suitable model, by the order of its differential or difference equations, and by its linearity or nonlinearity [11]. Initial conditions and time invariance or variance are also essential for studying the effects of parameter variations on system behavior.

The choice between scaled and unscaled sensitivity analysis depends on the type of output measure [28], on whether removing units is useful, and on the range of values; scaling is also useful for coping with the complexity of solutions. The terms scaling, normalization [10], standardization, and nondimensionalization [12] are often used interchangeably, although in specific circumstances they deserve particular distinction. This work adopts the term scaling as a synonym of the other terms, but whenever one of them better explains the context, it is applied in that specific context. In many planning studies, several measures are expressed in different units, and they are commonly involved in the decision process. Measure scaling is one way to handle concurrent measure units and harmonize them. The range of values should also be taken into consideration: if a set of measures spans a large range of values, logarithmic scaling may be valuable. Logarithmic scaling may also be adopted when the system behavior varies as powers of some attributes. Logarithms can also ease the solution of models, because a product can be computed as a sum of logarithms, and differential equations can be solved as systems of linear equations if suitable transformations are applied, among other useful properties.

Parametric sensitivity analysis has been applied in particular to assess the effect of changes in the performance and dependability of components on system measures of interest. This approach can be used to find performance, reliability, or availability bottlenecks in systems, and to guide system improvement and optimization [4]. Differential sensitivity analysis is performed by computing the partial derivatives of the measure of interest with respect to each input parameter. The sensitivity of a given measure Y that depends on a specific parameter λ is computed as in (1), or as in (2) for the scaled sensitivity.

S_λ(Y) = ∂Y/∂λ,                              (1)

SS_λ(Y) = (∂Y/∂λ) (λ/Y).                     (2)
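To make (1) and (2) concrete, consider the steady-state availability A = μ/(λ + μ) of a single component with failure rate λ and repair rate μ. The following sketch is illustrative and not part of the paper: the two-state model is an assumption, with values borrowed from the host rows of Table II.

# Minimal sketch: unscaled (1) and scaled (2) sensitivity of the
# availability A = mu / (lam + mu) with respect to the failure rate lam.
# The two-state model and parameter values are illustrative assumptions
# (host MTTF and repair time taken from Table II).

lam = 1.0 / 2654.0         # failure rate in 1/hr (MTTF = 2654 hr)
mu = 1.0 / (100.0 / 60.0)  # repair rate in 1/hr (repair time = 100 min)

A = mu / (lam + mu)            # steady-state availability
S_lam = -mu / (lam + mu) ** 2  # equation (1): dA/d(lam), closed form
SS_lam = S_lam * lam / A       # equation (2): scaled sensitivity

# Numerical cross-check of the derivative via a small perturbation.
h = 1e-9
S_num = (mu / (lam + h + mu) - A) / h

print(f"A = {A:.6f}")
print(f"S_lam(A):  analytic = {S_lam:.4f}, numeric = {S_num:.4f}")
print(f"SS_lam(A) = {SS_lam:.3e}")

The scaled value SS_λ(A) = −λ/(λ + μ) is dimensionless, which is what makes rankings across parameters with different units meaningful.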

Parametric sensitivity analysis has been studied for continuous time Markov chains (CTMCs) [4], generalized stochastic Petri nets (GSPNs) [21], and queueing systems [39]. This paper focuses on two measures of Markov reward models: the steady-state probability, and the steady-state expected reward rate. These measures are used to analyze the steady-state availability, the equivalent mean time to system failure, and the capacity oriented availability of the server virtualized system presented in Section IV. To compute the sensitivity indices of the steady-state probabilities, the derivatives of the following set of equations must be computed.

πQ = 0,                                      (3)

∑_i π_i = 1.                                 (4)

In (3) and (4), Q denotes the CTMC generator matrix, and π represents the steady-state probability vector. Differentiating with respect to a parameter θ yields

(∂π/∂θ) Q = −π (∂Q/∂θ),                      (5)

∑_i ∂π_i/∂θ = 0.                             (6)

Details on the solution of these equations using Successive Over-Relaxation (SOR) are given in [5], [29]. We implemented this solution module in the SHARPE software package [34]. In a similar manner, computing the sensitivity of E[X], the expected steady-state reward rate, requires the derivative of (7). If the reward rates r_i associated with the model states are functions of the parameter θ, the sensitivity is expressed by (8). If the reward rates do not depend on this parameter, the sensitivity is computed by (9). Both cases are supported in our SHARPE implementation.

E[X] = ∑_i r_i π_i                                       (7)

∂E[X]/∂θ = ∑_i (∂r_i/∂θ) π_i + ∑_i r_i (∂π_i/∂θ)         (8)

∂E[X]/∂θ = ∑_i r_i (∂π_i/∂θ)                             (9)
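The sketch below ties equations (3) through (9) together for a hypothetical two-state (up/down) CTMC. It is not the paper's implementation: SHARPE solves these linear systems with SOR, whereas this sketch uses a dense direct solver from numpy; the chain, the reward vector, and the parameter values are illustrative assumptions.

import numpy as np

def steady_state_and_sensitivity(Q, dQ):
    # Solve pi Q = 0 with sum(pi) = 1 (eqs. 3-4), then
    # (dpi/dtheta) Q = -pi (dQ/dtheta) with sum(dpi) = 0 (eqs. 5-6).
    n = Q.shape[0]
    A = Q.T.copy()
    A[-1, :] = 1.0                # replace one balance equation with (4)
    b = np.zeros(n)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    rhs = -(dQ.T @ pi)            # right-hand side of (5), transposed
    rhs[-1] = 0.0                 # normalization condition (6)
    dpi = np.linalg.solve(A, rhs)
    return pi, dpi

# Illustrative two-state (up/down) chain; theta is the failure rate lam.
lam = 1.0 / 2654.0                # per-hour failure rate (Table II host MTTF)
mu = 1.0 / (100.0 / 60.0)         # per-hour repair rate (100 min repair)
Q = np.array([[-lam, lam],
              [mu, -mu]])
dQ = np.array([[-1.0, 1.0],       # dQ/d(lam)
               [0.0, 0.0]])
pi, dpi = steady_state_and_sensitivity(Q, dQ)

r = np.array([1.0, 0.0])          # reward 1 in the up state, 0 when down
E_X = r @ pi                      # equation (7)
dE_X = r @ dpi                    # equation (9): r does not depend on lam
print(f"pi = {pi}, E[X] = {E_X:.6f}, dE[X]/dlam = {dE_X:.4f}")

For this chain, E[X] is simply the availability μ/(λ + μ), so the printed derivative can be checked against the closed form −μ/(λ + μ)² from the earlier sketch; with reward rates that depend on θ, the extra ∑_i (∂r_i/∂θ) π_i term of (8) would be added.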

IV. A CASE STUDY

Fig. 3. Architectures of the two-host system: (a) non-virtualized system; (b) virtualized system.

In this section, we present a study of a virtualized system composed of two hosts, in which each host has one VM running on a VMM. Figs. 3a and 3b depict the hardware and software parts of this system. Fig. 3a illustrates a system without VMM and VMs (a non-virtualized system), in which the operating system of each server (OS1 and OS2) directly uses the underlying hardware (CPU, memory, power supply, network interface card (NIC), and cooling device). Fig. 3b presents a virtualized system implemented on the hardware infrastructure depicted in Fig. 3a. This virtualized system has one VMM running on each host machine. The VMMs are responsible for providing access to the hardware resources for the guest operating systems, which run on top of the VMM. Both hosts share a common SAN that supports VM live migration. In case of a host failure, the VM running on that host may be migrated to the other host. The same application is executed on both hosts, and we denote the instances as App1 and App2 to stress that they are two processes running on their respective hosts. The application is configured as an active-active cluster in a virtualized system [18], i.e., both applications are responsible for processing incoming requests, and in case of a failure of one, the application on the other node processes the requests that were supposed to be processed by the failed instance.

A hierarchical model is adopted for representing the virtualized system in [16]. A top-level fault tree model is used in that paper to provide a high-level dependability view of the system shown in Fig. 3b. The availability of each component (or subsystem) is computed through individual Markov chain sub-models, each of which represents a particular system component. In the remainder of this section, we show the details of the VMs' subsystem availability model, followed by the cost functions defined for this system. Afterward, we present the sensitivity functions, based on availability and cost metrics, which are used to analyze the VMs' subsystem.

A. VMs' availability model

In this section, we introduce the VMs' subsystem availability model, depicted in Fig. 4. The detailed behavior of the hardware components (e.g., CPU, memory, cooler) and of the VMM is not considered in this paper; only the VM, host, and application components are. We use a notation for the model states that is based on the current condition of each component. The first character represents the state of the first host: up (U), failed (F), or failure detected (D). The second character represents the state of the first VM and its application: both up (U), App1 failed (Fa), App1 failed and detected (Da), App1 failed and an additional repair needed because an application restart does not solve the problem (Pa), VM1 failed (Fv), VM1 failed and detected (Dv), VM1 failed and a manual repair needed (Pv), VM1 and App1 restarting (R), or VM1 and App1 not running on host1 (X). The third character represents whether VM2 and App2 are running on host1: not running (X), running on host1 (U), or restarting on host1 (R). The fourth character represents the state of the second host: up (U), failed (F), or failure detected (D). The fifth character represents the state of the second VM and its application, following the same notation used for VM1 and App1. The sixth and last character represents whether VM1 and App1 are running on host2, using the same notation as the third character.

Fig. 4. VMs' availability model.

This availability CTMC model results from a state truncation of a more complex SRN (Stochastic Reward Net) model that considers all possible consecutive failures (of applications, VMs, and hosts). We verified that the model obtained by state truncation provides accurate results when compared to the complete model. Therefore, we use only this CTMC model, which considers at most two consecutive host failures. The detailed meanings of some states are explained in Table I; the other states can be understood in a similar manner by changing the host number (a tokenization sketch of the state labels follows the table). We shaded the states in which the system is down; in these states, neither application is working properly, so the service is unavailable. We consider that a covered failure in an application is recoverable by means of a simple application restart, whereas an uncovered failure needs manual, and longer, intervention to repair the application.

Table II shows the input parameter values used for computing the availability of the system. Most of these values are found in the literature. Application server failure rates are mentioned in [30], as well as the mean times to repair for hardware and application failures. From the fraction of imperfect recovery (FIR) reported in [30], we derived the coverage factor for VM repair. The time for detection of host failures is based on the usual configuration reported in [37]. The time for detection of application and OS failures was chosen according to the default values of popular monitoring tools (e.g., Nagios [22]). VM live migration and restart times are characterized in [7] and [17], respectively. The remaining values were obtained from experimental studies, or were "estimated guesses" based on our empirical knowledge.


TABLE I
NOMENCLATURE OF STATES

State     Description
UUXUUX    VM1 running on H1, VM2 running on H2
UFaXUUX   App1 failed, both VMs and hosts are up
UDaXUUX   App1 failure is detected
UPaXUUX   App1 failure is not covered; an additional recovery step is started
UFvXUUX   H1 up, VM1 failed, VM2 running on H2
UDvXUUX   VM1 failure is detected
UPvXUUX   VM1 failure is not covered; manual repair is started
UXXUUU    VM1 and VM2 are running on H2
FUXUUX    H1 failed, VM2 running on H2
DXXUUR    H1 failure is detected, VM1 is restarted on H2
DXXUUU    H1 is down, VM1 and VM2 running on H2
UXXFUU    H1 up, H2 failed while VM1 and VM2 were running on it
UXXDUU    H2 failure is detected, and the two VMs are on H2
DXXFUU    H1 is down, H2 failed while VM1 and VM2 were running on it
DXXDUU    H1 is down, H2 failure is detected
DXXURR    H1 is down, H2 is up, VM1 and VM2 are restarting (booting) on H2
UXXURR    H1 is up, H2 is up, VM1 and VM2 are restarting (booting) on H2
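The state labels above can also be tokenized mechanically. The helper below is hypothetical (it is not part of the paper or of SHARPE); it splits a label into its six components following the nomenclature of Section IV-A, treating the two-character failure codes as single tokens.

# Hypothetical helper: split a Fig. 4 / Table I state label into its six
# components. The two-character tokens are the VM/App failure codes
# defined in the nomenclature of Section IV-A.
TWO_CHAR_TOKENS = {"Fa", "Da", "Pa", "Fv", "Dv", "Pv"}

def parse_state(label):
    def take_vm_token(i):
        # VM+application positions use a two-character failure code,
        # or a single character (U, R, or X) otherwise.
        tok = label[i:i + 2]
        return (tok, i + 2) if tok in TWO_CHAR_TOKENS else (label[i], i + 1)

    i = 0
    host1, i = label[i], i + 1        # U, F, or D
    vm1app1, i = take_vm_token(i)
    vm2_on_h1, i = label[i], i + 1    # X, U, or R
    host2, i = label[i], i + 1
    vm2app2, i = take_vm_token(i)
    vm1_on_h2, i = label[i], i + 1
    assert i == len(label), f"unparsed suffix in {label!r}"
    return {"host1": host1, "vm1app1": vm1app1, "vm2_on_h1": vm2_on_h1,
            "host2": host2, "vm2app2": vm2app2, "vm1_on_h2": vm1_on_h2}

# Example from Table I: H1 failure detected, VM1 restarting on H2.
print(parse_state("DXXUUR"))

Applying it to the labels in Table I yields the component tokens that the descriptions spell out.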

B. Cost functions for the VMs' availability model

To maintain a virtualized system that hosts an application hosting service, the owner of the system needs to consider the cost of system operation as well as the cost of system unavailability. Assuming a small application hosting service running on a virtualized system, we analyze the total cost of system operation, including the cost related to SLA violations and the cost of actions taken to improve the failure and recovery parameters. The costs taken into account here are categorized into four types: 1) SLA violation cost, 2) power consumption cost, 3) hardware procurement and replacement cost, and 4) maintenance manpower cost. This paper considers a scenario where the system is first evaluated to check whether improvements related to its availability are needed. Therefore, we define a cost function that encompasses the operational cost in a specified period, detailed in the four types just described, plus a fifth type: the cost of updating the system in that same period. In other words, we consider only one system update in the respective period.
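Before each term is detailed, the overall shape of this cost function can be sketched as a plain sum of the five categories over the evaluation period. The class and field names below are illustrative assumptions; the paper defines each term precisely in the following subsections.

from dataclasses import dataclass

# Sketch only: the five cost categories named above, aggregated linearly
# over one evaluation period (which includes a single system update).
# All names and the linear aggregation are assumptions for illustration.
@dataclass
class PeriodCosts:
    sla_violation: float   # 1) SLA violation cost
    power: float           # 2) power consumption cost
    hardware: float        # 3) hardware procurement and replacement cost
    manpower: float        # 4) maintenance manpower cost
    system_update: float   # 5) cost of the (single) system update

    def total(self) -> float:
        return (self.sla_violation + self.power + self.hardware
                + self.manpower + self.system_update)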


TABLE II
INPUT PARAMETERS FOR THE VMs MODEL

Parameter   Description                                            Value
1/λ_h       Mean time to host failure                              2654 hr
1/λ_v       Mean time to VM failure                                2893 hr
1/λ_a       Mean time to application failure                       175 hr
1/δ_h       Mean time for host failure detection                   30 sec
1/δ_v       Mean time for VM failure detection                     30 sec
1/δ_a       Mean time for application failure detection            30 sec
1/m_v       Mean time to migrate a VM                              330 sec
1/r_v       Mean time to restart a VM                              50 sec
1/μ_h       Mean time to repair a host                             100 min
1/μ_v       Mean time to repair a VM                               30 min
1/μ_1a      Mean time for application first repair (covered)      1 min
1/μ_2a      Mean time for application second repair (uncovered)   20 min
c_v         Coverage factor for VM repair                          0.95
c_a         Coverage factor for application repair                 0.8
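Note that the mean times in Table II mix hours, minutes, and seconds. As a minimal sketch (choosing per-hour rates as the common unit is our assumption; the paper does not state its internal unit), the parameters can be normalized before building the CTMC generator matrix:

# Sketch: transcription of Table II with all mean times converted to
# hours, and the corresponding transition rates (per hour). Treating
# "per hour" as the common unit is an assumption for illustration.
MEAN_TIMES_HR = {
    "lambda_h": 2654.0,       # host failure
    "lambda_v": 2893.0,       # VM failure
    "lambda_a": 175.0,        # application failure
    "delta_h": 30.0 / 3600,   # host failure detection (30 sec)
    "delta_v": 30.0 / 3600,   # VM failure detection (30 sec)
    "delta_a": 30.0 / 3600,   # application failure detection (30 sec)
    "m_v": 330.0 / 3600,      # VM live migration (330 sec)
    "r_v": 50.0 / 3600,       # VM restart (50 sec)
    "mu_h": 100.0 / 60,       # host repair (100 min)
    "mu_v": 30.0 / 60,        # VM repair (30 min)
    "mu_1a": 1.0 / 60,        # application first repair (1 min)
    "mu_2a": 20.0 / 60,       # application second repair (20 min)
}
RATES_PER_HR = {name: 1.0 / t for name, t in MEAN_TIMES_HR.items()}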

The definitions of the cost functions are given next.

1) SLA violation cost: This cost is related to violations of SLA contract clauses. The service provider and the customer define in the SLA the availability level of the services to be provided [1]. The service level may be formally specified through the reward rates r_SLAi, assigned to the states of the system according to the number of available application instances:

r_SLAi = 0        (if i ∈ S2)
r_SLAi = c_S1     (if i ∈ S1)
r_SLAi = c_S0     (if i ∈ S0)

where S2 is the set of CTMC states in which both application instances are available, S1 is the set of CTMC states with one available application instance, and S0 is the set of CTMC states without any available VM for the application (i.e., the system down states). Table III shows the states that constitute each of these sets. In the definition of the reward rates r_SLAi, the parameters c_S1, c_S0 ∈