A simple suboptimal algorithm for system maintenance under partial observability




Annals of Operations Research 91(1999)25–40


Israel David, Lea Friedman and Zilla Sinuany-Stern
Department of Industrial Engineering and Management, P.O. Box 653, Ben-Gurion University, Beer-Sheva, Israel
E-mail: [email protected]

We suggest a heuristic solution procedure for partially observable Markov decision processes with finite action space, finite state space and an infinite horizon. The algorithm is a fast, very simple general heuristic; it is applicable to multiple (not necessarily ordered) states, multiple actions and various distribution functions. The quality of the algorithm is checked in this paper against existing analytical and empirical results for two specific models of machine replacement. One model refers to the case of two actions and two system states with uniform observations (Grosfeld-Nir [4]), and the other refers to a case of many ordered states with binomial observations (Sinuany-Stern et al. [11]). The paper also presents the model realization for various probability distribution functions applied to maintenance and quality control.

Keywords: partially observed Markov decision process, algorithms, suboptimal design, dynamic programming, Bayesian programming, reliability/maintenance, machine replacement

1. Introduction

A Partially Observable Markov Decision Process (POMDP) is a sequential decision-making process which pertains to the proper operation of an operating system whose physical state cannot be determined with certainty. Even when the relevant state space of the system is finite, the POMDP problem entails an infinite state space whose points are the state probability vectors. The state vector is updated at the beginning of each period according to the observations made on the system and the decisions taken thus far. In each period, the decision maker has to choose one of a finite set of alternative actions (e.g. of maintenance and replacement) so as to optimize a certain overall reward functional.

Fully observable Markov processes are extensively investigated in the operations research literature and comprise one of the major generalizations in the field. In practice there are, however, more partially observed Markov systems than fully observed ones. Often, due to data inaccuracy and lagged information, a manager is not


sure about the exact state his machine is really in; for example, he may know the rate of defectives (or breakdowns), but not the exact state of the system. There is thus more potential in POMDPs than in fully observed Markov decision processes. Monahan [9] advocates a wide range of practical applications, including quality control, maintenance, scheduling, military encounters, data communications, health care, education, artificial intelligence and expert systems, natural resource economics, and finance.

Focusing on production systems, many references to the literature can be made; see White [13] and Lovejoy [8]. Further interesting real-life production applications may be found in [17,18]. The work [8] deals, in particular, with a machine replacement example with three states and four possible actions. The three states are (1) two parts working, (2) one part working, and (3) both components failed. The four actions are to manufacture without examining the final product (M), to manufacture with a final product examination (E), to interrupt the line, inspect both components and replace the failed ones (I), or to interrupt the line and replace both components without inspecting them (R). Kim and Ho [5] give an example of a quality assurance problem for supplies; there are three quality states (GOOD, MEDIUM and BAD) and three possible actions (doing nothing, warning, and claim). Lane [6] gives an intriguing application for the fishing industry: fishing vessels must decide in which zone to fish in each period so as to catch the most fish with the highest return. There are three states of catch level: LOW, MEDIUM and HIGH. The level of fishing in each zone is unknown in advance, but information on the actual CATCH of the previous period is given, and times between successive encounters of salmon on the lures of troll lines are assumed exponentially distributed with known parameter λ_i, dependent on the state of each zone (additional probability distributions were considered as well). Sinuany-Stern [10] gives an example regarding car replacement policies for a car rental company: a car check-up is costly and often misleading, yet the car history of monthly breakdowns is available and reasonably accurate. Grosfeld-Nir [4] deals with a production system with two states, GOOD and BAD, and two actions, CONTINUE (CON) or REPLACE (REP). The quality of the output observed is uniformly distributed, with different domain intervals for each state. The work [4] seems to be the only existing analytical solution of an infinite horizon nontrivial POMDP. Sinuany-Stern et al. [11] offer an efficient algorithm for a machine which deteriorates down n ordered states; there are again two actions, CON or REP. The number of defective items is observed from a sample of products at each stage. The model is thus limited to two actions, ordered states and binomial observations.

POMDPs were thoroughly reviewed, for theory and for state-of-the-art solution algorithms, by Monahan [9], Lovejoy [7] and White [14]. Bounds on the value function have also been developed, either a priori bounds (see e.g. [12] and [16]) or bounds that evolve through the approximate solution technique (e.g. [8]). Yet, until now there has been only limited technical experience applying general solution schemes for POMDPs. Apparently, the reason for this is the resultant infinite state space and the


far greater complexity issues that make the partially observed problem much more difficult to analyze and solve than its fully observable counterpart. Present-day techniques for finding precise optimal solutions are confined to small-scale problems with a finite planning horizon. Although the tractability of Lovejoy's algorithm [8] is insensitive to the problem horizon, and recent developments by White and Scherer [16] produced a design which is relatively insensitive to the cardinality of the state space, it is still hard to solve truly large problems computationally (for an exact definition of the "size" of a problem, see [7]). All in all, while proven properties of the optimal solution are of considerable theoretical and practical value, POMDPs are short of analytical solutions and of numerical tractability of the generic algorithms.

The objective of the present research is to develop a heuristic but reasonably efficient algorithm for easily solving large POMDP problems of sufficient generality: we deal with more than two (but finitely many) actions, a general distribution of the signal (the observation structure), and unordered (finitely many) system states. We check the efficiency of the proposed algorithm partly by comparing it to existing algorithms in the literature via extensive simulation. As stated in [7], "only numerical experience will develop a conventional wisdom as to what extent heuristics are effective". Specifically, we further investigate the aforementioned 2-state 2-action replacement model [4] and use the results to present the exact efficiency ratios of our proposed approximation for various combinations of parameters. The second reference method is the special-purpose multistate efficient algorithm [11], which deals with binomial observations. The comparison to the resultant new procedure for this case is conducted via a simulation experimental design.

Section 2 precisely states the infinite horizon discounted model. Several examples from the field of maintenance/quality control that exemplify the model are grouped in appendix A. Section 3 specifies the proposed heuristic and demonstrates its use: our algorithm is based on a single execution of Howard's policy iteration procedure. We then discuss, in section 4, the quality of the algorithm relative to the two reference test cases along the lines discussed above; appendices B and C provide the background information needed for the reference models [4] and [11]. Summary and conclusions appear in section 5.

2. The model

An operating system can be, at each period t, in one of n states (j = 1,…, n). State transitions occur according to a Markov chain. The decision maker is unable to observe the state of the system directly and must make his/her decisions sequentially, based upon partial information gathered as the system evolves through time. At each period, the decision maker has to take an action to maintain/replace the system so as to possibly improve its reliability, in order to maximize the expected infinite compounded reward. The states need not necessarily be ordered: they can represent, for example,


various combinations of malfunctions of subsystems. Similarly, the action (decision) space D may be comprised of any finite set of possible operations, e.g. to repair, maintain or replace any combination of parts of the system, or the whole of it. We thus refer to POMDPs in discrete time, t = 0, 1, 2,…, with an infinite planning horizon, a system state space X, an action space D = {1,…, K} and an observation space Y. All these spaces but Y are assumed finite. The interplay of the following unravels the process and affects its gain:

(a) D(t) ∈ D, the decision taken at time t, t = 0, 1, 2,… .

(b) An observation on the system Y(t), and invariant conditional probabilities r_yj = Pr{Y(t) = y | X(t) = j}, which are assumed known. Y | j may also be assumed to be a continuous random variable, with a known conditional density f(y | j).

(c) Given transition probability matrices

    P_ij^k = Pr{X(t) = j | X(t − 1) = i, D(t − 1) = k}.    (1)

(d) Given real values q(i, k) which signify the expected immediate reward incurred in taking action k while being in system state i.

(e) The vector of state probabilities π(t) = (π_1(t), π_2(t),…, π_n(t)), where

    π_i(t) = Pr{X(t) = i | Y(t), D(t − 1), Y(t − 1),…, D(1), Y(1), D(0)}.    (2)

The probabilities π_i(t) are updated and reassigned to the system states by the decision maker at every step t, given the history of the process. However, the update of π_i(t) is Bayesian, so that the detailed history is not used directly. Specifically,

    π_i(t) = r_yi π(t − 1)P_i^{D(t−1)} / ∑_{l=1}^n r_yl π(t − 1)P_l^{D(t−1)}    (3a)

in the discrete case, or

    π_i(t) = f(y | i) π(t − 1)P_i^{D(t−1)} / ∑_{l=1}^n f(y | l) π(t − 1)P_l^{D(t−1)}    (3b)

when Y | i are all continuous. Equations (3a)–(3b) hold whenever the division is applicable, where P_l is the lth column of P and matrix multiplication is implied where appropriate. The fact that the information from past decisions and state vectors is embodied in the current state vector is typical of POMDPs [2,13]. (The term "state" thus refers interchangeably to a state of the system and to the current probability distribution vector over the various states of the system; the distinction between these two uses will have to be made according to context.)

To clarify, the order of things in the decision process is the following. At time t, the system is in an unknown state X(t) = i. Based on π(t − 1), the matrices P and the


vectors q(·, D(t − 1)), a decision is taken, D(t) = k, k = 1,…, K. Instantaneously, the system switches to an (unknown) state j according to P_ij^k, an immediate average reward or cost q(i, k) is gained, and the observation process provides a signal Y(t) = y. The decision maker then updates the state probabilities by (3). The process repeats itself indefinitely.

Focusing on the total discounted reward criterion, we wish to optimize the decisions so as to maximize

    E[ ∑_{t=0}^∞ β^t q(X(t), D(t)) ]    (4)

for all initial probability vectors π(0). (It is assumed that the process is initiated with a known probability distribution over the system states.) The dependence on π(0) has been dropped from the notation in (4), and 0 < β < 1 is a discount factor. Appendix A considers a number of applications in the context of maintenance/reliability and makes the present model explicit for several signal distributions and reward structures.
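To make the update (3a) concrete, here is a minimal sketch in Python; the state space, signal matrix and transition matrix below are illustrative placeholders, not data from the paper:

```python
import numpy as np

def update_belief(belief, P_k, r, y):
    """Bayesian belief update of equation (3a).

    belief : pi(t-1), probability vector over the n system states
    P_k    : n x n transition matrix of the action taken at t-1
    r      : matrix with r[y, j] = Pr{Y(t) = y | X(t) = j}
    y      : the signal observed at time t
    """
    predicted = belief @ P_k              # pi(t-1) P, the prior over X(t)
    unnormalized = r[y, :] * predicted    # r_{yi} * [pi(t-1) P]_i
    return unnormalized / unnormalized.sum()

# Two states, two signals, one action (illustrative numbers only).
P_con = np.array([[1.0, 0.0],
                  [0.1, 0.9]])
r = np.array([[0.8, 0.3],                 # Pr{Y = 0 | state 1}, Pr{Y = 0 | state 2}
              [0.2, 0.7]])                # Pr{Y = 1 | state 1}, Pr{Y = 1 | state 2}
print(update_belief(np.array([0.5, 0.5]), P_con, r, y=1))   # -> [0.258..., 0.741...]
```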

3. The proposed heuristic

Optimal decisions for the above-stated POMDPs are known to be stationary (dependent on the current state probability vector, but not on time). The heuristic rule that we suggest is stationary, it is extremely simple and it runs as follows:

(a) Solve the fully observable problem by Howard's policy iteration and value determination routine. That is, find d*(i) for all 1 ≤ i ≤ n, where d*(i) stands for Howard's optimal decision assuming state i is fully observed, d(i) ∈ D. As an output of the algorithm, one gets the values V_i, the expected total discounted reward starting in state i, i ∈ X. Notably,

    V_i = max_k { q(i, k) + β ∑_{j=1}^n P_ij^k V_j }.    (5)

(b) For the current state vector π(t), calculate an approximate reward functional Ṽ(π(t)):

    Ṽ(π(t)) = max { ∑_{i=1}^n π_i(t) [ q(i, k) + β ∑_{j=1}^n P_ij^k V_j ] | k = 1,…, K }    (6)

and set D(t) = k̃ = arg max Ṽ(π(t)).

This algorithm is at the same time widely applicable and very simple. Note that it requires only a single run of Howard's routine. Underlying this procedure is a pretense that future states would be completely revealed and acted upon optimally.


Equally, for each decision, present unknown states are weighted by the probability mass vector of current belief. For the benefit of simplicity and generality, our framework does not include any indication of observation quality, which is a drawback, since the algorithm produces the same rules for all observation structures given π(t). Still, it can be shown by examples, and it is conjectured to hold true generally, that the better the quality of observation, the better the quality of the proposed algorithm. (For more on the notion of utilizing the quality of data in the context of production/replacement, see [12].)

3.1. The case of uniform observations, a two-state two-action model

Assume an observation structure and a reward structure for the uniform observation two-state two-action machine replacement problem as described in example A4 (appendix A), and set s = r. For this case, analysis of equations (5)–(6) of the new proposed heuristic leads to an explicit decision rule. We have

Theorem 1. The new approximate algorithm calls for a control limit policy which continues production iff π ≥ π̃, where π is the probability of the GOOD state and 0 ≤ π̃ ≤ 1 is a threshold value. π̃ is given explicitly by

    π̃ = 1 − κ/(r + κβr),    (7)

where κ ≡ K/(m2 − m1).
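For instance, with illustrative values β = 0.9, r = 0.9, m1 = 1, m2 = 2 and K = 2, we get κ = 2 ≤ r/(1 − βr) ≈ 4.74 (so replacement is not ruled out), and (7) gives π̃ = 1 − 2/(0.9 + 2·0.9·0.9) = 1 − 2/2.52 ≈ 0.21: the heuristic continues production as long as the current probability of the GOOD state is at least about 0.21.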

Note that the threshold depends on the monetary parameters of the model only through the ratio of the investment K to the marginal revenue (m2 − m1); cf. [4].

Proof. Letting V_i = max{ q(i, k) + β ∑_{j=1}^2 P_ij^k V_j | k = 0 (CON), 1 (REP) } be the Howard values, and plugging in equations (A.1) and (A.2) of appendix A, we get

    V1 = max{ m1 + βV1, −K + c + β((1 − r)V1 + rV2) },    (8a)

    V2 = max{ c + β((1 − r)V1 + rV2), −K + c + β((1 − r)V1 + rV2) } = c + β((1 − r)V1 + rV2),    (8b)

where c = m1 + r(m2 − m1). V1 and V2 are easily solvable for given numerical values of the model parameters. The scalar π defines the current vector state, so that

    D̃(π) = arg max{ (1 − π, π)[q(·, k) + βP^k (V1, V2)′] | k = 0, 1 }    (9)

(see (6)). By the linearity of the comparison in (9), it is obvious that this yields a control limit value in π. Specifically, using the appropriate substitutions, (9) calls for CON iff


    (1 − π)(m1 + βV1) + πV2 ≥ −K + V2,

or

    π ≥ 1 − K/(V2 − (m1 + βV1)) ≡ π̃.    (10)

Using (8a)–(8b), we have

    V2 = (c + β(1 − r)V1)/(1 − βr)    (11)

and

    V1 = max{ m1 + βV1, −K + V2 }.    (12)

Again using (11), V1 = max{ m1 + βV1, −K + (c + β(1 − r)V1)/(1 − βr) }. It now takes only simple algebra to see that the rightmost expression may be substituted for V1 only if κ ≤ r/(1 − βr) (otherwise the machine is too expensive to ever replace), and that in this case, substituting V2 − (m1 + βV1) into (10) gives the result (7). □

Note that the rule (9) is independent of the model parameter p, the length of the intersection between the observation intervals, which supplies a clear indication in this case of the quality of data. A smaller p implies an a priori better distinction between the GOOD and BAD states. This is reflected in theorem 2 below, which may be proven with the help of (A.1) and (A.2) of example A4 in appendix A, and of (B.1) and (B.2) of appendix B. Here, the x's are the π-values multiplied by r, and V_x̄ is the pertinent value function for using an arbitrary control limit x̄ (see appendix B).

Theorem 2. For all x̄ and x, V_x̄(x) is decreasing in p.

Numerical evidence (see subsection 4.1 below) shows that V_x̄(x)/V*(x) (the efficiency ratio) is also decreasing in p for all x̄ and x, but x̃ of (7) still provides excellent approximations to the optimal control value x*. Furthermore, we have

Theorem 3. For p = 0, x̃ ≡ rπ̃ = x*.

The result follows easily by (7) and by theorem B1 of appendix B, using the condition κ ≤ r/(1 − βr) (see the proof of theorem 1).

3.2. Binomial observations, a two-action machine replacement problem

For the special case of binomial observations with two actions and n states, we illustrate here the straightforward implementation of our heuristic according to equations (5)–(6). Assume an observation and reward structure as in example A1 of appendix A, with two actions (e.g. continue/replace machine). Suppose there are n = 10 system states characterized by

    α1 = 0,  α2 = 0.20,  α3 = 0.30,  α4 = 0.40,  α5 = 0.50,
    α6 = 0.60,  α7 = 0.70,  α8 = 0.80,  α9 = 0.90,  α10 = 0.95.

Let

    {P_1j^0}_{j=1}^{7} = 0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.05,
    {P_2j^0}_{j=2}^{7} = 0.6, 0.2, 0.05, 0.05, 0.05, 0.05,
    {P_3j^0}_{j=3}^{7} = 0.6, 0.2, 0.1, 0.05, 0.05,
    {P_4j^0}_{j=4}^{8} = 0.7, 0.1, 0.1, 0.05, 0.05,
    {P_5j^0}_{j=5}^{9} = 0.7, 0.1, 0.1, 0.05, 0.05,
    {P_6j^0}_{j=6}^{9} = 0.7, 0.2, 0.05, 0.05,
    {P_7j^0}_{j=7}^{9} = 0.8, 0.1, 0.1,
    {P_8j^0}_{j=8}^{10} = 0.8, 0.1, 0.1,
    {P_9j^0}_{j=9}^{10} = 0.9, 0.1,
    {P_10j^0}_{j=10}^{10} = 1,
    P_i1^1 = 1 for all i,

and all other entries of these transition matrices be set to zero. Let the cost parameters be c1_{i1} = 2, c1_{i0} = 0, c2 = 0.05 and c3 = 0.01, and the discount factor β = 0.9. The resultant expected immediate reward vectors are

    q(0) = [0.392, 0.326, 0.275, 0.221, 0.161, 0.113, 0.062, 0.005, −0.043, −0.070],
    q(1) = [−1.5, −1.5, −1.5, −1.5, −1.5, −1.5, −1.5, −1.5, −1.5, −1.5].

Using Howard's algorithm, V1 = 1.454, V2 = 1.296, V3 = 1.057, V4 = 0.804, V5 = 0.51, V6 = 0.331, V7 = 0.127, V8 = −0.105, V9 = −0.191, V10 = −0.191. For π(t) = (0.4, 0.2, 0.2, 0.1, 0.1, 0, 0, 0, 0, 0), according to the new algorithm Ṽ(π) = max{0.578, −0.191} = 0.578, and thus the decision is to continue (not to replace, k̃ = 0). The same decision comes out of the index heuristic previously proposed in [11], cf. appendix C, where α̃ = ∑_{i=1}^{10} π_i α_i = 0.19 and α_c = 0.85, the midpoint between α8 = 0.8 and α9 = 0.9. (In the fully observed solution, the system is not replaced in states 1 through 8, while in states 9 and 10 the system is replaced.) If, however, π(t) = (0, 0, 0, 0, 0, 0, 0.05, 0.3, 0.3, 0.35), the new heuristic again implies continuing, since Ṽ(π) = max{−0.182, −0.191} = −0.182, while the index algorithm recommends the opposite, since α̃(π) = 0.88 > α_c.
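A compact sketch of the whole procedure of this section — one run of Howard's policy iteration for the fully observed problem (step (a)), followed by the weighting rule (6) (step (b)) — is given below; the two-state data at the bottom are illustrative placeholders, not the ten-state example above.

```python
import numpy as np

def policy_iteration(P, q, beta):
    """Howard's policy iteration for the fully observed MDP.

    P[k] is the n x n transition matrix of action k and q[i, k] the
    expected immediate reward of action k in state i.  Returns the
    values V_i of equation (5) and the optimal decisions d*(i).
    """
    n, K = q.shape
    d = np.zeros(n, dtype=int)
    while True:
        # Value determination for the current policy d.
        P_d = np.array([P[d[i]][i] for i in range(n)])
        q_d = q[np.arange(n), d]
        V = np.linalg.solve(np.eye(n) - beta * P_d, q_d)
        # Policy improvement.
        Q = np.array([q[:, k] + beta * P[k] @ V for k in range(K)]).T
        d_new = Q.argmax(axis=1)
        if np.array_equal(d_new, d):
            return V, d
        d = d_new

def heuristic_decision(belief, P, q, V, beta):
    """Step (b): evaluate (6) for every action and return the arg max."""
    scores = [belief @ (q[:, k] + beta * P[k] @ V) for k in range(q.shape[1])]
    return int(np.argmax(scores)), float(max(scores))

# Illustrative two-state, two-action data (action 0 = CON, action 1 = REP).
P = [np.array([[1.0, 0.0], [0.1, 0.9]]),     # continue
     np.array([[0.1, 0.9], [0.1, 0.9]])]     # replace
q = np.array([[1.0, -1.0],                   # q(i, k): rows are states
              [2.0, -1.0]])
V, d_star = policy_iteration(P, q, beta=0.9)
print(heuristic_decision(np.array([0.3, 0.7]), P, q, V, beta=0.9))
```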

4. Evaluating the quality of the algorithm

The present procedure suggests obvious upper bounds for the optimal value function over the belief space. This fact is rather intuitive from the interpretation given above, and formally follows from the results in [15]. The question of lower bounds, however, remains open. Our own findings for the quality of the algorithm in these special cases are summarized below.

4.1. Efficiency ratios for the uniform observation model

Theorems 1, B1 and B2 of appendix B allow one to determine analytically the efficiency ratio of the present heuristic, relative to the optimal solution, for any combination of parameter values of the uniform observation model. In the presentation below, we follow the notation used in appendix B. The data presented in table 1 refer to β = 0.9 exclusively, to save space. Furthermore, we scale the monetary values so that m1 = 1 (the average revenue from the machine in the BAD state).

Table 1
Worst-case efficiency ratios for the uniform data replacement model.a)
(Columns: r, m2, K, π_min, x̃, x*, K_max, U*, RV, RU.b))
a) The discount factor is β = 0.9 all over.
b) Where π_min is not specified, the ratios prevail for all p and π, and the x* and U* are representative values.

For various combinations of the three parameters r (the passage probability to a GOOD state), m2 (the average revenue from the machine in a GOOD state) and K (the replacement cost), we let π = 0, 0.2, 0.5, 0.8, 1.0 and p = 0, 0.3, 0.5, 0.7, 1.0, and computed the efficiency ratios RV ≡ V_x̃(x)/V*(x) and RU ≡ U_x̃(x)/U*(x), where x = πr. In the table, these ratios are presented, along with some


values of interest, only for p = 1 (the worst case for p) and for the value of π (denoted π_min) which gives the lowest ratios for the given r, m2 and K. Ten representative examples of combinations of r, m2 and K are shown in the table. The value K_max which appears in the table refers to the value of K which gives κ = r/(1 − βr). If K > K_max, replacement never occurs, under both the heuristic and the optimal rule. For the sake of illustration and comparison, a few results concerning the extreme control policies x̄ = 0 (always continue production) and x̄ = 0.9 (frequent replacement) are appended to the table. These results appear in boldface and are located between the results for the policies with the respective r, m2 and K. (Obviously, x̃ is not specified for these cases.)

We may note that RV may be given the interpretation of the ratio between the improvements of the heuristic and of the optimal rule, respectively, over the gain of the trivial policy that never replaces, when in the BAD state. Since this gain is a lower bound for the optimal gain, it also follows that the denominator of RV is positive (while the numerator may be negative, see e.g. table 1). All in all, it seems difficult to find plausible combinations under which the efficiency ratio RV drops below 0.9. This is partly due to the assumption s = r (a new machine deteriorates as fast as an old one), which is pertinent to the analysis here and in [4]. Still, the clear outcome of the numerical study is that the new heuristic performs very well ad hoc.

4.2. Statistical comparison to the index heuristic for binomial observations

The new heuristic has been tested statistically by simulation runs, compared to the index heuristic of [11] (see appendix C). This latter heuristic was originally tailored for the specific two-action model outlined in example A1 below, where a machine producing batches of N items is imagined, and in state i the number of defective items is binomially distributed with parameter α_i. The states are thus partially observed via the number of defectives. For i = 1,…, n − 1, α_i < α_{i+1}, so that the states are ordered and the machine deteriorates probabilistically. As in [4], two actions, CON and REP, are possible, and the objective is to maximize the net profit. The index heuristic [11] was empirically shown to be more efficient than, and as accurate as, Lovejoy's general algorithm for this model.

In order to compare the two algorithms, we performed extensive simulations of the above model. In the experimental design, similar to Sinuany-Stern et al. [11], we considered production systems with n = 3, 5, 10 states, β = 0.9, 0.95, and 27 combinations of P_ij^0 and the vector α (see [11]). There were 162 cases, and each case was sampled in m simulation runs of an appropriate time length. m varied between 30 and 100 according to standard t-test considerations, so, all in all, over 12,000 runs were conducted. Table 2 summarizes the results pertinent to the comparison between the two algorithms.


Table 2
Summary of simulation runs for the binomial case.

Type of comparison                           Number of states    β = 0.9    β = 0.95    Total
Percentage of cases where qs1 ≫ qs2          n = 3               4%         4%          4%
                                             n = 5               0%         0%          0%
                                             n = 10              0%         4%          2%
                                             Total               1%         3%          2%
Percentage of cases where qs1 ≪ qs2          n = 3               7%         11%         9%
                                             n = 5               11%        4%          8%
                                             n = 10              11%        15%         13%
                                             Total               10%        10%         10%
Average A = (qs1 − qs2)/qs1, in percent      n = 3               −4%        −3%         −3.5%
                                             n = 5               2%         −3%         −0.5%
                                             n = 10              −10%       −15%        −12.5%
                                             Total               −4%        −7%         −5.5%

Letting qs1 and qs2 be the average long-run reward of the index heuristic and of the new heuristic, respectively, we used an appropriate t-test for testing H0: E[qs1] = E[qs2] against the alternative E[qs1] ≠ E[qs2] at the 0.10 significance level. At this level, both algorithms performed equally in 88% of the cases. Only in 2% of the cases did the index heuristic come out better (see the upper part of table 2). In the remaining 10% of the cases, the new algorithm was significantly better (see the middle part of the table). Overall, statistically, the new heuristic is not worse. Additionally, we compared the average relative value A = (qs1 − qs2)/qs1 of the two algorithms for the various cases, which indicated that, overall, the new algorithm is better. More details are given in table 2. In the table, the entry denoted by qs1 ≫ qs2 relates to the upper tail of the rejection area of the above t-test, while qs1 ≪ qs2 relates to the lower tail.
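A schematic version of this comparison — sampling trajectories of the hidden process, running each heuristic on the evolving belief, and applying a paired t-test to the resulting rewards — could be organized as follows; discounted reward over a long finite horizon is used here as a stand-in for the long-run reward recorded in [11], and all names and run lengths are assumptions of ours:

```python
import numpy as np

def simulate_reward(policy, P, q, r_obs, beta, pi0, horizon, rng):
    """Discounted reward of one sampled trajectory under a belief-based policy."""
    n = len(pi0)
    state = rng.choice(n, p=pi0)
    belief = np.array(pi0, dtype=float)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        k = policy(belief)
        total += discount * q[state, k]
        state = rng.choice(n, p=P[k][state])                # hidden transition
        y = rng.choice(r_obs.shape[0], p=r_obs[:, state])   # observed signal
        belief = r_obs[y, :] * (belief @ P[k])              # update (3a)
        belief /= belief.sum()
        discount *= beta
    return total

def compare(policy_a, policy_b, P, q, r_obs, beta, pi0, runs=50, horizon=200, seed=0):
    """Paired runs of the two policies; returns the mean difference and a paired t statistic."""
    diffs = []
    for m in range(runs):
        a = simulate_reward(policy_a, P, q, r_obs, beta, pi0, horizon,
                            np.random.default_rng(seed + m))
        b = simulate_reward(policy_b, P, q, r_obs, beta, pi0, horizon,
                            np.random.default_rng(seed + m))   # common random numbers
        diffs.append(a - b)
    diffs = np.array(diffs)
    t_stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(runs))
    return diffs.mean(), t_stat
```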

5. Summary and conclusions

In this paper, we suggested a heuristic algorithm for the machine replacement or maintenance problem which is capable of handling the general POMDP model with a finite number of states and decisions. The problem is discounted and has an infinite horizon. The proposed algorithm is not restricted to the case of ordered states (cf. [11]) and it applies to any probability distribution of the signal data. The algorithm was checked in this work against two special-case machine-maintenance solution schemes that have been evaluated in the literature [4,11], and was found very efficient, at least as simple as the existing schemes for their respective applications, and of course substantially more general.


Acknowledgements

Both referees of this paper gave instructive criticism and helpful comments. We are indebted to one of the referees particularly for pointing out the potential of the series of works by C.C. White and his colleagues for evaluating bounds for the present design, and to the second referee for suggesting the analytical test case of Grosfeld-Nir. The work has been partly sponsored by the Paul Ivanier Center for Robotics and Production Management at Ben-Gurion University.

Appendix A: Model realization for various distributions

Example A1: Binomial distribution

Imagine a machine which produces batches of size M at each stage. The machine deteriorates down states i = 1,…, n, characterized by α_i, the proportion of good items in state i. At each stage, a sample of N items is taken and the number of good items Y(t) = y is recorded. In this case,

    r_yi = (N choose y) α_i^y (1 − α_i)^{N−y},   y = 0, 1,…, N,

and

    π_i(t) = α_i^y (1 − α_i)^{N−y} ∑_{j=1}^n P_ji^{D(t−1)} π_j(t−1) / [ ∑_{l=1}^n α_l^y (1 − α_l)^{N−y} ∑_{j=1}^n P_jl^{D(t−1)} π_j(t−1) ].

K levels of system maintenance can be conceived of, ranging from no maintenance to complete replacement, with c1_{ik} denoting the cost of maintenance level k when the system is in state i. Additional parameters are the cost c2 due to a defective item and the net profit c3 from a good item. In this example,

    q(i, k) = ∑_{j=1}^n P_ij^k [c3 M α_j − c2 M (1 − α_j)] − c1_{ik}.

(Cf. [11] for the background to this model in the literature and for the treatment of the case K = 2, the actions being CON or REP.)

Example A2: Poisson distribution

Assume a machine or a device which operates over time and undergoes failures and replacements. In a given time interval (checkup period), one counts the number of failures Y(t) = y. Here,

    r_yi = λ_i^y e^{−λ_i} / y!,   y = 0, 1, 2,…,


where λ_i is the rate of failures in state i multiplied by the checkup time. This Poisson assumption is widely used in quality control, cf. [3, p. 462]. We have

    π_i(t) = λ_i^y e^{−λ_i} ∑_{j=1}^n P_ji^{D(t−1)} π_j(t−1) / [ ∑_{l=1}^n λ_l^y e^{−λ_l} ∑_{j=1}^n P_jl^{D(t−1)} π_j(t−1) ].

The cost parameters are the cost c1_{ik} of maintenance level k and the cost c2 of each failure. Consequently,

    q(i, k) = − ∑_{j=1}^n P_ij^k c2 λ_j − c1_{ik}.
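To illustrate how the primitives of examples A1 and A2 translate into the quantities the algorithm of section 3 works with, here is a small sketch; all numerical values and helper names are assumed for illustration only:

```python
import numpy as np
from scipy.stats import binom, poisson

def binomial_r(alpha, N):
    """Example A1: r[y, i] = Pr{Y = y | state i} for a sample of size N."""
    return np.array([[binom.pmf(y, N, a) for a in alpha] for y in range(N + 1)])

def poisson_r(lam, y_max):
    """Example A2: r[y, i] = Pr{Y = y | state i}, truncated at y_max failures."""
    return np.array([[poisson.pmf(y, l) for l in lam] for y in range(y_max + 1)])

def q_binomial(P, alpha, M, c1, c2, c3):
    """q(i, k) of example A1: expected batch profit minus the maintenance cost c1[i, k]."""
    batch_profit = c3 * M * alpha - c2 * M * (1 - alpha)   # vector indexed by the new state j
    return np.array([[P[k][i] @ batch_profit - c1[i, k]
                      for k in range(len(P))] for i in range(len(alpha))])

# Illustrative three-state data.
alpha = np.array([0.9, 0.7, 0.4])            # proportion of good items per state
P = [np.array([[0.8, 0.2, 0.0],
               [0.0, 0.7, 0.3],
               [0.0, 0.0, 1.0]]),            # k = 0: continue
     np.array([[1.0, 0.0, 0.0]] * 3)]        # k = 1: replace (back to state 1)
c1 = np.array([[0.0, 2.0]] * 3)              # c1[i, k], maintenance/replacement costs
print(q_binomial(P, alpha, M=100, c1=c1, c2=0.05, c3=0.01))
print(binomial_r(alpha, N=5)[2])             # Pr{2 good items in a sample of 5 | each state}
```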

Example A3: Categoric, discretized or tabulated cases

Assume that the r_yi are simply taken from a given table, y = 1,…, N. If u_y represents the net reward from signal y, then

    q(i, k) = ∑_{j=1}^n ∑_{y=1}^N u_y r_yj P_ij^k − c1_{ik}.

This setting also covers the case of a categoric variable. The continuous case is easily incorporated too, by assigning categories to appropriate partitions of the continuous outcomes. In this manner, the signal may stand for certain revealed attributes of the product, such as weight, pressure, size, humidity or temperature, varying over a continuous range. The probabilities r_yi for each state i are easily calculable and may serve, in turn, to update π_i(t) according to (3a). Letting µ_j be the respective conditional expectations, we write

    q(i, k) = ∑_{j=1}^n P_ij^k µ_j c2 − c1_{ik},

where c2 is the unit attribute net reward.

Example A4: Uniform continuous distribution

As in [4], assume the machine is in either a BAD (1) or a GOOD (2) state. m1 and m2 are the average revenues in the BAD and GOOD states, respectively, with m2 > m1. Actions are CON (do nothing) or REP (renew the machine at cost K > 0). Let

    P_CON = | 1       0 |          P_REP = | 1 − s   s |
            | 1 − r   r |,                 | 1 − s   s |.

Let c(x) = (1 − x)m1 + x m2 = m1 + x(m2 − m1). Here,


    q(1, CON) = m1,        q(1, REP) = −K + c(s),
    q(2, CON) = c(r),      q(2, REP) = −K + c(s).    (A.1)

We assume a single state-dependent signal at each stage, with f(y | 1) = I_[0,1](y), f(y | 2) = I_[1−p, 2−p](y), where the I's are indicator functions. Thus, both signals are uniform on intervals of unit length, with an intersection interval of length p. The GOOD outcome is stochastically larger than the BAD one. If we write π (in short) for π_2(t − 1) (the probability of the GOOD state), and the observation is y, 0 ≤ y ≤ 2 − p, equation (3b) gives

    π_2(t) = 0    for 0 ≤ y ≤ 1 − p,
    π_2(t) = πr   for 1 − p ≤ y ≤ 1 and D(t − 1) = CON,
    π_2(t) = s    for 1 − p ≤ y ≤ 1 and D(t − 1) = REP,    (A.2)
    π_2(t) = 1    for 1 < y.

Appendix B: Grosfeld-Nir analytical two-state model with uniform observations

For the two-state machine replacement case of example A4, it is intuitive, and also proven correct [1,13], that the optimal policy is of a control-limit type. Thus, the optimal decision is CON iff π, the probability of the GOOD state, exceeds a certain threshold. Complying with [4], we let x = πr (the "information state") and let x* be the optimal threshold for x. We set s = r (assuming that a new machine deteriorates probabilistically the same as an old one). Defining κ = K/(m2 − m1), we have

Theorem B1 (Grosfeld-Nir). The optimal threshold x* is computed by:

Step 0. Calculate C := κ/[1 + β(1 − p)κ]; set n := 1, A := 1, B := 1.
Step 1. Repeat:
    S := (rA − C)/B;
    if r^{n+1} ≤ S < r^n then x* := S and stop;
    else A := A + (βpr)^n, B := B + (βp)^n, n := n + 1.

Let 0 ≤ x̄ ≤ r be a fixed value of the information state. Call the stationary policy that uses x̄ as the control limit an "x̄-policy". Define by U_x̄(x) the expected present value of the discounted total future revenue if x is the current information state and an x̄-policy is undertaken, and let V_x̄(x) = [U_x̄(x) − m1/(1 − β)]/(m2 − m1). Further, define n_x = min{n | r^n x ≤ x̄} and δ = V_x̄(r). On the grounds of [4], the following is a significant extension:


Theorem B2. For any x̄-policy and an information state x, 0 ≤ x ≤ r, define, for any N ≥ 0,

    D_N = (βp)^N + β(1 − p)(1 − (βp)^N)/(1 − βp),
    E_N = (1 + β(1 − p)κ)(1 − (βpr)^N)/(1 − βpr).

Then

    V_x̄(x) = x E_{n_x} + (δ − κ) D_{n_x}    for x > x̄,
    V_x̄(x) = δ − κ                          for x ≤ x̄.    (B.1)

The value δ = V_x̄(r) is given by

    δ = (r E_{n_r} − κ D_{n_r}) / (1 − D_{n_r}).    (B.2)

Proof. Letting D1 = [0, 1], D2 = [1 − p, 2 − p] and D = D1 ∩ D2, we have P(Y ∈ D2\D | x) = x(1 − p), P(Y ∈ D1\D | x) = (1 − x)(1 − p), and P(Y ∈ D | x) = xp + (1 − x)p = p. The updated information states are, respectively, r, 0 and rx; see (A.2). Thus, by the usual dynamic programming argument,

    V_x̄(x) = x + β[(1 − p)x V_x̄(r) + p V_x̄(rx) + (1 − p)(1 − x) V_x̄(0)]    (CON),
    V_x̄(x) = −κ + V_x̄(r)                                                   (REP).    (B.3)

By the normalizing condition κ < r/(1 − βr), at x = 0 one replaces the machine, so the upper line in (B.3) becomes x + β[p V_x̄(rx) − (1 − p)(1 − x)κ + (1 − p)V_x̄(r)]. Writing V for V_x̄ and letting n = n_r = min{N | r^{N+1} ≤ x̄}, we have

    V(r) = r + β(1 − p)V(r) − β(1 − p)(1 − r)κ + βp V(r²),
    V(r²) = r² + β(1 − p)V(r) − β(1 − p)(1 − r²)κ + βp V(r³),
    ⋮
    V(r^n) = r^n + β(1 − p)V(r) − β(1 − p)(1 − r^n)κ + βp(−κ + V(r)).

Thus, by simple induction,

    V(r^{n−k}) = (1 + βpr + ⋯ + (βpr)^k) r^{n−k} [1 + β(1 − p)κ]
               + (1 + βp + ⋯ + (βp)^k) β(1 − p)[V(r) − κ] + (βp)^{k+1}[V(r) − κ].

Setting k = n − 1 and using δ = V(r) and the notation in the theorem, we get δ = r E_{n_r} + D_{n_r}(δ − κ), which gives (B.2). In a similar fashion,

    V(x) = x + β(1 − p)V(r) − β(1 − p)(1 − x)κ + βp V(rx),
    V(rx) = rx + β(1 − p)V(r) − β(1 − p)(1 − rx)κ + βp V(r²x),
    ⋮
    V(r^n x) = r^n x + β(1 − p)V(r) − β(1 − p)(1 − r^n x)κ + βp(−κ + V(r)),

this time for n = n_x. As before, this gives (B.1).    □
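A direct transcription of theorem B1's threshold computation, together with the value expressions (B.1)–(B.2) of theorem B2, might look as follows; it assumes κ < r/(1 − βr) and a control limit 0 < x̄ < r, and the variable names are ours:

```python
def optimal_threshold(r, p, beta, kappa):
    """x* of theorem B1 (Grosfeld-Nir); assumes kappa < r / (1 - beta * r)."""
    C = kappa / (1.0 + beta * (1.0 - p) * kappa)        # Step 0
    n, A, B = 1, 1.0, 1.0
    while True:                                         # Step 1
        S = (r * A - C) / B
        if r ** (n + 1) <= S < r ** n:
            return S
        A += (beta * p * r) ** n
        B += (beta * p) ** n
        n += 1

def value_of_policy(x, x_bar, r, p, beta, kappa):
    """V_xbar(x) of theorem B2 for the control-limit (x_bar-) policy, 0 < x_bar < r."""
    D = lambda N: (beta * p) ** N + beta * (1 - p) * (1 - (beta * p) ** N) / (1 - beta * p)
    E = lambda N: (1 + beta * (1 - p) * kappa) * (1 - (beta * p * r) ** N) / (1 - beta * p * r)

    def n_of(z):                       # n_z = min{n : r^n * z <= x_bar}
        n = 0
        while r ** n * z > x_bar:
            n += 1
        return n

    delta = (r * E(n_of(r)) - kappa * D(n_of(r))) / (1 - D(n_of(r)))   # (B.2)
    if x <= x_bar:
        return delta - kappa                                           # (B.1), lower branch
    return x * E(n_of(x)) + (delta - kappa) * D(n_of(x))               # (B.1), upper branch
```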


For theorems B1 and B2 to hold, κ < r/(1 − βr) must be assumed; otherwise x* = 0 and replacement never happens. See the proof of theorem 1 of section 3.

Appendix C: Sinuany-Stern et al. index heuristic for a machine replacement model with binomial observations

For the binomial distribution (example A1), Sinuany-Stern et al. [11] provide the following heuristic algorithm.

(a) Solve the fully observable problem, e.g. by Howard's routine. That is, find d*(i) for all 1 ≤ i ≤ n. (d*(i) stands for Howard's optimal decision for state i; it equals either 1 or 0.)

(b) Let i* = max{ j | d*(j) = 0 } and α_c = (α_{i*} + α_{i*+1})/2. (Parts (a) and (b) are conducted only once for a given process.)

(c) At any time t, let α*(π) ≡ α*(π(t)) = ∑_{i=1}^n α_i π_i(t).

(d) If α*(π(t)) ≥ α_c, set D(t) = 1 (replace); else D(t) = 0.

For the rationale behind this heuristic, see [11].
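A minimal sketch of this index rule (with d* obtained, for instance, from a policy-iteration routine as in section 3; the zero-based indexing and the assumption that the worst state is replaced under d* are ours):

```python
import numpy as np

def make_index_rule(alpha, d_star):
    """Steps (a)-(b): precompute alpha_c; return the decision rule of steps (c)-(d)."""
    # i* is the highest state kept under the fully observed solution (0-based here);
    # we assume d_star[-1] == 1, i.e. the worst state is replaced, so i* + 1 exists.
    i_star = max(i for i in range(len(alpha)) if d_star[i] == 0)
    alpha_c = (alpha[i_star] + alpha[i_star + 1]) / 2.0
    def decide(belief):
        return 1 if belief @ np.asarray(alpha) >= alpha_c else 0   # 1 = replace, 0 = continue
    return decide
```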

References

[1] C.S. Albright, Structural results for partially observable Markov decision processes, Oper. Res. 27(1979)1041–1053.
[2] D.P. Bertsekas, Dynamic Programming and Stochastic Control, Academic Press, New York, 1976.
[3] A.J. Duncan, Quality Control and Industrial Statistics, 5th ed., Irwin, Homewood, IL, 1986.
[4] A. Grosfeld-Nir, A two-state partially observable Markov decision process with uniformly distributed observations, Oper. Res. 44(1996)458–463.
[5] S.H. Kim and J.B. Ho, A partially observable Markov decision process with lagged information, J. Oper. Res. Soc. 38(1987)439–446.
[6] D.E. Lane, A partially observable model of decision making for fishermen, Oper. Res. 37(1989)240–254.
[7] W.S. Lovejoy, A survey of algorithmic methods for POMDPs, Ann. Oper. Res. 28(1991)47–65.
[8] W.S. Lovejoy, Computationally feasible bounds for POMDPs, Oper. Res. 39(1991)162–175.
[9] G.E. Monahan, A survey of POMDPs: Theory, models and algorithms, Manag. Sci. 28(1982)1–16.
[10] Z. Sinuany-Stern, Replacement policy under partially observed Markov process, Inter. J. Prod. Econ. 29(1993)159–166.
[11] Z. Sinuany-Stern, I. David and S. Biran, An efficient heuristic for a partially observable Markov decision process of machine replacement, Computers Oper. Res. 24(1997)117–126.
[12] C.C. White, A Markov quality control process subject to partial observation, Manag. Sci. 23(1977)843–852.
[13] C.C. White, Optimal control limit strategies for a partially observed replacement problem, Int. J. Sys. Sci. 10(1979)321–331.
[14] C.C. White, A survey of solution techniques for the POMDP, Ann. Oper. Res. 32(1991)215–230.
[15] C.C. White and D.P. Harrington, Application of Jensen's inequality for adaptive suboptimal design, J. Optim. Theory Appl. 32(1980)89–99.
[16] C.C. White and W.T. Scherer, Finite-memory suboptimal design for partially observed Markov decision processes, Oper. Res. 42(1994)439–455.
[17] D.J. White, Real applications of Markov decision processes, Interfaces 15(1985)7–83.
[18] D.J. White, Further real applications of Markov decision processes, Interfaces 18(1988)55–61.
