Accuracy of judgmental extrapolation of time series data: Characteristics, causes, and remediation strategies for forecasting
International Journal of Forecasting 14 (1998) 95–110
Eric Welch (a), Stuart Bretschneider (a), John Rohrbaugh (b,*)
(a) Center for Technology and Information Policy, The Maxwell School, Syracuse University, Syracuse, NY 13244-1090, USA
(b) Department of Public Administration and Policy, Rockefeller College of Public Affairs and Policy, University at Albany (SUNY), Albany, NY 12222, USA
* Corresponding author.
Accepted 30 June 1997
Abstract

This paper links social judgment theory to judgmental forecasting of time series data. Individuals were asked to make forecasts for 18 different time series that were varied systematically on four cues: long-term levels, long-term trends, short-term levels, and the magnitude of the last data point. A model of each individual's judgment policy was constructed to reflect the extent to which each cue influenced the forecasts that were made. Participants were assigned to experimental conditions that varied both the amount of information and the forecasting horizon; "special events" (i.e. discontinuities in the time series) also were introduced. Knowledge and consistency were used as measures of the judgment process, and MPE and MAPE were used as measures of forecast performance. Results suggest that consistency is necessary but not sufficient for the successful application of judgment to forecasting time series data. Information provided for forecasters should make long-term trends explicit, while the task should be limited to more immediate forecasts of one or two steps ahead to reduce recency bias. This paper provides one method of quantifying the contributions and limitations of judgment in forecasting. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Judgmental Forecasting; Experiment; Social Judgment Theory
1. Introduction

Virtually all organizations must forecast in order to function effectively. Over 30 years ago forecasting was asserted to be perhaps the most amenable of all problem-solving processes in organizations to automation (Turban, 1972). Subsequent surveys on the diffusion and use of the methods of management science in organizations suggested that forecasting applications led all others in both industry and government use (Ledbetter and Cox, 1977; Mentzer and Cox, 1984; Fildes and Ranyard, 1997). Considerable effort has been devoted to the development of new optimizing techniques that might reduce forecast bias and forecast variance even further (Armstrong, 1985; Makridakis et al., 1984; Meese and Geweke, 1984; Collopy and Armstrong, 1992; Fildes, 1992; Hill et al., 1994). While no single approach to forecasting has been found that is globally optimal, methods can be matched with specific circumstances to identify locally optimal approaches. For example, one important finding in numerous comparative
forecasting studies is that simple extrapolative time series models often outperform complex causal models for near-term events (Armstrong, 1985; Makridakis et al., 1993). This is partly a result of overfitting causal models to historical samples, then forecasting ex ante, and partly due to problems in predicting inputs or exogenous variables for these models (Ashley, 1983). While organizations may be well justified in working from time series data alone in forecasting near-term events, they typically do not rely on the predictions that extrapolative time series models provide. Most forecasts still are based largely on judgment or incorporate considerable post-model adjustment by managers (Mentzer and Cox, 1984; Dalrymple, 1987; McNees, 1990; Turner, 1990). Human judgment remains the most frequently used forecasting method for several reasons. A mathematical model may be viewed as a ‘‘black box’’ with mechanisms not fully understood or controlled relative to one’s own assessment of a situation (Langer, 1975; Kleinmuntz, 1990). Often a forecast is not intended strictly as an objective extrapolation but rather intermixed with concerns for setting achievable goals and distributing performance rewards; thus, predictive accuracy may not be an organization’s highest priority (Bretschneider and Gorr, 1991; Bretschneider et al., 1989). Managers also may believe that they are at an advantage because of their access to useful outside (i.e. non-time series) information exogenous to the model (Edmundson et al., 1988; Sanders and Ritzman, 1992). Within the past decade considerably greater attention has been devoted to the role of judgment processes in forecasting, including judgmentally adjusting statistical forecasts and even formally integrating judgmental and statistical techniques to increase forecast accuracy (Bunn and Wright, 1991; Armstrong and Collopy, 1992, 1993; Goodwin and Wright, 1993). Some evidence indicates that judgment can provide improvement over strictly statistical approaches (Lawrence et al., 1985, 1986; Lobo and Nair, 1990; Mathews and Diamantopoulos, 1986, 1989; McNees, 1990; Sanders, 1992; Wolfe and Flores, 1990), but other studies have cautioned that judgment actually may decrease forecast accuracy (Armstrong, 1986; Carbone et al., 1983; Lawrence and O’Connor, 1992; Lim and O’Connor, 1995; Remus et al., 1995; Willemain, 1991).
Such conflicting and unresolved results are indicative of the necessity to better explicate circumstances in which judgment processes may be helpful or harmful to forecasting. In fact, when Goodwin and Wright (1993, p. 158) completed their review of this literature, they concluded: ...there is a need for much greater understanding of the cognitive processes adopted in judgmental forecasting tasks, in order to design improved forecasting support systems and strategies. Much work has been carried out so that we might develop an understanding of the cognitive processes involved in decision making..., and it may be useful to explore the extent to which this has implications for forecasting. To date, existing work on judgment in forecasting has barely connected with the psychological literature on cognitive processes. For example, although recent work (Lawrence and Makridakis, 1989; Lim and O’Connor, 1995; O’Connor et al., 1993; Sanders, 1992) has documented aggregate judgmental differences across specific task conditions that forecasters commonly face, these studies did not explore individual judgment policies. Somewhat more pertinent is the research of Lawrence and O’Connor (1992) in which the first effort was made at constructing a ‘‘bootstrap’’ model—a regression model of the judge based on time series cues—for a forecasting task. This work was limited in large part due to the attempt to model a single, common policy for all forecasters rather than exploring what might be important individual differences between participants regarding the manner in which time series information was used. Because individual forecasters may approach their assigned task in distinct ways, empirical methods should be explored that can help to characterize explicitly the alternative judgment policies in use. Such individual judgment analysis (Cooksey, 1996), of course, does not preclude the investigation of intra-group similarities and inter-group contrasts in cognitive processes, as the present study will illustrate. However, research capacity at the individual level of analysis to identify differences between participants, unlike strictly group-level studies, also can advance both forecasting theory and practice by determining the extent to which certain heuristics
and biases appear prevalent (i.e. to establish what proportion of individuals exhibit specific judgmental weaknesses or strengths). Knowledge about the extent to which heuristics and biases affect forecasting outcomes can help forecasters to both produce better methods of forecasting and to improve the quantification of judgment in forecasting tasks.
2. The research problem

The purpose of the present study is to develop a research strategy that for the first time directly connects the psychological study of human judgment with the substantial forecasting literature on the extrapolation of time series data. Rather than studying the pattern of judgments for undifferentiated participant groups in response to dichotomous distinctions in task characteristics (see, for example, Eggleton, 1982; Lawrence and Makridakis, 1989; Sanders, 1992), our plan was to elicit a large number of forecasts from each individual for multiple time series that systematically vary in a number of salient characteristics. This empirical project had three main objectives. The first objective was to increase understanding of the cognitive bases of inaccuracy associated with judgmental extrapolation of time series data. In particular, the present study was designed to assess the extent to which forecast error and forecast bias might be explained by the levels of knowledge (the match between the models of environment and forecaster) and consistency (the reliability of information processing) that characterize the forecasting; environmental predictability, fidelity of the information system, and reliability of information acquisition were not considered explicitly (Stewart and Lusk, 1994). Here knowledge refers more narrowly to recognizing the essential properties of a prediction task and to applying such insight with suitable skill when a forecast is required. In subsequent study of judgment based on Brunswik's (1956) lens model (Tucker, 1966; Castellan, 1973; Stewart, 1976), this skill component has been designated as G, the correlation over a set of forecasts between predictions from a model of the forecaster (i.e. a social judgment model of the individual) and predictions from a model of the environment (i.e. true outcomes). The reliability
of information processing has been termed consistency (also ‘‘judgmental predictability’’ and ‘‘cognitive control’’). Consistency is a measure of the extent to which forecasting knowledge is used with unvarying rigor—whatever information to be considered is combined in a manner that would yield identical predictions every time identical circumstances occurred. This skill component has been designated as R 2 , the proportion of variation over a set of forecasts (as the criterion measure) that can be accounted for by using the information available to the forecaster (as the predictor measures). The lens model of the four-cue forecasting task used in the present study is illustrated in Fig. 1. The second objective of this study was to assess the impact of several key situational factors that might be expected to influence the forecast process and forecast outcomes. In particular, the present study explored whether (a) access to explicit statistical information, (b) an increase in task demands, and (c) the occurrence of an abrupt change in the time series might alter the forecasting performance observed. Work by Ashton (1991, 1992) and Powell (1991) has suggested that the availability of explicit statistical information in the form of a decision rule will improve the accuracy of an individual’s predictions, though such ‘‘mechanical’’ aids do not appear to be used fully enough to produce maximum performance (also see Arkes et al., 1986, 1987). A study of dynamic decision making (Richardson and Rohrbaugh, 1990) also found that participants performed better when pertinent cues were emphasized clearly but that offering an explicit decision rule (as a weighted formula) interfered significantly with their learning of the task (also see Lim and O’Connor, 1995). The present research was designed, in part, to test whether such differences in aid (e.g. decision rule versus cue values only) would similarly affect accuracy in the judgmental extrapolation of time series data. Although there are a variety of ways of conceptualizing the demands of forecasting tasks such as the data scale, the series length, and tabular versus graphical displays (Goodwin and Wright, 1993), the focus of the present study was on the number of forecasts elicited for each time series: one-step-ahead only or multiple predictions for two or more future periods. The cognitive effect of making multiple—
Fig. 1. Lens model of the forecasting task with four cues.
rather than one—single-period forecasts is not well known. Lawrence and Makridakis (1989), for example, found less dampening of trends (underprediction) in first forecasts than in second forecasts (also see Sanders, 1992). Lawrence and O'Connor (1992) reported less emphasis on the last observation at longer forecast horizons. All participants in each of these studies made exactly the same number of predictions. No study has investigated how or why the requirement for additional forecasting work (two-steps-ahead or three-steps-ahead predictions) may differentially improve or interfere with judgmental performance. The occurrence of an abrupt change in time series data due to "special events" is observed in many situations that managers must confront (Gorr, 1986; Tsay, 1988), but relatively little study has been devoted to the impact of such discontinuities on the accuracy of judgmental forecasts. Recent investigations by O'Connor et al. (1993) and Remus et al. (1995) have concluded that, although individuals appear able to identify and respond to discontinuities in their forecasting, they still underperform simple statistical models. In fact, the available evidence suggests that judgmental forecasts in times of change
may be too sensitive to random fluctuations, resulting in ‘‘ ‘noisy’ behavior’’ (O’Connor et al., 1993). Additional study with a cognitive focus at the individual level of analysis clearly is warranted to test the extent to which consistency, as well as knowledge, are affected by ‘‘special events’’ in time series. Finally, the third objective was to investigate whether context dependence (such as recency), heuristics (such as availability), and biases (such as conservatism) might be explicitly traceable by systematically decomposing forecasting skill into its cognitive elements. Although a great deal of behavioral decision research has been directed at fundamental problems of managerial judgment and choice (see, for example, Baron, 1994; Dawes, 1988; Bazerman, 1994), little attention has been paid to forecasting problems involving time series data (Beach et al., 1986; Hogarth and Makridakis, 1981). It is evident, however, that the heuristic approach managers may take to forecasting could be expected to lead to severe errors. For example, certain aspects of the time series such as long-term trend are less directly retrievable and, therefore, less immediately available for anchoring (Lawrence and O’Connor,
1992). Conservatism may lead to systematic underestimation of growth or decline (Lawrence and Makridakis, 1989). A tendency to rely too greatly on the latest information (i.e. a recency effect) and to neglect the value of base rates and larger sample size can increase inaccuracy, as well (Remus et al., 1995).
3. Method of study
3.1. Forecasting task

The forecasting task employed in this experimental study was developed to represent typical nonseasonal, linear-trended time series with four key information features: (a) the average of the most recent 12 periods (long-term level); (b) the average growth per period (long-term trend); (c) the average of the most recent three periods (short-term level); and (d) the measurement at the most recent period (the last available data point). Each of these four cues assumed one of three possible values (i.e. visually distinguishable low, medium, and high levels). For example, the three time series displayed in Fig. 2 vary in level of long-term trend from no growth on average (Fig. 2c), to an increase of ten units per period on average (Fig. 2b), and up to an increase of 20 units per period on average (Fig. 2a). Similarly visible differences in the other three cues were also constructed. Altogether nine time series were constructed to produce a completely orthogonal design of uncorrelated cues. It is important to note that the forecasting task was developed in a manner fully consistent with Stewart's (1988) reasonable prescriptions for the conduct of judgment research within the paradigm of social judgment theory.
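As a rough illustration of how stimuli of this kind could be generated and how the four cues are read off a series, consider the following sketch in Python. The base level, noise level, series length, and random seed are assumptions chosen for illustration, not the authors' exact construction.

import numpy as np

def make_series(trend, base=500.0, noise_sd=25.0, n=15, seed=0):
    # Non-seasonal, linear-trended series in the spirit of the study's stimuli.
    # `trend` is the average growth per period (0, 10, or 20 in Fig. 2);
    # `base`, `noise_sd`, `n`, and `seed` are illustrative assumptions only.
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    return base + trend * t + rng.normal(0.0, noise_sd, size=n)

def cue_values(y):
    # The four information cues defined in Section 3.1.
    long_term_level = y[-12:].mean()            # average of the most recent 12 periods
    long_term_trend = np.diff(y[-12:]).mean()   # average growth per period
    short_term_level = y[-3:].mean()            # average of the most recent 3 periods
    last_point = y[-1]                          # measurement at the most recent period
    return long_term_level, long_term_trend, short_term_level, last_point

y = make_series(trend=10)
print(cue_values(y))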
3.2. Experimental design

The present study was based on a three-factor experimental design, the last factor being a repeated measure. The first factor varied participants' access to explicit statistical information. Three experimental levels for this factor were devised: (a) no particular cues highlighted; (b) clearly emphasized cues; and (c) an explicit judgment policy. In the first condition, participants viewed only the time series data such as
one of the series depicted in Fig. 2. In the second condition, participants were provided with a table of statistical information about the four cues (i.e. longterm trend, long-term level, short-term level, and most recent point) displayed next to each time series. In the third condition, participants were provided with information identical to the second condition, as well as an arithmetically specified forecasting rule, a form of explicit judgment policy for combining the available information. In the present study, these are termed ‘‘amount of information’’ conditions; the differences between the three conditions are depicted in Fig. 3. No experimental manipulation of incentives, feedback, or justification, known to sometimes reduce participants’ reliance on decision rules (Ashton, 1991), was introduced. Participants also were assigned to one of three levels for the second factor that incrementally increased the task demands in the experiment. In the first condition, participants were asked to make only one forecast for each time series: a single prediction for one step ahead. In the second condition, participants were asked to make two forecasts for each time series, both for one step ahead and for two steps ahead. In the third condition, predictions were collected from participants three times, for one step ahead, for two steps ahead, and for three steps ahead on each time series. The need to produce additional forecasts in the second and third conditions required somewhat more thought and more time. However, only the first forecast (one step ahead) made by every participant was incorporated into the analyses reported here. A repeated measure was introduced by asking all participants to make forecasts for two sets of the nine time series. In the first set, no change in level or trend was shown within any time series. In the second set, initially identical to the first, an abrupt change or discontinuity occurred that affected both level and trend following the seventh period. This ‘‘special event’’ produced a visually distinguishable gain in both long-term level and long-term growth. All participants, regardless of the conditions to which they were assigned, were asked to forecast the next three steps ahead for this second set of time series data. Furthermore, no information other than the time series itself was made available to any participants.
Fig. 2. Three illustrative time series varying in long-term trend. (a) Average growth per period: 20. (b) Average growth per period: 10. (c) Average growth per period: 0.
Fig. 3. Differences between experimental conditions in amount of information. (a) No particular cues highlighted. (b) Clearly emphasized cues. (c) Explicit judgment policy.
In summary, the design contained a task of nine distinct time series with systematic variation on two factors: the amount of information and the number of steps ahead predicted. To this was added an additional nine time series requiring all participants to forecast three steps ahead without any special information provided. The latter time series contained the ‘‘special event’’ treatment. It should be emphasized that, regardless of experimental condition, all participants made forecasts for the full set of 18 varied time series data.
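The structure of the design can be summarized compactly. The sketch below simply enumerates the nine between-subjects cells and the repeated measure described above; the condition labels are paraphrases for illustration, not the authors' exact wording.

import itertools

# Between-subjects factors and the repeated measure (labels are descriptive only).
information = ["no cues highlighted", "clearly emphasized cues", "explicit judgment policy"]
steps_ahead = [1, 2, 3]                                              # forecasts elicited per series
series_sets = ["no discontinuity", "special event after period 7"]   # repeated measure

cells = list(itertools.product(information, steps_ahead))            # nine between-subjects cells
for info, steps in cells:
    print(f"{info:26s}  steps ahead required: {steps}")
print(f"{len(cells)} conditions; every participant forecasts {9 * len(series_sets)} series "
      f"(nine in each of the {len(series_sets)} sets).")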
3.3. Experimental procedure

To reduce the potential for introducing external forms of bias, the laboratory study was computer automated in its presentation and data collection. Each of the nine combinations of conditions was organized as an independent spreadsheet application and maintained on separate software disks. Each application consisted of a single file of a self-contained workbook comprised of multiple worksheets. Each participant was scheduled to arrive individually at the experimental site, was given one of the nine disks and instructed to open the file. All participants
were volunteers and no rewards or incentives were offered to them. The first three worksheets contained a set of instructions, a number of demographic questions, and a practice exercise that was to be completed before the actual forecasting task was undertaken. All sheets, including the 18 time series, were designed to be one screen in size and locked. Every effort was made to represent graphically the time series data to reduce known experimental artifacts including adequate spacing above the plot, appropriate scaling, and limited length of the series (Lawrence and Makridakis, 1989; Lawrence and O’Connor, 1992). Because level of performance is known to vary based on direction of trend (Remus et al., 1995), the present study attempted to control this effect by using upward and level series only.
3.4. Participants in the study

The research involved a group of 38 participants (23 males and 15 females) selected from two populations of graduate students in equivalent programs of public management in two New York universities. All participants had enrolled in at least two graduate courses devoted to the study of statistics and/or
management science. The average age of the participants was 27, and they averaged 4.5 years of employment experience.
3.5. Dependent measures: forecast process and forecast outcome

Two sets of dependent measures were distinguished in the present study: forecast process measures and forecast outcome measures. Consistent with the conduct of judgment analysis, forecast process was measured both by judgment consistency (R²) and task knowledge (G). Additionally, attention was given to measuring the relative importance or weight that each participant appeared to attach to the four cues that were systematically varied in the experimental design: long-term level, long-term trend, short-term level, and last available data point. The task was constructed so that the two cues long-term level and trend (or, as in Fig. 3, "average" and "growth"), when added together, would forecast exactly the one step ahead data point at period 16. In short, the unstandardized regression coefficients for long-term level and trend were both 1.0; R² for the task was 1.0. Because the variability of long-term trends was greater than that of long-term levels across the time series, the standardized regression coefficients (betas) for long-term trend and level were approximately 0.80 and 0.60, respectively. It was anticipated that participants would differ across experimental conditions with respect to the level of their consistency and knowledge, as well as the profile of their four cue weights. For example, a recency effect would be demonstrated to the extent that participants placed weight on the last available data point. The availability heuristic would be demonstrated by participants placing lower weight on long-term trend when they themselves must try to retrieve it from the time series (i.e. when the average growth statistic is not displayed). Forecast outcome was measured both with mean absolute percentage error (MAPE) and with mean percentage error (MPE). MAPE is an overall measure of forecast accuracy, computed from the absolute differences between a series of forecasts and actual data points observed. Here the actual values are defined as the expected future value based on the long-term level and trend information. Each absolute
difference is expressed as a percentage of each actual data point, then summed and averaged. MPE is an overall measure of forecast bias, computed from the actual differences between a series of forecasts and actual data points observed; each difference is expressed as a percentage of each observed data point, then summed and averaged.
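The two outcome measures can be computed directly from the forecasts and the task-defined actual values. A brief sketch follows; it assumes NumPy arrays and the sign convention that a negative MPE indicates underprediction (conservatism), consistent with the definitions above.

import numpy as np

def mape(forecasts, actuals):
    # Mean absolute percentage error: overall forecast accuracy (in percent).
    f, a = np.asarray(forecasts, float), np.asarray(actuals, float)
    return 100.0 * np.mean(np.abs(f - a) / np.abs(a))

def mpe(forecasts, actuals):
    # Mean percentage error: overall forecast bias (in percent); negative values
    # indicate underprediction under this sign convention.
    f, a = np.asarray(forecasts, float), np.asarray(actuals, float)
    return 100.0 * np.mean((f - a) / a)

# Example with purely illustrative numbers; the "actual" values are, as in the study,
# the expected values implied by long-term level and trend.
print(mape([660, 710], [650, 700]), mpe([660, 710], [650, 700]))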
4. Results
4.1. Correlations between measures of forecast process and forecast outcome

To examine the results, measures of consistency (R²), knowledge (G), accuracy (MAPE), and bias (MPE) were generated for each participant for the first and for the second nine series forecasted. The correlations between each of these measures for the two groups of time series are presented in Table 1. It is interesting to note that the two measures of forecast process, R² and G, were not highly correlated for either set of nine time series (r = 0.28 and 0.33, respectively); only about 10% of the variance was shared. Similarly, the two measures of forecast outcome, MAPE and MPE, also were not highly correlated for either set of nine time series (r = −0.37 and 0.23, respectively). Only in the first set did greater absolute error appear to be associated significantly (p < 0.05) with bias, in particular, with underestimation or conservatism. Although the two forecast process measures appeared unrelated to MPE, they were correlated significantly (p < 0.05) with MAPE for both sets of nine time series. Consistency and absolute error were the more strongly correlated (r = −0.64 for both sets), but knowledge and absolute percentage error only somewhat less so (r = −0.53 and −0.56, respectively). The implication of these significant relations between the two forecast process measures and MAPE is given considerably greater attention below.
4.2. Forecast process measures: consistency (R²)
Table 1
Correlations between dependent measures, calculated for both sets of time series data (each cell shows first set / second set)

                      Consistency R²    Knowledge G       MAPE             MPE
Consistency R²              —           0.28 / 0.33    −0.64 / −0.64    0.47 / −0.02
Knowledge G            0.28 / 0.33           —         −0.53 / −0.56    0.25 / 0.19
MAPE                  −0.64 / −0.64    −0.53 / −0.56         —         −0.37 / 0.23
MPE                    0.47 / −0.02     0.25 / 0.19    −0.37 / 0.23          —
Mean                   0.94 / 0.90      0.94 / 0.93     10.1 / 7.7       0.2 / −0.1
Standard deviation     0.06 / 0.11      0.04 / 0.05      3.4 / 2.3       4.6 / 3.5

MAPE = mean absolute percentage error; MPE = mean percentage error.
The judgment models based on the four information cues, produced to account for variability in forecasts, appeared to be well specified. The mean R² for all of the models was 0.92; over 80% of the R² statistics exceeded 0.85. As reported below, the judgment models for the first set of time series data showed greater consistency than for the second set, with mean R² statistics of 0.94 and 0.90, respectively. Forecast consistency (as all the dependent measures) was tested in a 3 × 3 × 2 analysis of variance (ANOVA) with one repeated measure. Table 2 shows that a significant main effect was found for amount of information (F[2, 58] = 7.82, p < 0.01) but not for number of steps ahead (F[2, 58] = 2.30). A significant main effect also was found for the repeated measure (F[1, 58] = 4.47, p < 0.05). Only one interaction was significant: amount of information × number of steps ahead (F[4, 58] = 2.64, p < 0.05). The mean consistency statistics for the three amount of information conditions shown in Table 3 indicate that time series displays without summary information (mean R² statistics of 0.90 for the first set and 0.86 for the second set) were associated in a posteriori tests with lower consistency levels than the other two information conditions. The significant interaction (amount of information × number of steps ahead) is illustrated clearly in Fig. 4 for the second set of time series data.
Table 2
F-ratios and significance tests for key analysis of variance results

                                  Consistency R²   Knowledge G     MAPE
Main effects
  Amount of information (A)            7.82***        5.20***    15.36***
  Number of steps ahead (B)            2.30          11.37***     4.53**
  Repeated measure: set (C)            4.47**         1.59       18.05***
Interactions
  A × B                                2.64**         2.26*       3.13**
  A × C                                0.32           0.35        1.29
  B × C                                0.92           0.54        0.45
  A × B × C                            1.21           1.21        0.63

Note: * p < 0.10; ** p < 0.05; *** p < 0.01. MAPE = mean absolute percentage error.
Table 3
Arithmetic means produced by varying experimental conditions

                                  Consistency R²     Knowledge G        MAPE
Set (C):                           One     Two       One     Two      One     Two
Amount of information (A)
  No cues highlighted              0.90    0.86      0.92    0.92     12.4    8.9
  Clearly emphasized cues          0.96    0.94      0.95    0.93      8.6    6.9
  Explicit judgment policy         0.96    0.92      0.94    0.94      9.0    7.0
Number of steps ahead (B)
  One step ahead                   0.93    0.90      0.94    0.95      9.5    7.8
  Two steps ahead                  0.95    0.93      0.95    0.94      9.7    7.0
  Three steps ahead                0.94    0.87      0.91    0.89     11.3    8.5
Grand means                        0.94    0.90      0.94    0.93     10.1    7.7

MAPE = mean absolute percentage error.
Fig. 4. Graphical display of results (second set of time series). (a) Consistency (R²). (b) Long-term trend (component of G). (c) Mean absolute percentage error (MAPE).
Participants receiving time series displays without summary information and required to make forecasts for three steps ahead are shown in Fig. 4 to have produced sharply less consistent forecasts (Mean = 0.68).
4.3. Forecast process measures: knowledge (G)

The four informational dimensions of the time series data presented to the participants were identified as orthogonal cues for the judgment modeling: long-term trend, long-term level, short-term level, and last available data point. In this study, a participant's cue weights from their individual social judgment model were measured by calculating the proportion of the total variance explained (R²) in the judgment model that could be associated independently with each of these four predictors. In general, long-term trend (Mean = 0.60) and long-term level (Mean = 0.30) were the two cues most directly associated with the observed individual forecasts; short-term level (Mean = 0.00) and last available data point (Mean = 0.10) were less influential cues. Cue weights remained relatively stable across the two sets of time series data. However, as reported below, the cue weights for long-term trend in the first set of time series data were higher than for the second set (Mean = 0.62 and Mean = 0.57, respectively). Overall knowledge (G) of the correct cue weights was tested separately in a 3 × 3 × 2 ANOVA with one repeated measure. Table 2 shows that a significant main effect was found both for amount of information (F[2, 58] = 5.20, p < 0.01) and for number of steps ahead (F[2, 58] = 11.37, p < 0.01); the main effect for the repeated measure was nonsignificant. Only one interaction was significant: amount of information × number of steps ahead (F[4, 58] = 2.26, p < 0.10). To identify more precisely the source of the experimental effects on participants' knowledge, a 3 × 3 × 2 ANOVA with one repeated measure was conducted for each of the four cues that served as components of G. (The component for each cue was calculated as the ratio of the squared difference between cue utilization and cue validity to a measure of the participant's consistency; the greater the discrepancy between utilization and validity, relative to consistency, the smaller the G.) Only the cue component related to long-term trend differed significantly across experimental conditions. A significant main effect was found both for amount of information (F[2, 58] = 10.03, p < 0.01) and for number of steps ahead (F[2, 58] = 6.46, p < 0.01). A significant main effect also was found for the repeated measure (F[1, 58] = 3.41, p < 0.10). Only one interaction was significant: amount of information × number of steps ahead (F[4, 58] = 4.26, p < 0.01). These results specifically for
long-term trend were strikingly parallel to the results overall for G. The mean components of G for the three amount of information conditions indicated that time series displays without summary cue information were associated in a posteriori tests with more inaccurate use of long-term trends than the other two information conditions. The mean components for the three number of steps ahead conditions indicated that forecasts required for three steps ahead were associated with more inaccurate use of long-term trends than the other conditions, that is, forecasts for only one or two steps ahead. The significant interaction (amount of information × number of steps ahead) is illustrated clearly in Fig. 4 for the second set of time series data. Participants receiving time series displays without summary information and required to make forecasts for three steps ahead are shown in Fig. 4 to have produced sharply inaccurate use of long-term trends.
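The process measures reported in Sections 4.2 and 4.3 can be reproduced, in outline, from a participant's forecasts and the cue values. The sketch below is an illustration under the assumptions stated in its comments, not the authors' estimation code: it fits an individual judgment policy by least squares, reads consistency off the model fit, partitions explained variance into relative cue weights, and computes knowledge as the correlation between the forecaster model and the environment model (long-term level plus trend, as in Section 3.5).

import numpy as np

def judgment_analysis(cues, forecasts):
    # `cues`: (n, 4) array with columns [long-term level, long-term trend,
    # short-term level, last data point]; assumed (near-)orthogonal by design.
    # `forecasts`: the participant's n one-step-ahead predictions.
    cues = np.asarray(cues, float)
    forecasts = np.asarray(forecasts, float)
    X = np.column_stack([np.ones(len(forecasts)), cues])
    beta, *_ = np.linalg.lstsq(X, forecasts, rcond=None)        # model of the forecaster
    fitted = X @ beta
    r2 = 1.0 - np.var(forecasts - fitted) / np.var(forecasts)   # consistency (R^2)
    # With orthogonal cues, squared zero-order correlations partition the explained
    # variance; normalising them gives relative cue weights.
    sq_corr = np.array([np.corrcoef(cues[:, j], forecasts)[0, 1] ** 2
                        for j in range(cues.shape[1])])
    weights = sq_corr / sq_corr.sum()
    env = cues[:, 0] + cues[:, 1]          # model of the environment: level + trend
    g = np.corrcoef(fitted, env)[0, 1]     # knowledge (G)
    return r2, weights, g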
4.4. Forecast outcome measures: forecast error (MAPE)

One measure of forecast outcome used in the present study was the mean absolute percentage error (MAPE). The MAPE found across all nine experimental conditions was 8.9%. As reported below, the MAPE for the first set of time series data showed greater error than for the second set, with MAPE statistics of 10.1% and 7.7%, respectively. Forecast error was tested in a 3 × 3 × 2 ANOVA with one repeated measure. Table 2 shows that a significant main effect was found both for amount of information (F[2, 58] = 15.36, p < 0.01) and for number of steps ahead (F[2, 58] = 4.53, p < 0.05). A significant main effect also was found for the repeated measure (F[1, 58] = 18.05, p < 0.01). Only one interaction was significant: amount of information × number of steps ahead (F[4, 58] = 3.13, p < 0.05). MAPE statistics for the three amount of information conditions shown in Table 3 indicate that time series displays without summary information (MAPE of 12.4% for the first set and 8.9% for the second set) were associated in a posteriori tests with greater error than the other two information conditions. The MAPE statistics for the three number of
steps ahead conditions also shown in Table 3 indicate that forecasts required for three steps ahead (MAPE of 11.3% for the first set and 8.5% for the second set) produced greater error than the other conditions, that is, forecasts for only one or two steps ahead. The significant interaction (amount of information × number of steps ahead) is illustrated clearly in Fig. 4 for the second set of time series data. Participants receiving time series displays without summary information and required to make forecasts for three steps ahead are shown in Fig. 4 to have produced sharply greater errors (MAPE of 11.8%).
4.5. Forecast outcome measures: forecast bias (MPE)

The second outcome measure assessed possible bias in participants' forecasts: mean percentage error (MPE). Although the average MPE found across all nine experimental conditions was virtually zero (Mean = 0.1), considerable variation existed (SD = 4.3). Participants' MPE statistics ranged from −12.7 to 10.0; over 20% of the participants produced MPE statistics that were below −5.0 or above 5.0. Forecast bias was tested in a 3 × 3 × 2 ANOVA with one repeated measure. No significant main effects or interactions were found. Participants did not appear to differ in conservatism or excessiveness.
5. Discussion

5.1. Overview of results

All three experimental factors had significant effects on forecast process and forecast outcome, but in somewhat different ways. Increasing the amount of information by emphasizing the four cues improved both forecast consistency and knowledge, leading to lower MAPE statistics. The opposite effect was found in increasing the number of steps ahead for which forecasts were elicited: decreased forecast knowledge (consistency remained approximately the same), leading to higher MAPE statistics. The abrupt change that participants encountered in the second set of time series decreased their consistency (knowledge remained approximately the same), but MAPE statistics actually were lower than in the first set. (Because the actual data points observed in the second set of time series following the abrupt change were greater than in the first set, total absolute errors in the second set that were equal in magnitude to total absolute errors in the first set would produce a lower MAPE statistic in the second set. Although the mean MAPE statistic was significantly lower in the second set, total absolute errors were significantly higher on average in the second set (F[1, 58] = 13.52, p < 0.01).) The significant interaction of these effects was particularly pronounced for participants simultaneously in conditions with no cues highlighted (i.e. the least amount of information) and forecasting three steps ahead (i.e. the greatest amount of work). Even for the second set of time series data, these participants produced a mean consistency (R²) as low as 0.68, a mean knowledge (G) as low as 0.83, and a MAPE as high as 11.8 (for the first set higher still: 15.9). In contrast, participants simultaneously in conditions with clearly emphasized cues (i.e. the middle amount of information) and forecasting two steps ahead (i.e. the middle amount of work) produced a mean consistency (R²) as high as 0.96, a mean knowledge (G) as high as 0.95, and a MAPE as low as 5.9 for the second set of time series data.

5.2. Context dependence, heuristics, and biases
The use of judgment analysis in the present study allowed for a more careful assessment of an individual's context dependence, heuristics, and biases than has been possible in previous studies of the extrapolation of time series data. For example, although Lawrence and O'Connor (1992, p. 24) concluded that there was statistically significant evidence of recency in the forecasts they observed, they were not able to decompose each respondent's judgments in a manner that could isolate a recency effect. Here, over 20% of the individual participants were identified as producing a cue weight of at least 0.15 on the last available data point (Mean = 0.10), a sizable recency effect; cue weights ranged from 0.00 to 0.33. As reported above, this recency effect was
evoked most strongly in the experimental condition that required forecasts for three steps ahead. The tendency to rely inappropriately on the last available data point appeared to be the primary explanation for participants in this condition generating significantly lower knowledge (G) scores and, consequently, significantly higher MAPE statistics. Because the long-term trend is somewhat difficult to retrieve from a graphic display of time series data, participants in the experimental condition lacking explicit statistical information did not receive a precise indication of periodic growth. In conditions where long-term trend information was more directly available, the evidence demonstrates that participants made significantly greater use of this cue. The tendency to underutilize long-term trend information in making forecasts appeared to be the primary explanation for participants in the condition without summary cue information (see Fig. 3) generating significantly lower knowledge (G) and consistency (R²) scores and, consequently, significantly higher MAPE statistics. In this study, the use of the availability heuristic with respect to long-term trend was advantageous to forecasting accuracy. On the whole, participants in the present study displayed virtually no forecast bias as measured by the MPE statistic (Mean = 0.1). Furthermore, none of the experimental conditions appeared to arouse a tendency toward either latent conservatism or excessiveness. In the first set of nine time series, correlational evidence (see Table 1) appeared to suggest that participants who were less consistent in their forecasts tended to underestimate actual data points (r = 0.47); similarly, the larger the MAPE statistic, the stronger was the conservatism bias observed (r = −0.37). However, this pattern of significant correlations disappeared in the second set of nine time series, making any conclusion about forecast bias on the basis of this study highly speculative.
5.3. The relation of forecast process and forecast outcome

Because the present study decomposed individual forecasting skill into the components of knowledge and consistency, it was possible to link the forecast process with forecast outcomes. For example, for the second set of time series data, four of the most
skilled participants produced a MAPE of under 5%. To accomplish this, they consistently used the two correct cues—long-term trend and long-term level—in the correct way: weighting them approximately 0.65 and 0.35, respectively. In contrast, five participants produced a MAPE of over 10%. This poor performance was linked to two problems: their inconsistency and incorrect use of cues. Two of these inaccurate forecasters simply were too inconsistent to produce good results (R² statistics of 0.83 and 0.74). Two other poor performers had high consistencies (R² statistics of 0.99 and 0.92) but systematically underutilized the long-term trend information by about 60%, mistakenly relying substantially on the last available data point (cue weight of approximately 0.25). The fifth individual with a MAPE of over 10% suffered from both inconsistency (R² of 0.66) and cue weight problems. The connection between the R² and MAPE statistics across experimental conditions is evident graphically by comparing Fig. 4a and 4c, which both pertain to the second set of time series data. The R²–MAPE correlation at this group level of analysis was −0.96 (n = 9 cells); for the first set, the parallel correlation was −0.89. The association of forecasting consistency and forecasting error was somewhat less strong at the individual level of analysis; as shown in Table 1, the correlation was −0.64 (n = 38 participants). In short, forecasting consistency was a necessary but not sufficient condition for individuals to minimize forecasting error. What was the additional contribution of knowledge (G) to the explanation of variability in forecasting error? G was relatively independent of R² as a second predictor (r = 0.33) but significantly correlated with MAPE (r = −0.56). The multiple R produced by jointly using the forecast process measures of consistency and knowledge to predict the extent of forecasting error was 0.74. As Stewart and Lusk (1994) have shown, conditional/regression bias and unconditional/base rate bias also can affect observed forecasting skill. Supplementary correlational analyses indicated that conditional/regression bias was correlated significantly with MAPE (r = 0.59) in the present study and, as a third predictor, increased the multiple R to 0.89. (Conditional/regression bias indicates whether the standard deviations of the forecasts are appropriately reduced to
account for a less than perfect correlation. This skill component was uncorrelated with either G (r = −0.04) or R² (r = −0.16).) Unconditional/base rate bias, which reflects the match between the mean of the forecasts and the mean of the observed data points, was not found to be related to MAPE (r = 0.22). (Unconditional/base rate bias and MPE were highly correlated (r = 0.96).) The generalizability of the present study is limited, of course, by the experimental nature of the project and the limited forecasting experience of the graduate students who served as participants. The data presented were relatively simple, linear-trended, nonseasonal time series with one level of noise (approximately 10% MAPE). Forecast environments typically are more complex than represented here and forecasters far more knowledgeable of their substantive domains. Although the cues constructed as the basis for this particular task appeared to account for much of the variability in participants' predictions, they should not be taken as the only useful way of constructing the events and processes of concern here. No amount of empirical support for this particular definition of cues can logically jeopardize the validity of alternative approaches to the judgmental extrapolation of time series data.
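For readers who want to connect the skill components discussed in this section, a standard skill-score decomposition in the spirit of the framework used by Stewart and Lusk (1994), whose exact formulation may differ in detail from the sketch given here, separates accuracy into correlational skill, conditional (regression) bias, and unconditional (base rate) bias:

SS = r_{fo}^2 - \left( r_{fo} - \frac{s_f}{s_o} \right)^2 - \left( \frac{\bar{f} - \bar{o}}{s_o} \right)^2

where r_{fo} is the correlation between forecasts and outcomes, s_f and s_o are their standard deviations, and \bar{f} and \bar{o} are their means. The second term is nonzero when the spread of the forecasts is not damped in proportion to r_{fo} (conditional/regression bias); the third term reflects a mismatch between the mean forecast and the mean outcome (unconditional/base rate bias).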
6. Conclusions

Forecasting error as measured by the MAPE statistic is decreased in large part by making predictions that reflect both knowledge and consistency. Knowledge and consistency are not fixed (or constants) for individual forecasters, however. Forecasting conditions clearly can be manipulated to influence the observed levels of knowledge and consistency, leading either to greater or lesser accuracy depending upon the situation. The present study suggests that making statistical information about long-term levels and trends in time series data explicitly available to forecasters will improve both knowledge and consistency and, therefore, significantly decrease error. This appears to occur because the forecaster is able to use these statistics as cues for a more appropriate anchoring and adjustment process. In actual forecasting situations, level and trend information might be provided,
as in this experiment, within a prominent "help" window superimposed on the graphical display of the time series data. When level and trend information is less directly available, forecasters appear to be led "to predictable biases through inappropriate selection of the anchor and insufficient adjustment" (Lawrence and O'Connor, 1992, p. 17). The evidence here clearly indicates that the tendency to over-rely on the last available data point can attenuate forecasting performance. In particular, forecasters appear to be more susceptible to a recency effect when forecasting for multiple periods, such as three steps ahead or more. If predictions are demanded for more distant time horizons, explicit long-term level and trend information, rather than the most recent data, should be made as salient as possible. Ample evidence has been compiled previously and supported by this study that providing an explicit decision rule does not enhance (and perhaps interferes with) problem solving and judgment making. The reasons for participants tending not to rely fully on such formal models in experimental settings may be the same as in organizational settings, especially a hesitancy to sacrifice personal discretion for a mechanistic formula. The decision rule offered in this experiment, if followed exactly, would have produced both maximum knowledge and consistency for one step ahead forecasts. Fewer than half of the participants receiving the decision rule for the first set of time series used it to near optimum levels (i.e. both G and R² > 0.95). Further research on the use of mechanical aids certainly is warranted. Perhaps individuals would rely more heavily on the recommendations of a decision rule if they were involved in the development and design of the aid. Participants who were provided with the decision rule in this study had neither explanation of its meaning nor justification of its value. Subsequent investigation could manipulate what participants are told about the aid's development and previous track record. Because decision rules have the potential to increase significantly knowledge and consistency in forecasting, attention to these issues is a critically important area for research. Certainly the conditional/regression bias also has an important influence on forecasting accuracy that was
not thoroughly explored in the present study. The type of overconfidence (or underconfidence) that affects conditional bias is pervasive and most extreme in tasks of greater cognitive difficulty (Lichtenstein et al., 1982). This bias was evident in the sizable variance observed in participants’ forecasts investigated here and proved to be a clear source of their forecasting error. In future studies, not only knowledge and consistency but conditional and unconditional bias, as well, should define a larger scope of work in establishing the relation of cognitive processes to forecasting outcomes.
References Arkes, H.R., Christensen, C., Lai, C., Blumer, C., 1987. Two methods of reducing overconfidence. Organizational Behavior and Human Decision Processes 39, 133–144. Arkes, H.R., Dawes, R.M., Christensen, C., 1986. Factors influencing the use of a decision rule in a probabilistic task. Organizational Behavior and Human Decision Processes 37, 93–110. Armstrong, J.S., 1985. Long Range Forecasting: From Crystal Ball to Computer, 2nd ed. Wiley, New York. Armstrong, J.S., 1986. The Ombudsman: Research of forecasting: A quarter century review 1960–1984. Interfaces 16, 89–109. Armstrong, J.S., Collopy, F., 1992. The selection of error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting 8, 69–80. Armstrong, J.S., Collopy, F., 1993. Causal forces: Structuring knowledge for time series extrapolation. Journal of Forecasting 12, 103–115. Ashley, R., 1983. On the usefulness of macroeconomic forecasts as inputs to forecasting models. Journal of Forecasting 2, 211–223. Ashton, R.H., 1992. Effects of justification and a mechanical aid on judgment performance. Organizational Behavior and Human Decision Processes 52, 292–306. Ashton, R.H., 1991. Pressure and performance in accounting decision settings: Paradoxical effects of incentives, feedback, and justification. Studies on Judgment Issues in Accounting and Auditing, Journal of Accounting Research 28, 148–180. Baron, J., 1994. Thinking and Deciding, 2nd ed. Cambridge University Press, New York. Bazerman, M.H., 1994. Judgment in Managerial Decision Making, 3rd ed. Wiley, New York. Beach, L.R., Barnes, V.E., Christensen-Szalanski, J.J., 1986. Beyond heuristics and biases: A contingency model of judgmental forecasting. Journal of Forecasting 5, 143–157. Bretschneider, S.I., Gorr, W.L., 1991. Economic, organizational, and political influences on biases in forecasting state tax receipts. International Journal of Forecasting 7, 457–466.
Bretschneider, S.I., Gorr, W.L., Grizzle, G., Klay, E., 1989. Political and organizational influence on the accuracy of forecasting state government revenues. International Journal of Forecasting 5, 307–319. Brunswik, E., 1956. Perception and the Representative Design of Psychological Experiments, 2nd ed. University of California Press, Berkeley. Bunn, D., Wright, G., 1991. Interaction of judgmental and statistical forecasting methods: Issues and analysis. Management Science 37, 501–518. Carbone, R., Anderson, A., Corriveau, Y., Corson, P.P., 1983. Comparing for different time series methods the value of technical expertise, individualized analysis, and judgmental adjustment. Management Science 29, 559–566. Castellan, N.J., 1973. Comments on the ‘‘lens model’’ equation and the analysis of multiple cue judgment tasks. Psychometrika 38, 87–100. Collopy, F., Armstrong, J.S., 1992. Rule-based forecasting: Development and validation of an expert systems approach to combining time series extrapolations. Management Science 38, 1394–1414. Cooksey, R. W., 1996. Judgment Analysis: Theory, Methods, and Applications. Academic Press, New York. Dalrymple, D.J., 1987. Sales forecasting practices: Results from a United States survey. International Journal of Forecasting 3, 379–391. Dawes, R.M., 1988. Rational Choice in an Uncertain World. Harcourt Brace Jovanovich, New York. Edmundson, R., Lawrence, M., O’Connor, M., 1988. The use of non-time series information in sales forecasting: A case study. Journal of Forecasting 7, 201–211. Eggleton, I.R.C., 1982. Intuitive time-series extrapolation. Journal of Accounting Research 20, 68–102. Fildes, R., 1992. The evaluation of extrapolative forecasting methods. International Journal of Forecasting 8, 81–111. Fildes, R., Ranyard, C., 1997. Success and survival of operational research groups—a review. Journal of the Operational Research Society 4–48, 336–361. Goodwin, P., Wright, G., 1993. Improving judgmental time series forecasting: A review of the guidance provided by research. International Journal of Forecasting 9, 147–161. Gorr, W.L., 1986. Use of special event data in government information systems. Public Administration Review 46, 532– 539. Hill, T., Marquez, L., O’Connor, M., Remus, W., 1994. Artificial neural network models for forecasting and decision making. International Journal of Forecasting 10, 5–15. Hogarth, R.M., Makridakis, S., 1981. Forecasting and planning: An evaluation. Management Science 27, 115–138. Kleinmuntz, B., 1990. Why we still use our heads instead of formulas: Towards an integrative approach. Psychological Bulletin 107, 296–310. Langer, E., 1975. The illusion of control. Journal of Personality and Social Psychology 32, 311–328. Lawrence, M., Edmundson, R., O’Connor, M., 1985. An examination of the accuracy of judgemental extrapolation of time series. International Journal of Forecasting 1, 25–35.
Lawrence, M., Edmundson, R., O’Connor, M., 1986. The accuracy of combining judgemental and statistical forecasts. Management Science 32, 1521–1532. Lawrence, M., Makridakis, S., 1989. Factors affecting judgmental forecasts and confidence intervals. Organizational Behavior and Human Decision Processes 43, 172–187. Lawrence, M., O’Connor, M., 1992. Exploring judgemental forecasting, International Journal of Forecasting 8, 15–26. Ledbetter, W.N., Cox, J.F., 1977. Are OR techniques being used? Industrial Engineering 9, 19–21. Lichtenstein, S., Fischhoff, B., Phillips, L.D., 1982. Calibration of probabilities: The state of the art to 1980. In: Kahneman, D., Slovic, P., Tversky, A. (Eds.), Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press, New York. Lim, J.S., O’Connor, M., 1995. Judgemental adjustment of initial forecasts: Its effectiveness and biases. Journal of Behavioral Decision Making 8, 149–168. Lobo, G.J., Nair, R.D., 1990. Combining judgmental and statistical forecasts: An application to earnings forecasts. Decision Sciences 21, 446–460. Makridakis, S., Andersen, A., Carbine, R., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen E., Winkler, R., 1984. The Forecasting Accuracy of Major Time Series Methods. Wiley, New York. Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T., Ord, K., Simmons, L., 1993. The M2-competition: A real time judgmentally based forecasting study. International Journal of Forecasting 9, 5–29. Mathews, B.P., Diamantopoulos, A., 1986. Managerial intervention in forecasting: An empirical investigation of forecast manipulation. International Journal of Research in Marketing 3, 3–10. Mathews, B.P., Diamantopoulos, A., 1989. Judgmental revision of sales forecasts: A longitudinal extension. Journal of Forecasting 8, 129–140. McNees, S.K., 1990. The role of judgment in macroeconomic forecasting accuracy. International Journal of Forecasting 6, 287–299. Meese, R., Geweke, J., 1984. A comparison of autoregressive univariate forecasting procedures for macroeconomic time series. Journal of Business and Economic Statistics 2, 191– 200. Mentzer, J.T., Cox, J.E., 1984. Familiarity, application and performance of sales forecasting techniques. Journal of Forecasting 3, 27–36.
O’Connor, M., Remus, W., Griggs, K., 1993. Judgmental forecasting in times of change. International Journal of Forecasting 9, 163–172. Powell, J.L., 1991. An attempt at increasing decision rule use in a judgment task. Organizational Behavior and Human Decision Processes 48, 89–99. Remus, W., O’Connor, M., Griggs, K., 1995. Does reliable information improve the accuracy of judgmental forecasts? International Journal of Forecasting 11, 285–293. Richardson, G.P., Rohrbaugh, J., 1990. Decision making in dynamic environments: Exploring judgments in a system dynamics model-based game. In: Borcherding, K., Larichev, O.I., Messick, D.M. (Eds.), Contemporary Issues in Decision Making. North-Holland, Amsterdam. Sanders, N.R., 1992. Accuracy and judgmental forecasts: A comparison. OMEGA: The International Journal of Management Science 20, 353–364. Sanders, N.R., Ritzman, L.P., 1992. The need for contextual and technical knowledge in judgmental forecasting. Journal of Behavioral Decision Making 5, 39–52. Stewart, T.R., 1976. Components of correlations and extensions of the lens model equation. Psychometrika 41, 101–120. Stewart, T.R., 1988. Judgment analysis: Procedures. In: Brehmer B., Joyce, C.R.B. (Eds.), Human Judgment: The SJT View. North-Holland, Amsterdam. Stewart, T.R., Lusk, C.M., 1994. Seven components of judgmental forecasting skill: Implications for research and the improvement of forecasts. Journal of Forecasting 13, 579–599. Tsay, R., 1988. Outliers, level shifts, and variance changes in time series. Journal of Forecasting 7, 1–20. Tucker, L.R., 1966. A suggested alternative formulation in the developments by Hursch, Hammond, and Hursch, and by Hammond, Hursch, and Todd. Psychological Review 71, 528– 530. Turban, E., 1972. A sample survey of operations research activities at the corporate level. Operations Research 20, 708– 721. Turner, D., 1990. The role of judgment in macroeconomic forecasting. Journal of Forecasting 9, 315–346. Willemain, T.R., 1991. The effect of graphical adjustment on forecast accuracy. International Journal of Forecasting 7, 151– 154. Wolfe, C., Flores, B., 1990. Judgmental adjustment of earnings forecasts. Journal of Forecasting 9, 389–405.