Two-Pass Cusum to Identify Age-Cluster Outbreaks

May 29, 2017 | Author: Ross Sparks | Category: Econometrics, Statistics, Public Health


Australian & New Zealand Journal of Statistics Aust. N. Z. J. Stat. 52(3), 2010, 245–260

doi: 10.1111/j.1467-842X.2010.00580.x

TWO-PASS CUSUM TO IDENTIFY AGE-CLUSTER OUTBREAKS

ROSS SPARKS

CSIRO Mathematics, Informatics & Statistics

Summary

The paper introduces a two-pass adaptive cumulative sum (CUSUM) statistic to identify age clusters (age grouping) that significantly contribute to epidemics or unusually high counts. If epidemiologists know that an epidemic is confined to a narrow age group, then this information not only makes it clear where to target the epidemiological effort but also helps them decide whether to respond. It is much easier to control an epidemic that starts in a narrow age range of the population, such as pre-school children, than an epidemic that is not confined demographically or geographically.

Key words: describing outbreaks; monitoring; Poisson counts; public health; surveillance.

1. Introduction

In public health surveillance, epidemics are generally observed to start in specific locations such as schools, where disease impacts a narrow age group. To detect and combat epidemics, we need tools and graphical techniques to describe the nature of outbreaks and how they move with time. For example, we need tools to determine: (i) who the epidemic is attacking – which age groups, which sex, which ethnic groups; (ii) where the outbreak is located – which hospitals, which locations; (iii) what adverse effects the epidemic has on those who contract the disease.

Wartenberg (2001) indicated that most people would agree that cases of disease often cluster. He recommended using a method for detecting clusters that is efficient for expected clustering behaviour in preference to trying a sequence of untargeted methods. Fleming, Ducatman & Shatlat (1992) discussed clusters that are workplace-related. The current literature has a focus on detecting geographical clusters (e.g. Elliott 1995; Elliott, Martuzzi & Shaddick 1995; Elliott & Wakefield 2001; Kulldorff 2001). This paper, however, provides a methodology that helps epidemiologists identify whether epidemics involve an age-group cluster, by identifying which age groups contribute significantly to signalled epidemics. Hospital data can be analysed separately to determine whether the age influence on epidemics is significantly different across hospitals. This amounts to assessing whether the epidemic is behaving similarly at all hospitals.

Konty & Farzad Mostashari (2007) and Gould, Wallenstein & Kleinman (1990) used the scan statistic with age as the one-dimensional spatial variable to detect clusters. Woodall et al. (2008), however, pointed out the advantages of using cumulative sum (CUSUM) plans in preference to scan statistics for flagging unusual counts, and therefore this paper bases its identification technology on the CUSUM statistic.

∗ Author to whom correspondence should be addressed.

CSIRO Mathematics, Informatics & Statistics, Locked Bag 17, North Ryde, NSW 1670, Australia. e-mail: [email protected]

Acknowledgements. The author would like to thank Rob McGregor from ClearInfo and the journal editors for their editorial help with this paper.

© 2010 Australian Statistical Publishing Association Inc. Published by Blackwell Publishing Asia Pty Ltd.


Section 2 of the paper discusses the distributional assumptions made for disease incidents. Subsequent sections propose the use of a two-pass CUSUM plan for identifying age clusters. Most epidemics start in population clusters, and their early detection is essential for optimal control. In this paper, ‘clusters’ are regarded as instances where counts are significantly higher than expected. Inference on the significance of clusters is difficult and plagued by multiple comparisons. The proposed two-pass CUSUM plan has the additional advantage of obviating such corrections. The method is designed to reliably detect the epicentre (defined as the middle age group of the cluster) of an age-related outbreak. Simulations are carried out to determine the properties of the two-pass CUSUM plan. The paper ends by making recommendations about the broader applicability of the method and its potential as a stand-alone surveillance tool where diseases are known to often start in age clusters.

2. Distributional assumptions

Assume that populations are locally stationary and that disease incidences for people of age $j$ (denoted $Y_{j,t}$) have a mean incidence given by $\lambda_{j,t}$ for day $t$. Counts $Y_{j,t}$ are assumed to be Poisson-distributed. In addition, assume that the summation of these counts over several ages also follows a Poisson distribution with mean equal to the summation of the $\lambda_{j,t}$s. In other words, we assume that the means are correlated across ages and across time within ages, but that the departures of counts from these means are independently distributed. If the epidemic has a seasonal influence, then the means may differ for each day according to the season of the year, but the mean for each time period is just the equivalent sum of the daily means. For example, if the total incidence, $Y_t$, on day $t$ is found by summing across all $g$ ages involving independent Poisson random variables $Y_{j,t}$, then it has a mean of $\lambda_t = \sum_{j=1}^{g} \lambda_{j,t}$. The probability that $Y_t = y$ is

$$\Pr(Y_t = y \mid \lambda_t) = e^{-\lambda_t} \lambda_t^{y} / y!, \quad y = 0, 1, 2, \ldots.$$
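The additivity assumption above can be checked numerically. The following is a minimal sketch, not part of the paper: it convolves the probability functions of two independent Poisson counts and compares the result with a single Poisson probability function whose mean is the sum of the two means. The function names and the two age-specific means (0.4 and 1.1) are hypothetical values chosen only for illustration.

```python
import math

def poisson_pmf(y, lam):
    """Pr(Y = y) for Y ~ Poisson(lam), evaluated on the log scale for stability."""
    if lam == 0:
        return 1.0 if y == 0 else 0.0
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def pmf_of_sum(y, lam_a, lam_b):
    """Pr(A + B = y) for independent Poisson A and B, by direct convolution."""
    return sum(poisson_pmf(i, lam_a) * poisson_pmf(y - i, lam_b) for i in range(y + 1))

# Hypothetical daily means for two neighbouring ages (illustrative values only).
lam_a, lam_b = 0.4, 1.1

# Largest discrepancy between the convolution and Poisson(lam_a + lam_b).
max_gap = max(
    abs(pmf_of_sum(y, lam_a, lam_b) - poisson_pmf(y, lam_a + lam_b))
    for y in range(20)
)
print(max_gap)
```

The discrepancy is at the level of floating-point rounding, which is what the assumption of independent Poisson departures implies for counts aggregated over ages.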

If $Y_t$ exceeds a certain threshold value, then it signals an epidemic. The c control chart (the Shewhart chart for Poisson counts; see Shewhart 1939) is traditionally used for this purpose. An alternative is to use the CUSUM charts (Lucas 1985) for homogeneous Poisson counts. Epidemics could involve an age cluster and not influence other age groups. The paper now describes the nature of such age-clustered epidemics.

3. Identifying age clusters that significantly contribute to an epidemic

For populations with 80 potential age groups there are 80 × 81/2 = 3240 sets of ‘neighbouring’ ages as potential clusters, that is, 80 of size 1, 79 of size 2, 78 of size 3, . . . , 1 of size 80. This number of multiple comparisons for identifying significant age clusters demands a correction for multiple testing. Bonferroni corrections result in conservative inferences. In addition, disaggregating comparisons to the lowest level of age reduces the potential to detect epidemics spanning several age groups. A sequential testing approach based on Page (1954)’s CUSUM over age groups in sequence is an alternative plan that does not suffer the ‘conservative’ tag.
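To make the size of the multiple-comparison problem concrete, the sketch below enumerates every run of neighbouring age groups and shows the per-cluster significance level that a Bonferroni correction would impose. The family-wise level of 0.05 and the function name are assumptions made for illustration; the paper does not specify them.

```python
def contiguous_clusters(g):
    """All runs of neighbouring age groups as (first, last), 1-based and inclusive."""
    return [(first, last) for first in range(1, g + 1) for last in range(first, g + 1)]

g = 80
clusters = contiguous_clusters(g)
n_tests = len(clusters)               # equals g * (g + 1) // 2

alpha = 0.05                          # hypothetical family-wise level
bonferroni_level = alpha / n_tests    # level each cluster test would have to meet

print(n_tests, bonferroni_level)
```

With 3240 candidate clusters, each individual test would need a level of roughly 0.000015, which illustrates why the correction is described as conservative.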


4. The CUSUM and two-pass CUSUM clustering algorithm

The CUSUM approach can detect whether one or more ages contribute significantly to an epidemic. The CUSUM statistic (Lucas 1985) used for homogeneous Poisson counts (i.e. $\lambda_t = \lambda$ for all $t$) is $S_t = \max(0, S_{t-1} + Y_t - k)$, where

$$k = \frac{(c - 1)\lambda}{\log(c\lambda) - \log(\lambda)} = \frac{(c - 1)\lambda}{\log(c)} \qquad (1)$$

for some suitable constant $c > 1$. This CUSUM plan is designed to be optimal for detecting changes in mean from $\lambda$ to $c\lambda$ if the change always starts when the CUSUM statistic is zero (usually referred to as the ‘zero state’). Traditionally, the CUSUM statistic signals whenever it exceeds a threshold. Moreover, this threshold is designed to have an acceptable or manageable false alarm rate. That is, a signal is given whenever $S_t > h(\lambda, c)$, where $h(\lambda, c)$ is the appropriate threshold for a target false alarm rate.

For detecting age clusters, the CUSUM summation is across the ages to flag unusual age-cluster outbreaks. Means are not homogeneous across ages. Therefore an adaptive version of the above CUSUM statistic is needed; that is, a plan that adjusts for the changing means across age groups. Let $Y_{q,t}$ be the count for the $q$th age group on day $t$ with respective mean $\lambda_{q,t}$. Then the adaptive CUSUM (ACUSUM) is achieved by cumulatively summing standardized scores as follows:

$$AS_{q,t} = \max(0, AS_{q,t-1} + (Y_{q,t} - k_{q,t})/h(\lambda_{q,t}, c)),$$

and signalling whenever $AS_{q,t} \ge 1$ (see Sparks et al. 2010). Note that $k_{q,t} = (c - 1)\lambda_{q,t}/\log(c)$. If $h(\lambda_{q,t}, c)$ values are selected to give a specified false alarm rate when means are treated as homogeneous, then the ACUSUM plan mentioned above will have approximately the same false alarm rate as the specified false alarm rate.

Usually the CUSUM statistic is reset to zero after a signal. In this paper, however, we threshold the $AS_{q,t}$ statistic at 1 after a signal. This means that the next age group after a signal also signals if and only if $Y_{q,t} \ge k_{q,t}$. Thus it is easier for age groups to signal directly after $AS_{q,t} \ge 1$. However, using the CUSUM statistic, the clustering of neighbouring ages in the direction of increasing ages causes cluster boundaries to be biased on the high side.
To counter this, we apply the CUSUM in both directions (the directions of increasing and decreasing ages), and only cluster ages if the CUSUM exceeds its threshold in both directions. In other words, the forward CUSUM is defined by $F_{0,t} = 0$ and

$$F_{q,t} = \max(0, F_{q-1,t} + (Y_{q,t} - k_{q,t})/h(\lambda_{q,t}, c)), \quad q = 1, 2, \ldots, g.$$

Similarly, the backward CUSUM is defined by $B_{g+1,t} = 0$ and

$$B_{q,t} = \max(0, B_{q+1,t} + (Y_{q,t} - k_{q,t})/h(\lambda_{q,t}, c)), \quad q = g, g - 1, \ldots, 1.$$
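The reference value in (1) and the standardized increment shared by the forward and backward recursions can be sketched as follows. The reference value k is the count at which the Poisson log-likelihood ratio of mean cλ against mean λ, namely y log(c) − (c − 1)λ, changes sign, so a count adds to the CUSUM exactly when it favours the larger mean. The function names are hypothetical, and the threshold h is passed in as a plain number because the paper's h(λ, c) values come from its Appendix A, which is not reproduced here.

```python
import math

def reference_value(lam, c):
    """Reference value k = (c - 1) * lam / log(c) from equation (1); requires c > 1."""
    if c <= 1:
        raise ValueError("c must exceed 1")
    return (c - 1.0) * lam / math.log(c)

def log_likelihood_ratio(y, lam, c):
    """Poisson log-likelihood ratio of mean c * lam against mean lam for one count y."""
    return y * math.log(c) - (c - 1.0) * lam

def standardized_increment(y, lam, c, h):
    """The term (y - k) / h added at each step; h is supplied by the caller."""
    return (y - reference_value(lam, c)) / h

# Illustrative values only: mean 2.0, c = 1.5, and a hypothetical threshold of 6.2.
lam, c, h = 2.0, 1.5, 6.2
k = reference_value(lam, c)
print(round(k, 3), round(standardized_increment(7, lam, c, h), 3))
```

For any c > 1 the reference value lies strictly between λ and cλ, so counts near the in-control mean pull the statistic down while counts near the shifted mean push it up.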


TABLE 1
A simple example demonstrating the application of the two-pass CUSUM

Columns: age group ($q$); count $y_{q,t}$; expected age-group count $\lambda_{q,t}$; $k_{q,t}$; $h(\lambda_{q,t}, c)$; $F_{q,t}$; $B_{q,t}$

q        y    λ     k     h     F_{q,t}                        B_{q,t}
1        3    3.5   4.9   6.9   0.00                           0.00
2        6    3.5   4.9   6.9   (6−4.9)/6.9 = 0.16             (6−4.9)/6.9 = 0.16
3        2    3.0   4.2   6.7   0.00                           0.00
4        4    3.0   4.2   6.7   0.00                           0.28 + (4−4.2)/6.7 = 0.25
5        3    3.0   4.2   6.7   0.00                           0.45 + (3−4.2)/6.7 = 0.28
6        1    3.0   4.2   6.7   0.00                           0.93 + (1−4.2)/6.7 = 0.45
7        3    2.5   3.5   6.4   0.00                           1.01 + (3−3.5)/6.4 = 0.93
8        4    2.5   3.5   6.4   (4−3.5)/6.4 = 0.07             0.93 + (4−3.5)/6.4 > 1.01
9        3    2.5   3.5   6.4   0.00                           1.01 + (3−3.5)/6.4 = 0.93
10       8    2.5   3.5   6.4   (8−3.5)/6.4 = 0.70             1.01 + (8−3.5)/6.4 > 1.01
11       4    2.0   2.8   6.2   0.7 + (4−2.8)/6.2 = 0.90       1.01 + (4−2.8)/6.2 > 1.01
12       7    2.0   2.8   6.2   0.9 + (7−2.8)/6.2 > 1.01       0.39 + (7−2.8)/6.2 > 1.01
13       3    2.0   2.8   6.2   1.01 + (3−2.8)/6.2 > 1.01      0.36 + (3−2.8)/6.2 = 0.39
14       3    2.0   2.8   6.2   1.01 + (3−2.8)/6.2 > 1.01      0.33 + (3−2.8)/6.2 = 0.36
15       4    1.5   2.1   5.8   1.01 + (4−2.1)/5.8 > 1.01      0.00 + (4−2.1)/5.8 = 0.33
16       1    1.5   2.1   5.8   1.01 + (1−2.1)/5.8 = 0.82      0.00
17       0    1.5   2.1   5.8   0.82 + (0−2.1)/5.8 = 0.46      0.00
18       0    1.5   2.1   5.8   0.46 + (0−2.1)/5.8 = 0.10      0.12 + (0−2.1)/5.8 < 0.00
19       2    1.0   1.4   5.2   0.10 + (2−1.4)/5.2 = 0.21      (2−1.4)/5.2 = 0.12
g = 20   1    1.0   1.4   5.2   0.21 + (1−1.4)/5.2 = 0.13      0.00
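The forward and backward recursions can be sketched in a few lines and run on the Table 1 values. This is an illustrative reading of the method rather than the author's code: the function name is hypothetical, and holding the statistic at 1.01 after a signal is an assumption taken from the entries of Table 1 (the text describes thresholding at 1). The counts, reference values and thresholds are copied from the table.

```python
def two_pass_cusum(counts, k, h, cap=1.01):
    """Forward and backward adaptive CUSUMs over the age groups of one day.

    cap is an assumption read off Table 1: once the statistic exceeds 1 it is
    held at 1.01 instead of being reset to zero.
    """
    g = len(counts)
    forward, backward = [0.0] * g, [0.0] * g
    previous = 0.0                                   # F_{0,t} = 0
    for q in range(g):                               # direction of increasing age
        previous = min(cap, max(0.0, previous + (counts[q] - k[q]) / h[q]))
        forward[q] = previous
    previous = 0.0                                   # B_{g+1,t} = 0
    for q in reversed(range(g)):                     # direction of decreasing age
        previous = min(cap, max(0.0, previous + (counts[q] - k[q]) / h[q]))
        backward[q] = previous
    return forward, backward

# Values copied from Table 1 (age groups 1 to 20).
counts = [3, 6, 2, 4, 3, 1, 3, 4, 3, 8, 4, 7, 3, 3, 4, 1, 0, 0, 2, 1]
k = [4.9] * 2 + [4.2] * 4 + [3.5] * 4 + [2.8] * 4 + [2.1] * 4 + [1.4] * 2
h = [6.9] * 2 + [6.7] * 4 + [6.4] * 4 + [6.2] * 4 + [5.8] * 4 + [5.2] * 2

forward, backward = two_pass_cusum(counts, k, h)
forward_flags = [q + 1 for q, f in enumerate(forward) if f > 1]
backward_flags = [q + 1 for q, b in enumerate(backward) if b > 1]
cluster = [q for q in forward_flags if q in backward_flags]

print(forward_flags, backward_flags, cluster)
# prints [12, 13, 14, 15] [8, 10, 11, 12] [12]
```

Only age group 12 exceeds 1 in both directions, so it is the single group flagged as part of an age cluster in this example.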

For a given threshold value $h(\lambda_{q,t}, c)$, a signal is given whenever both $F_{q,t} > 1$ and $B_{q,t} > 1$. Table 1 demonstrates the application of the two-pass CUSUM plan for a simple example with $g = 20$. The age-group counts are, in order, 3, 6, . . . , 1, with descending means 3.5, 3.5, . . . , 1, respectively. Assume that their respective threshold values are $h(\lambda_{q,t}, c)$ = 6.9, 6.9, . . . , 5.2 for $q = 1, 2, \ldots, 20$. The forward CUSUM starts with $F_{0,t} = 0$, and then calculates in sequence $F_{1,t} = \max(0, (3 - 4.9)/6.9) = 0$, $F_{2,t} = \max(0, 0 + (6 - 4.9)/6.9) = 0.16$, and so on. The forward CUSUM flags a potential clustered outbreak for $12 \le q \le 15$ (i.e. the $q$-values for which $F_{q,t} > 1$). The backward CUSUM starts calculating the CUSUM from $q = 20$; that is, $B_{20,t} = \max(0, 0 + (1 - 1.4)/5.2) = 0$, $B_{19,t} = \max(0, 0 + (2 - 1.4)/5.2) = 0.12$, $B_{18,t} = \max(0, 0.12 + (0 - 2.1)/5.8) = 0$, and so on. Note that $B_{q,t} > 1$ for $q$ = 8, 10, 11, 12. Because $F_{12,t} > 1$ and $B_{12,t} > 1$, this signals an age-clustered outbreak.

When an epidemic exists, the two-pass CUSUM approach may not specify the tails of a cluster correctly (as illustrated later by simulation), but it has the advantage of better flagging the epicentre of the age cluster. Therefore, the two-pass CUSUM helps to focus the epidemiological effort at the appropriate age-cluster epicentre.

An alternative to the ACUSUM plan described in this section is first to transform the Poisson counts to standard normal deviates using $Z_{q,t} = 2(\sqrt{Y_{q,t}} - \sqrt{\lambda_{q,t}})$ (Rossi, Lampugnani & Marchi 1999) and then to apply the classical CUSUM of Page (1954) to these standard normal deviates. The potential advantage of this approach is that it removes the need to adjust for heterogeneous means. The CUSUM of the $Z_{q,t}$ plan is designed assuming normality of $Z_{q,t}$. However, as this normality approximation is not good for low means, this plan is


inappropriate whenever low means are encountered, whereas the adaptive plan is applicable for all non-homogeneous Poisson counts. Applications typically have low means.

5. Application of the two-pass CUSUM cluster algorithm

To illustrate the application of the algorithm, we consider three examples, relating to workplace accidents, road crashes and disease epidemics.

(i) Workplace accidents: Accidents and injuries in the workplaces of large enterprises over a specified time period (say weekly or monthly) are often monitored. The working life within Australia ranges approximately from 18 to 65. Allowing the grouping of ages in the tails of the distribution leaves roughly 40 age groups. Identifying age-related unusual increases in accidents may be helpful in targeting the safety-training effort designed to improve workplace safety.

(ii) Road crashes: Daily counts of road crashes are monitored by the Roads and Traffic Authority in New South Wales, Australia. Drivers can obtain their driving license at 17 years of age and drive until about 85 without being re-tested. Allowing the grouping of ages in the tails of the distribution leaves roughly 60 age groups. Identifying age-related unusual increases in accidents may be helpful in targeting the road safety promotional effort.

(iii) Epidemics and public health problems: Although ages range from 0 to 100 years old, the tail distributions of these ages can be grouped into age groups. Identifying age-related epidemics is useful for their control, using either containment or targeted immunization programs.

5.1. Selecting the appropriate value for c

The value of c in (1) regulates the memory of counts in the CUSUM plan, with higher values of c retaining less memory. This section examines the influence c has on inference. The details are reported in Sparks (2009) but summarized here. The value of c is selected based on two performance criteria. The first is the chance of detecting any part of the cluster when it exists.
The second is identifying all ages in the cluster. In Sparks (2009), known epidemics were simulated for 40, 60 and 80 age groupings with clusters spanning over three, five, seven and nine age groups. The detection properties for c = 1.25, 1.35, 1.40, 1.45, 1.50, 1.55 and 1.6 were assessed. The detailed results are reported in Sparks (2009). In summary, the results indicated that c = 1.40, 1.45 or 1.50 produced the highest detection probabilities for nearly all epidemics spanning three, five, seven and nine age groups. This result was independent of the size of the generated epidemic. The differences between the performances of c = 1.40, 1.45 and 1.50 were found to be small in most cases. For clusters spanning seven, nine or 10 age groups, c = 1.45 produced the highest detection probabilities for smaller increases in means, while c = 1.5 produced the highest detection probabilities for larger increases in means or for clusters spanning fewer than six age groups. The lower values of c were more likely to overspecify the epidemic age cluster, but had a greater chance of correctly identifying all ages in the cluster. Therefore a value of c = 1.4 is advocated for fully specifying the epidemic, but better overall detection properties are realized by selecting c = 1.5.


5.2. False alarm properties of the two-pass CUSUM approach when selecting the CUSUM plan to have an in-control average run length (ARL) of 100

Although the detection properties of the CUSUM plan are known, the detection properties for the two-pass CUSUM are unknown. Simulations were used to explore these properties. Means used in the simulation are within the range encountered in health and workplace safety examples (e.g. means seldom go above three per day within age groups and within emergency departments during non-epidemic periods). Tables 2 and 3 summarize the respective estimated probabilities of detecting no cluster for randomly generated in-control counts (i.e. with no outbreak). The simulated examples are described in the first column of Table 2. Notice that the specificity (the probability that no cluster is detected when none exists) drops slightly as c increases. The CUSUM plan for public health applications (i.e. total age groups equal 80) can almost guarantee a specificity of greater than or equal to 0.9 (level of significance of at most approximately 10% – see Table 3). The CUSUM plan for workplace safety applications (i.e. 40 age groups) can guarantee a specificity of greater than or equal to 0.95 (level of significance of at most approximately 5% – see Table 2). The main advantage of the ACUSUM plan is that its specificity is nearly invariant to the values of c, the mean

TABLE 2
Specificity (the probability that no cluster is detected when none exists) – 40 age groups expected in accident monitoring in the workforce

Age-group means were generated as random samples from a uniform distribution on the following interval.

Interval                                   Specificity
                                           c = 1.25   c = 1.35   c = 1.375   c = 1.45
(0,3) sorted in ascending order            0.951      0.945      0.966       0.960
(0,1) sorted in ascending order            0.962      0.960      0.972       0.965
(0.25,1.25) sorted in ascending order      0.967      0.966      0.973       0.967
(0.5,1.5) sorted in ascending order        0.968      0.963      0.976       0.971
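The kind of simulation behind these specificity estimates can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's simulation: the threshold is a single hypothetical constant (h = 6.0) rather than the h(λ, c) values of Appendix A, the vector of means is drawn once, the statistic is held at 1.01 after a signal (as in Table 1), and all function names are invented for the example. The estimate it prints is therefore not expected to match the tabulated figures.

```python
import math
import random

def poisson_draw(rng, lam):
    """One Poisson(lam) count by Knuth's multiplication method (adequate for small lam)."""
    limit, count, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def signalled_groups(counts, means, c, h, cap=1.01):
    """0-based indices where both the forward and the backward CUSUM exceed 1."""
    scale = (c - 1.0) / math.log(c)                  # k = scale * mean, as in (1)
    steps = [(y - scale * m) / h for y, m in zip(counts, means)]

    def one_pass(sequence):
        value, flags = 0.0, []
        for step in sequence:
            value = min(cap, max(0.0, value + step))
            flags.append(value > 1.0)
        return flags

    forward = one_pass(steps)
    backward = one_pass(steps[::-1])[::-1]
    return [q for q in range(len(steps)) if forward[q] and backward[q]]

def estimate_specificity(n_groups, high, c, h, n_days, seed):
    """Share of simulated in-control days on which no age group is flagged."""
    rng = random.Random(seed)
    means = sorted(rng.uniform(0.0, high) for _ in range(n_groups))
    quiet_days = 0
    for _ in range(n_days):
        counts = [poisson_draw(rng, m) for m in means]
        if not signalled_groups(counts, means, c, h):
            quiet_days += 1
    return quiet_days / n_days

# 40 age groups, means from U(0, 3) sorted ascending, hypothetical constant h.
specificity = estimate_specificity(n_groups=40, high=3.0, c=1.45, h=6.0, n_days=2000, seed=1)
print(round(specificity, 3))
```

Replacing the constant threshold by mean-dependent values and repeating the run for several values of c would give a table of the same shape as Tables 2 and 3.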

TABLE 3
Specificity (the probability that no cluster is detected when none exist) – 80 age groups expected in public health surveillance

Age-group means were generated as random samples from a uniform distribution on the following interval.

Interval                                                                               Specificity
                                                                                       c = 1.25   c = 1.35   c = 1.375   c = 1.45
(0,1) sorted in ascending order                                                        0.900      0.898      0.897       0.895
(0.25,1.25) sorted in ascending order                                                  0.909      0.909      0.906       0.903
(0.5,1.5) sorted in ascending order                                                    0.917      0.914      0.913       0.912
(0,1) sorted to have higher means in the tails and lower values in the centre          0.905      0.896      0.895       0.895
(0.25,1.25) sorted to have higher means in the tails and lower values in the centre    0.904      0.903      0.900       0.895
(0.5,1.5) sorted to have higher means in the tails and lower values in the centre      0.917      0.901      0.900       0.904
(0,2) sorted in ascending order                                                        0.915      0.910      0.912       0.905
(0,3) sorted in ascending order                                                        0.900      0.901      0.899       0.895


TABLE 4
Summary of the characteristics of the simulated outbreaks, with links to the appropriate figure in the paper

In-control counts: using 40 age groups with age-group mean counts randomly generated from a U(0, 3) distribution, and sorted in ascending order so that the lower age groups have lower means.

Figure   Epidemic at age-group cluster   Means for additional counts at the respective age groups in the cluster (increase in means)   Average additional number of incidents
1a       1 to 8                          0.5, 1.1, 1.7, 2.3, 3.0, 3.6, 4.1, 4.7                                                         21
1a       1 to 6                          1, 1.7, 2.6, 3.6, 4.6, 5.5                                                                     19
1a       4 to 6                          3.4, 4.3, 5.3                                                                                  13
1b       17 to 23                        3.2, 3.3, 3.4, 3.5, 3.7, 3.9                                                                   21
1b       18 to 22                        2.4, 3.0, 4.0, 4.6, 5                                                                          19
1b       19 to 21                        4.2, 4.4, 4.4                                                                                  13
1c       36 to 40                        4.0, 4.1, 4.2, 4.3, 4.4                                                                        21
1c       37 to 40                        4.6, 4.7, 4.8, 4.9                                                                             19
1c       37 to 39                        4.2, 4.4, 4.4                                                                                  13

In-control counts: using 80 age groups with means on the uniform interval [0,2] sorted in ascending order.

Figure   Epidemic at age-group cluster   Means for additional counts at the respective age groups in the cluster (increase in means)   Average additional number of incidents
2a       1 to 10                         0.4, 0.9, 1.3, 1.7, 2.2, 2.6, 3.1, 3.5, 3.9, 4.4                                               24
2b       37 to 43                        3.2, 3.3, 3.3, 3.4, 3.5, 3.6, 3.7                                                              24
2c       77 to 80                        5.9, 6.0, 6.0, 6.1                                                                             24

distribution and the order of the means generated from this distribution. The effect of a vector of means with larger values is investigated later in the paper.

5.3. The potential to identify outbreaks caused by known age clusters (sensitivity)

The outbreak scenarios were simulated by generating additional counts for a cluster of neighbouring age groups and adding them to the respective in-control age-group counts. Several examples of known outbreaks were simulated as described in Table 4, and then the performance of the two-pass CUSUM plan for detecting these outbreaks was examined. Figure 1 shows results for 40 age-group problems, with the means generated as random values from the U(0, 3) distribution; lower age groups were assigned lower means. Figure 2 displays problems for 80 age groups, with the means randomly generated from the U(0, 2) distribution; lower age groups were assigned lower means. The various scenarios used for generating outbreaks are summarized in the 2nd, 3rd and 4th columns of Table 4, and the performance of the two-pass CUSUM for detecting these clusters is reported in Figures 1 and 2.

In Figure 1, an epidemic with on average 22, 19 and 13 additional incidents resulted in having on average 82, 79 and 73 incidents in a day relative to the expected value (mean) of 60 incidents. The probability of having at least 82, 79 and 73 incidents, that is, $\Pr(Y_t \ge y \mid \lambda = 60)$, equals 0.004, 0.008 and 0.044 for $y$ = 82, 79 and 73, respectively. The simulations gave the following results.


Figure 1. Probability of the two-pass CUSUM plan signalling at specific age groups for the 40 age-group example. The dashed vertical lines indicate the boundary of the cluster. [Panels: (a) clusters at age groups 1 to 8, 1 to 6 and 4 to 6; (b) clusters at age groups 17 to 23, 18 to 22 and 19 to 21; (c) clusters at age groups 36 to 40, 37 to 40 and 37 to 39. Horizontal axis: age group sequence number; vertical axis: estimated probability of signalling.]

(i) Figure 1(a), (b) and (c) indicates that age clusters with on average fewer additional incidents in a narrower age cluster can have similar chances of being detected by the two-pass CUSUM plan. For example, in Figure 1(b), consider epidemic age-clustering groups 18 to 22 with an average of 79 total incidents versus epidemic age groups 17 to 23 with an average of 82 total incidents: both clusters have the same epicentre of 20, but the 79 total incidents with age cluster 18 to 22 have a roughly equal chance of detection at the epicentre as the 82 total-incidents example.

(ii) Age-group false signals (signalling an age group to be part of a cluster when it is not) are more likely for ages near the boundary of the cluster than for those further away (e.g. in Figure 2(c), age 76 on the boundary of the cluster falsely signals that it is part of the cluster more often than age groups further from the cluster).


Figure 2. Probability of the two-pass CUSUM plan signalling at specific age groups for the 80 age-group example. The dashed vertical lines indicate the boundary of the cluster. [Panels (a), (b) and (c). Horizontal axis: age group sequence number; vertical axis: probability of signalling.]

(iii) Outbreaks are generally less likely to signal at the edge of a cluster (e.g. in Figure 2(a), age group 10 signals less often than age group 9).

(iv) Outbreaks with the same average increase in counts occurring in the same number of age groups are more likely to signal if they have lower means. (For example, clusters over three age groups with on average 13 additional incidents in Figure 1(a) with means near zero are more likely to signal than in Figure 1(c) where means are closer to 3. This can be observed by comparing the probability profiles given by the line of dots in the two figures.)

(v) Outbreaks with the same expected total number of additional incidents (i.e. 24 in Figure 2) are more likely to signal when age-group means are lower, particularly when they are spread over fewer ages (e.g. in Figure 2(a), the cluster in age groups 1 to 10 with lower expected means nearly always signals in the middle but, in Figure 2(b), outbreaks for age groups 37 to 43 with higher means are less likely to signal).


TABLE 5
The specificity for the two-pass CUSUM plan for various settings of the CUSUM’s average run length (ARL) and c

In-control ARL       Specificity for the           Specificity for the           Specificity for the
of the CUSUM plan    40 age-group problem          60 age-group problem          80 age-group problem
                     c = 1.4  c = 1.45  c = 1.5    c = 1.4  c = 1.45  c = 1.5    c = 1.4  c = 1.45  c = 1.5
75                   0.920    0.913     0.909      0.887    0.870     0.857      0.832    0.823     0.807
87                   0.943    0.937     0.928      0.905    0.896     0.889      0.870    0.861     0.847
93                   0.948    0.944     0.938      0.913    0.907     0.898      0.882    0.974     0.868
100                  0.962    0.960     0.958      0.933    0.932     0.930      0.896    0.895     0.893
125                  0.969    0.967     0.966      0.952    0.948     0.942      0.925    0.921     0.915
150                  0.979    0.977     0.972      0.964    0.960     0.957      0.950    0.946     0.940
175                  0.983    0.972     0.978      0.974    0.970     0.967      0.960    0.958     0.952
200                  0.987    0.985     0.981      0.979    0.977     0.974      0.968    0.966     0.961
300                  0.994    0.993     0.992      0.989    0.988     0.987      0.983    0.983     0.982
400                  0.997    0.996     0.995      0.993    0.992     0.991      0.991    0.990     0.989
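Table 5 can be read as a design lookup: for a chosen c and a target specificity, pick the smallest tabulated in-control ARL that reaches the target. The sketch below does this for the 40 age-group columns only, with the values copied from the table; the function and variable names are hypothetical, and no interpolation between tabulated ARLs is attempted.

```python
ARLS = [75, 87, 93, 100, 125, 150, 175, 200, 300, 400]

# Specificities for the 40 age-group problem, copied from Table 5 (keyed by c).
SPECIFICITY_40 = {
    1.40: [0.920, 0.943, 0.948, 0.962, 0.969, 0.979, 0.983, 0.987, 0.994, 0.997],
    1.45: [0.913, 0.937, 0.944, 0.960, 0.967, 0.977, 0.972, 0.985, 0.993, 0.996],
    1.50: [0.909, 0.928, 0.938, 0.958, 0.966, 0.972, 0.978, 0.981, 0.992, 0.995],
}

def smallest_arl(table, c, target):
    """Smallest tabulated in-control ARL whose specificity reaches target, else None."""
    for arl, spec in zip(ARLS, table[c]):
        if spec >= target:
            return arl
    return None

for target in (0.90, 0.95, 0.99):
    print(target, smallest_arl(SPECIFICITY_40, 1.40, target))
```

For c = 1.4 this selects in-control ARLs of 75, 100 and 300 for approximate tests at the 10%, 5% and 1% levels, respectively.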

6. Properties of the two-pass CUSUM

We can now apply the two-pass CUSUM to investigate the nature of false alarm rates. Sparks (2009) investigated the relative frequency for each cluster size for false alarms. The most frequent false alarm cluster by far was the one age-group ‘cluster’.

6.1. The relationship between in-control ARL of CUSUM and the specificity of the two-pass CUSUM

Practitioners want the ability to specify their own level of significance for inference. To allow this capability, the relationship between the level of significance and in-control ARL is needed. Table 5 provides practitioners with the capability to design their own plans. Table 5 allows the reader to select c and in-control ARL to deliver plans with levels of significance equal to 10%, 5% and 1%. The bold specificities in Table 5 that are close to 0.9, 0.95 and 0.99 correspond to approximate tests of significance of 10%, 5% and 1%, respectively. Appendix A provides control limits $h(\lambda_{q,t}, c)$ for the two-pass CUSUM that lead to tests with the appropriate level of significance.

7. Example of application

Individual-record health data were difficult to come by because of privacy issues, so we have used car crash data to demonstrate the method. The application considers accidents in New South Wales, Australia from 2000 to 2004 using data from the Traffic Accident Database System (TADS) collated by the NSW Roads and Traffic Authority. Data are collected from police reports for all accidents in which the accident resulted in either death or injury, at least one vehicle had to be towed away, at least one driver was reported as under the influence of alcohol, or at least $500 worth of damage to property was attributed to the movement of a vehicle on the road. The daily counts over the five-year period are displayed in Figure 3.


Figure 3. Daily counts for vehicle crashes in New South Wales from 2000 to 2004. [Horizontal axis: number of days from 1 Jan 1999; vertical axis: daily counts.]

The study is restricted to accidents on roads accessible to the general public. The data are checked for consistency, duplications are excluded and data quality is routinely assessed. Crash reports are recorded at the earliest 24 hours after the event, and therefore this paper looks at monitoring daily accident counts in an attempt to find unusual trends in the accidents, with particular emphasis on people who cause the accidents. For other traffic surveillance work and discussion, see Sethi & Zwi (1999), Rossi et al. (2005) and Peden & Toroyan (2005).

When an unusual increase in daily accidents occurs, the two-pass CUSUM is used to decide whether certain age groups have contributed significantly to this increase while others have not. That is, the aim is to detect clustering of the age groups that contributed significantly to unusual increases in accidents. The age of the person causing the accident is the focus of the study. Once a cluster is detected, we can attempt to characterize the nature of accidents caused by this age cluster relative to the remainder of the population. This information can be used to target safety campaigns to reduce such accidents in the future.

The tails of the age distribution are grouped. Ages below 17 are grouped, ages above 74 are grouped, and all other ages (rounded to floor values) are in separate age groups. The count for age group $q$ on day $t$ is denoted $Y_{q,t}$, and the model used is the following transitional Poisson regression model, where $E(Y_{q,t}) = \mu_{q,t}$ and with $\log(\mu_{q,t})$ given by

$$dw_{q\kappa(t)} + \beta_{1q}\cos(\omega t) + \beta_{2q}\sin(\omega t) + \beta_{3q}\,ph_t + \beta_{4q}\,sh_t + \alpha_{1q}\log(y_{q,t-1} + 1) + \alpha_{2q}\log(y_{q,t-2} + 1),$$

where $\omega = 2\pi/365.25$, $dw_{q\kappa(t)}$ is the mean adjustment signifying the influence of the $\kappa(t)$th day of the week, $\beta_{iq}$ and $\alpha_{iq}$ are the regression coefficients associated with harmonic terms, holidays and transitional influence, and $ph_t = 1$ and $sh_t = 1$ when the $t$th day is a public and school holiday, respectively, and zero otherwise. This model is fitted using all the data, and the one-day-ahead forecast of counts is made with the fitted model. Assume that this one-day-ahead forecast is given by $\hat{Y}_{q,t}$. If $y_{q,t}$ is larger than the upper 0.1% point for a Poisson distribution with mean equal to $\hat{Y}_{q,t}$, then the number of accidents for that day is flagged as unusual. The same model above is applied to total counts to establish their day-ahead forecast values. When counts are unusually large, the two-pass CUSUM is run on the age-group counts vector relative to the day-ahead forecast vector for all age groups in order to detect age clusters that contributed significantly to the unusual increase in accidents.

For 22 November 2003, there were 217 accidents relative to the 136.69 expected from the one-day-ahead forecast. The two-pass CUSUM characterized this unusually high number of accidents as mostly caused by males between the ages of 17 and 31 and females between the ages of 35 and 50. For males above 37, there was no evidence of an increase from forecasted values. However, for females there was some evidence of more incidents than forecasted for the age range 18 to 24 years. Clearly, male and female crash statistics behaved differently during this day.

For 18 January 2001, 184 accidents relative to the 125.49 expected were reported. The two-pass CUSUM characterized this unusually high number of accidents as caused by (a) males between the ages of 18 and 24, (b) males between the ages of 31 and 41, and (c) females between the ages of 17 and 26.
For females above 29 years old, there was evidence of mostly below forecasted counts, and incidents involving males above 41 years old were on average approximately equivalent to forecasted totals. Again males and females behaved differently during this day for at least ages above 30.

For 15 November 2002, there were 223 accidents relative to the 144.03 expected. The two-pass CUSUM for data for sexes separately characterized this event by a higher than expected number of accidents for ages 17 to 42 for both sexes, and mostly about average to below average expected incidents outside this range.

For 25 March 2001, there were 141 accidents relative to the 110.89 expected from the one-day-ahead forecast. Exponentially weighted moving-average (EWMA) charts on daily counts did not signal a significant change for this day, but the two-pass CUSUM applied to female and male data separately clearly indicated that males had significantly more accidents than expected, whereas females had fewer accidents than expected, particularly for ages greater than 42. Males had significantly high incidents around ages (a) 21 and 22, (b) 25, 26 and 27, and (c) 51.

8. Discussion

The two-pass ACUSUM has been demonstrated to be a useful method for describing the nature of epidemics. Sparks (2009) gave an example in which there was more than a 50% chance of detecting a generated cluster even when the counts were below the expected value (i.e. no outbreak overall). Based on the evaluations in this paper, the two-pass ACUSUM is thus expected to have sufficiently reasonable detection probabilities and false alarm properties to be considered as a stand-alone surveillance tool. The paper has explored the weaknesses of

the methodology by establishing where false alarms are more likely. The overall conclusion is that the two-pass ACUSUM is an improvement on inference-based Bonferroni corrections in terms of age-cluster detection.

The two-pass CUSUM has been advocated to characterize the nature of signalled epidemics. In the future, it may be worthwhile to explore it as a surveillance tool. In other words, if epidemics usually start off in localized age groups before spreading to the broader community, then the two-pass adaptive CUSUM has the ability to flag an epidemic even when the total disease counts over all ages are not significantly higher than expected.

[Figure 4: two panels, each plotting the probability of signalling (vertical axis, 0 to 1) against the magnitude of the shift p (horizontal axis, 0 to 1). Curves in (a): total incidents across all ages; age clusters for groups 1-7, 17-23 and 34-40. Curves in (b): total incidents across all ages; age clusters for groups 1-7, 37-43 and 74-80.]

Figure 4. Probability of the two-pass CUSUM plan with c = 1.45 signalling an age-cluster outbreak with a total of (a) 40, and (b) 80 possible age groups for various size shifts (p).

Figures 4(a), (b) and 5 examine the detection probabilities relative to the usual c chart for generated epidemics: (i) in the tails or the middle of the age distribution for the 40, 80 and 40 age-group examples respectively, (ii) when the in-control means are uniformly distributed on the interval [0, 2] and arranged in ascending order. The epidemic in Figure 4 has additional epidemic counts generated with means equal to a constant (p) times 3.1, 3.2, ..., 3.7 for ages i to i + 6, respectively. If epidemics are located in the lower or middle age-groups, where means are lower, then the two-pass CUSUM is more efficient at detecting them than is the c chart for total counts. When the age-cluster epidemic occurs in the highest age groups, then aggregation across all ages is more efficient for lower
values of p (smaller changes). If epidemics are spread over fewer than seven age groups, this favours the two-pass CUSUM plan more than the c chart. Figure 5 represents results when epidemic clusters are generated with additional counts having a vector of means equal to p × (3.3, 3.4, 3.5, 3.6, 3.7) for an application with 40 age groups. Further simulations have demonstrated the relative efficiency of a two-pass CUSUM plan for detecting epidemics when these are known to start in narrow age ranges, such as in schools.

[Figure 5: probability of signalling (vertical axis, 0 to 1) plotted against the magnitude of the shift p (horizontal axis, 0 to 1). Curves: total incidents across all ages; age clusters for groups 1-5, 18-22, 34-38 and 36-40.]

Figure 5. Probability of the two-pass CUSUM plan with c = 1.45 detecting an age-cluster epidemic across five age groups when there are 40 possible age groups.

Future research effort is required to develop this beyond the ‘Shewhart-type’ charts illustrated here. Applying the EWMA smoothing to each age group across time before applying the two-pass CUSUM plan to cluster age groups has great potential as a viable surveillance tool in those cases where epidemics are known to start within narrow age clusters (fewer than 11 age groups for 80 age groupings and fewer than seven age groups for 40 age groupings). Recording the clusters detected for consecutive days is a helpful way of tracking how epidemics move between ages over time. This, together with spatial clustering technology such as scan statistics (e.g. Kulldorff 2001), may be helpful in understanding how epidemics move over time.

Other areas of application not already considered include the following.
(i) Monitoring changes in the workforce age distribution in particular industry groups in a country. What age groups have contributed to any sharp changes that are apparent?
(ii) Monitoring changes in sales of certain products. Who in the population has contributed to significant increases?
(iii) Monitoring sports injuries. What age groups have contributed significantly to an increase in weekly sports injuries?

Table 6 provides the formulae for the approximate control limits needed to apply these methods with the usual levels of significance (10%, 5% and 1%). These control limits are only applicable for c = 1.4, 1.45 or 1.5. The formulae are only valid for age groups with individual means within the range of 0.04 to 7, and cannot be used outside this range without risk. However, this range is sufficient for most applications.

TABLE 6
Functions for calculating control limits for the homogeneous Poisson counts when dealing with steady-state situations (see Sparks 2009 for more accurate results)

In-control average run length (ARL): $h(\mu, c)$

75: $4.4581 + 4.3377\mu - 1.9699\mu^2 + 0.5589\mu^3 + 0.032\log(c) + 1.0036\log(\mu) - 0.07984\mu^4 + 0.0043\mu^5 - 9.5374\mu\log(c) + 4.27671\mu^2\log(c) - 1.2455\mu^3\log(c) + 0.1824\mu^4\log(c) - 0.010026\mu^5\log(c)$

87: $4.7951 + 4.76554\mu - 2.2056\mu^2 + 0.6773\mu^3 + 0.16326\log(c) + 1.1038\log(\mu) - 0.1073\mu^4 + 0.0064\mu^5 - 11.165\mu\log(c) + 5.1931\mu^2\log(c) - 1.6566\mu^3\log(c) + 0.27\mu^4\log(c) - 0.0162\mu^5\log(c)$

93: $5.4675 + 3.6857\mu - 1.5116\mu^2 + 0.3754\mu^3 - 0.8161\log(c) + 1.2037\log(\mu) - 0.0428\mu^4 + 0.0017\mu^5 - 9.2732\mu\log(c) + 4.04850\mu^2\log(c) - 1.08513\mu^3\log(c) + 0.13179\mu^4\log(c) - 0.005579\mu^5\log(c)$

100: $5.2388 + 4.05548\mu - 1.2343\mu^2 + 0.1846\mu^3 - 0.4956\log(c) + 1.12697\log(\mu) - 0.0094\mu^4 - 0.0001\mu^5 - 9.09511\mu\log(c) + 2.61631\mu^2\log(c) - 0.36376\mu^3\log(c) + 0.01349\mu^4\log(c) + 0.000528\mu^5\log(c)$

125: $6.0305 + 4.4257\mu - 2.0587\mu^2 + 0.5967\mu^3 - 0.8535\log(c) + 1.3002\log(\mu) - 0.08273\mu^4 + 0.00420\mu^5 - 11.0298\mu\log(c) + 5.242394\mu^2\log(c) - 1.582375\mu^3\log(c) + 0.22582\mu^4\log(c) - 0.011698\mu^5\log(c)$

150: $6.0378 + 5.5329\mu - 2.5833\mu^2 + 0.7417\mu^3 - 0.5041\log(c) + 1.2702\log(\mu) - 0.10520\mu^4 + 0.0056\mu^5 - 12.86236\mu\log(c) + 6.0649\mu^2\log(c) - 1.81826\mu^3\log(c) + 0.26692\mu^4\log(c) - 0.014607\mu^5\log(c)$

175: $6.8699 + 3.7926\mu - 0.3251\mu^2 - 0.29688\mu^3 - 1.664\log(c) + 1.36808\log(\mu) + 0.08288\mu^4 - 0.006\mu^5 - 8.11061\mu\log(c) - 0.44292\mu^2\log(c) + 1.19803\mu^3\log(c) - 0.27968\mu^4\log(c) + 0.01896\mu^5\log(c)$

200: $7.3297 + 5.0459\mu - 2.1788\mu^2 + 0.6066\mu^3 - 1.2814\log(c) + 1.5245\log(\mu) - 0.08728\mu^4 + 0.00479\mu^5 - 13.40259\mu\log(c) + 5.96720\mu^2\log(c) - 1.73093\mu^3\log(c) + 0.25450\mu^4\log(c) - 0.01411\mu^5\log(c)$

300: $6.8485 + 10.5685\mu - 6.3101\mu^2 + 1.87883\mu^3 + 3.36286\log(c) + 1.78563\log(\mu) - 0.25975\mu^4 + 0.01334\mu^5 - 29.82156\mu\log(c) + 17.77232\mu^2\log(c) - 5.30351\mu^3\log(c) + 0.73380\mu^4\log(c) - 0.03772\mu^5\log(c)$

400: $9.3522 + 6.55279\mu - 2.71641\mu^2 + 0.61434\mu^3 - 2.19829\log(c) + 1.797284\log(\mu) - 0.06934\mu^4 + 0.003081\mu^5 - 17.71577\mu\log(c) + 7.305487\mu^2\log(c) - 1.67901\mu^3\log(c) + 0.19249\mu^4\log(c) - 0.00868\mu^5\log(c)$
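As a rough illustration of how one of the Table 6 formulae might be evaluated, the sketch below codes the ARL = 100 row together with the stated validity checks (c restricted to 1.4, 1.45 or 1.5, and the mean restricted to the range 0.04 to 7). Several things here are assumptions rather than statements from the paper: log is taken to be the natural logarithm, and the names `ARL_100`, `VALID_C` and `control_limit` are hypothetical labels chosen for this example.

```python
import math

# Coefficients of the ARL = 100 row of Table 6, transcribed term by term.
# The dictionary layout is a hypothetical convenience for this sketch.
ARL_100 = {
    "const": 5.2388,
    "log_c": -0.4956,
    "log_mu": 1.12697,
    # coefficients of mu^1 .. mu^5
    "mu": [4.05548, -1.2343, 0.1846, -0.0094, -0.0001],
    # coefficients of mu^1 * log(c) .. mu^5 * log(c)
    "mu_log_c": [-9.09511, 2.61631, -0.36376, 0.01349, 0.000528],
}

VALID_C = (1.4, 1.45, 1.5)   # values of c for which the text says the limits apply


def control_limit(mu, c, coef=ARL_100):
    """Approximate control limit h(mu, c) from one row of Table 6.

    Assumption: log denotes the natural logarithm.
    """
    if c not in VALID_C:
        raise ValueError("formulae are only stated for c = 1.4, 1.45 or 1.5")
    if not 0.04 <= mu <= 7:
        raise ValueError("formulae are only stated for means in [0.04, 7]")
    log_c = math.log(c)
    h = coef["const"] + coef["log_c"] * log_c + coef["log_mu"] * math.log(mu)
    for power, (a, b) in enumerate(zip(coef["mu"], coef["mu_log_c"]), start=1):
        h += (a + b * log_c) * mu ** power
    return h


if __name__ == "__main__":
    for mu in (0.5, 1.0, 2.0):
        print(mu, round(control_limit(mu, 1.45), 4))
```

Rejecting inputs outside the stated ranges, rather than extrapolating, mirrors the caution in the text that the formulae cannot be used outside those ranges without risk. The other rows of Table 6 could be handled by supplying a different coefficient dictionary.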


References

ELLIOTT, P. (1995). Investigation of disease risks in small areas. Occup. Environ. Med. 52, 265–275.
ELLIOTT, P. & WAKEFIELD, J. (2001). Disease clusters: should they be investigated, and, if so, when and how? J. R. Statist. Soc. A 164, 3–12.
ELLIOTT, P., MARTUZZI, M. & SHADDICK, G. (1995). Spatial statistical methods in environmental epidemiology: a critique. Statist. Methods Med. Res. 4, 137–159.
FLEMING, L.E., DUCATMAN, A.M. & SHALAT, S.L. (1992). Disease clusters in occupational medicine: a protocol for their investigation in the workplace. Amer. J. Ind. Med. 22, 33–47.
GOULD, M.S., WALLENSTEIN, S. & KLEINMAN, M. (1990). Time-space clustering of teenage suicide. Am. J. Epidemiol. 131, 71–78.
KONTY, K. & FARZAD MOSTASHARI, F. (2007). Cluster detection incorporating lagged test data. Adv. Disease Surveillance 2, 53.
KULLDORFF, M. (2001). Prospective time periodic geographical disease surveillance using a scan statistic. J. R. Statist. Soc. A 164, 61–72.
LUCAS, J.M. (1985). Counted data cusums. Technometrics 27, 129–144.
PAGE, E.S. (1954). Continuous inspection schemes. Biometrika 41, 110–114.
PEDEN, M. & TOROYAN, T. (2005). Counting road traffic deaths and injuries: poor data should not detract from doing something! Ann. Emerg. Med. 46, 158–160.
ROSSI, G., LAMPUGNANI, L. & MARCHI, M. (1999). An approximate CUSUM procedure for surveillance of health events. Statist. Med. 18, 2111–2122.
ROSSI, P.G., FARCHI, S., CHINI, F., CAMILLONI, L., BORGIA, P. & GUASTICCHI, G. (2005). Road traffic injuries in Lazio, Italy: A descriptive analysis from an emergency department-based surveillance system. Ann. Emerg. Med. 46, 152–157.
SETHI, D. & ZWI, A. (1999). Traffic accidents another disaster? Eur. J. Public Health 9, 65–66.
SHEWHART, W.A. (1939). Statistical method from the viewpoint of quality control. The Graduate School, Washington, D.C.: Department of Agriculture.
SPARKS, R. (2009). Two-pass CUSUM for identifying age cluster outbreaks: The long version. CSIRO Mathematical and Information Sciences, Technical Report No. 09/116.
SPARKS, R., CARTER, C., GRAHAM, P.L., MUSCATELLO, D., CHURCHES, T., KALDOR, J., TURNER, R., ZHENG, W. & RYAN, L. (2010). A strategy for understanding the sources of variation in syndromic surveillance for bioterrorism and public health incidence. IIE Transactions 42, 613–631.
WARTENBERG, D. (2001). Investigating disease clusters: why, when and how? J. R. Statist. Soc. A 164, 13–22.
WOODALL, W.H., MARSHALL, J.B., JONER, M.D., FRAKER, S.E. & ABDEL-SALAM, A.G. (2008). On the use of scan methods in health-related surveillance. J. R. Statist. Soc. A 171, 223–237.
