Multireader multicase variance analysis for binary data


J. Opt. Soc. Am. A / Vol. 24, No. 12 / December 2007

Gallas et al.

Brandon D. Gallas,* Gene A. Pennello, and Kyle J. Myers

National Institute of Biomedical Imaging and Bioengineering/Center for Devices and Radiological Health, Laboratory for the Assessment of Medical Imaging Systems, U.S. Food and Drug Administration, Silver Spring, Maryland, 20993, USA
*Corresponding author: [email protected]

Received April 13, 2007; revised July 7, 2007; accepted August 2, 2007; posted August 3, 2007 (Doc. ID 81774); published September 28, 2007

Multireader multicase (MRMC) variance analysis has become widely utilized to analyze observer studies for which the summary measure is the area under the receiver operating characteristic (ROC) curve. We extend MRMC variance analysis to binary data and also to generic study designs in which every reader may not interpret every case. A subset of the fundamental moments central to MRMC variance analysis of the area under the ROC curve (AUC) is found to be required. Through multiple simulation configurations, we compare our unbiased variance estimates to naïve estimates across a range of study designs, average percent correct, and numbers of readers and cases.

OCIS codes: 000.5490, 110.3000, 330.5510.

1. INTRODUCTION
The study of image quality often involves the use of psychophysical studies to evaluate an imaging system, or perhaps to validate model observer predictions for circumstances new to that model observer. Studies involving human readers are also central to the evaluation of new imaging technologies for which there is no alternative to the use of clinical images from actual patients. Just as important as the mean performance of the observer is the uncertainty of the measurement. Previous publications have presented methods for the analysis of the uncertainty in the summary measure of observer performance using the multireader multicase (MRMC) paradigm, mainly in the context of analyzing the area under the receiver operating characteristic (ROC) curve [1–5] and a "fully crossed" study design, where every reader reads every case. The data analyzed in these publications are typically the matrix of ROC scores obtained from each reader for each case.

In this paper we present an unbiased method for estimating the variance in an experiment with multiple readers and multiple cases for which the outcomes are binary and the summary performance measure is a percent correct (PC). We also extend the analysis beyond the fully crossed study design to allow arbitrary study designs, including the "doctor–patient" study design, where each doctor sees his or her own patients.

Some examples of PCs are sensitivity, specificity, and the PC in an M-alternative forced-choice (MAFC) experiment. Sensitivity is the percent of abnormals correctly identified, and specificity is the percent of normals correctly identified. We shall also refer to the abnormals as the signal-present cases (hypothesis 1, H1), and the normals as the signal-absent cases (hypothesis 0, H0). In an MAFC experiment, the reader must choose which of M alternatives within a trial contains the signal. So, in

the typical two-alternative forced-choice (2AFC) task a trial is often a pair of images, one signal-absent and one signal-present, displayed side by side or in sequence. The outcome of the choice is binary; the reader is either right or wrong. The rate at which the reader correctly picks the alternative with the signal is the PC. Regardless of the specific task, readers, and cases, we denote the binary success outcome generically by s(g, γ), where g specifies the case and γ specifies the reader. This success outcome is 0 when reader γ incorrectly identifies case g and 1 when the reader is successful.

In a particular study, there is a set of N_g cases and a set of N_γ readers. Without replicating readings, we could collect N_g × N_γ outcomes if every reader reads every case (the fully crossed design). For the doctor–patient study design, depicted pictorially on the left of Fig. 1, some of these data are not collected. The shaded area in Fig. 1 indicates which cases were read by which readers. Since each case is read by only one reader, a significant amount of data are missing compared to the fully crossed design, which would fill the whole matrix. Additionally, we allow the number of cases, or "case load," read by each reader to be different.

On the right in Fig. 1 we provide a simple example demonstrating the data from a binary-outcome experiment with multiple readers, each reading their own cases. The PC in the last row weighs each reading equally: 100 correct decisions divided by 130 readings is 77%. Now one might assume that the readings are all independent and identically distributed (iid) and estimate the standard error using the sample variance divided by the total number of readings; this equals 3.7. However, since each reader may have a different skill at the task, the readings are not identically distributed, and this naïve estimate likely underestimates the true variance. Instead of calculating the average performance as in


Fig. 1. Graphic on the left shows the (transpose) layout of data from a binary-outcome experiment with multiple readers. Compared to a fully crossed data set, which would fill the entire matrix, much data are missing. The table on the right shows a simple example. The PC in the last row weighs each reading equally. One estimate of the standard error of that average considers all the readings to be iid. The result is 3.7. One could instead obtain an average PC by averaging the reader-specific PCs, resulting in 60. Continuing, one might estimate the standard error of this average, yielding 20.6. While both of the averages are valid, both variance estimates are wrong.

Fig. 1, one might average the three reader-specific PCs, yielding (88 + 73 + 20)/3 = 60, which is noticeably different from the previous average performance. One might continue and estimate the standard error using the sample variance of the three reader-specific PCs divided by the number of readers, yielding 20.6. This result is more than five times that of the previous underestimate but, in reality, probably overestimates the true variance. This overestimate is due to the reader-specific PCs being noisy realizations of the true PCs.

This simple example highlights two naïve estimates of variance. The first incorrectly treats the readings as identically distributed, and the second incorrectly treats the reader PCs as being measured without error. The variance estimate that we provide appropriately accounts for the readers, cases, and correlations that arise from the actual study design. These variance estimates apply to the average PC when readers are treated equally or when readings are treated equally.

In what follows, we make the following assumptions: Readers are iid, cases are also iid, and readers are independent of cases. Additionally, given a reader and a case, an outcome can be deterministic, as when the reader is a mathematical classifier, or an outcome can be a random variable, as might be expected when the reader is a human and unable to reproduce the same decision on subsequent readings (reader jitter). This distinction is unnecessary for the current work; our variance estimate accounts for reader jitter whether it exists or not.
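The two naïve standard errors of the Fig. 1 example can be reproduced in a few lines; a minimal sketch (variable names ours) using the numbers given above, the first centered on the reading-weighted average and the second on the reader-weighted average, the same quantities formalized later in Eqs. (14) and (15):

```python
import numpy as np

# Fig. 1 example: three readers with PCs of 88%, 73%, and 20%;
# 130 readings in total, 100 of them correct.
reader_pc = np.array([88.0, 73.0, 20.0])

# Weigh each reading equally; naive iid standard error of the mean
P_g = 100.0 * 100 / 130                        # about 77%
p = P_g / 100.0
se_iid = 100.0 * np.sqrt(p * (1.0 - p) / (130 - 1))

# Weigh each reader equally; naive reader-sample standard error
P_gamma = reader_pc.mean()                     # about 60%
se_reader = np.sqrt(reader_pc.var(ddof=1) / reader_pc.size)

print(round(se_iid, 1), round(P_gamma, 1), round(se_reader, 1))  # 3.7 60.3 20.6
```

As the text argues, the first number underestimates and the second overestimates the true MRMC variance; they merely bracket it.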

2. THEORY AND METHODS

A. Setup
We define a design matrix D and a success matrix S. Both matrices are N_g × N_γ; their elements are denoted d_ir and s_ir, where i stands for the ith case and r for the rth reader. The design matrix holds a one in every position where an outcome was collected and a zero everywhere else. The success matrix holds the observed success outcomes s_ir = s(g_i, γ_r). For the rth reader, we denote the number of cases read by N_g|r = \sum_{i=1}^{N_g} d_{ir} and the PC by

\hat{p}_r = \frac{1}{N_{g|r}} \sum_{i=1}^{N_g} d_{ir} s_{ir}.   (1)
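Because d_ir multiplies s_ir everywhere, Eq. (1) is a one-liner on the two matrices; a minimal sketch with hypothetical toy numbers (ours, not the paper's):

```python
import numpy as np

# Hypothetical 4-case x 2-reader example: reader 1 reads cases 1-2,
# reader 2 reads cases 3-4 (a tiny doctor-patient layout).
D = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
S = np.array([[1, 0],      # entries where D is zero are arbitrary
              [0, 0],
              [0, 1],
              [0, 1]])

N_g_r = D.sum(axis=0)                  # cases read per reader, N_{g|r}
p_hat = (D * S).sum(axis=0) / N_g_r    # Eq. (1)
print(p_hat)                           # [0.5 1. ]
```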

When an outcome is not collected, d_ir = 0 and s_ir is technically undefined. In practice, we can set s_ir to any number we want when d_ir = 0, since it will always appear with d_ir and the product will always be zero. Therefore, to ease the transition to ensemble statistics, we think of s_ir as the success outcome whether or not it was collected in the study. We shall assume that the design matrix does not depend on the success matrix and vice versa, as such dependencies would certainly bias the study. In this paper we shall consider fixed study designs and random study designs. For a fixed study design, D is specified before data are collected; for a random study design, there is a protocol, or sampling scheme, that determines a distribution for the possible study designs. The typical endpoint in a study is a reader-averaged PC:

\hat{P} = \sum_{r=1}^{N_\gamma} w_r \hat{p}_r.   (2)

While this average appears trivial, there is a choice to be made about how to average, that is, how to weigh each reader. Two common choices exist for the doctor–patient study design, as mentioned in the Introduction: weigh each reader equally (w_r = 1/N_γ) or weigh each reading equally (w_r = N_g|r/N_g). We denote the resulting PCs as P̂_γ and P̂_g, respectively. Now, when cases are read by more than one reader, the total number of readings is more than N_g. Considering this situation, a more general expression for the second set of weights is w_r = N_g|r / \sum_{r=1}^{N_\gamma} N_g|r. These weights always sum to one. Of course, if each reader reads the same number of cases, P̂_g = P̂_γ, whereas if the case load of each reader is random, the weights of P̂_g (≠ P̂_γ) will also be random. Other choices for weights may be driven by the experience or skill of each reader. In the most general framework the weights are arbitrary, as long as they sum to one.

B. Population Quantities
1. Fixed Study Designs
The mean of P̂ for a fixed study design D is straightforward:


\langle \hat{P} | D \rangle = \langle s(g, \gamma) | D \rangle = \langle s(g, \gamma) \rangle.   (3)

Note that we use brackets ⟨…⟩ to denote expected values over all random variables, and the notation ⟨…|D⟩ denotes the conditional expected value where we fix the design matrix D and average over the remaining random quantities: the readers and cases. The expected reader-averaged PC, as is shown above, has no dependence on the study design or the reader weights. Next, carefully accounting for possible correlations across readers and cases (see Appendix A), the population variance of P̂ for a fixed study design is

V_{|D} = \mathrm{var}(\hat{P}|D) = c_1\langle s(g,\gamma)^2\rangle + c_4\langle\langle s(g,\gamma)|\gamma\rangle^2\rangle + c_5\langle\langle s(g,\gamma)|g\rangle^2\rangle + c_8\langle s(g,\gamma)\rangle^2,   (4)

where the coefficients depend on the study design and the reader weights [see Eqs. (A7)–(A10)]. The unique numbering of the coefficients above is driven by how we label the moments. We refer to the moments in Eq. (4) as M1, M4, M5, and M8 to coincide with notation previously derived for the empirical area under the ROC curve (AUC) [4,5]. For AUC, there are eight fundamental moments of the success outcomes. The factor of 2 increase in the number of moments comes from partitioning cases into two subsets: signal-absent and signal-present. The variance can be written concisely as a scalar product between the coefficients and the moments arranged in vectors c and M; that is, V_{|D} = c^t M, where coefficients c_2, c_3, c_6, and c_7 are all understood to equal zero. This variance will carry a subscript γ or g when needed to indicate weights treating each reader equally or weights treating each reading equally. The moments themselves are nothing more than second moments (M1 through M7) and a mean squared (M8), as are expected in a variance. Finally, we shall extend this notation to include M0 = ⟨s(g, γ)⟩, the success outcome averaged over reader γ and case g.

The simple form of the variance expression in Eq. (4) hides complexity that comes with all the different possible study designs and weights. It is worthwhile to see how the variance of P̂ is related to the variances of the reader-specific p̂_r. In general,

V_{|D} = \sum_{r=1}^{N_\gamma} w_r^2 \,\mathrm{var}(\hat{p}_r|D) + \sum_{r=1}^{N_\gamma} \sum_{r' \neq r} w_r w_{r'} \,\mathrm{cov}(\hat{p}_r, \hat{p}_{r'}|D).   (5)

The variance of p̂_r given D is the variance you get when you select a random reader and a random set of N_g|r cases from the entire population. Since readers are sampled from a common population, this variance does not depend on any particular reader. This variance depends only on the number of cases read, which can be different for each reader depending on the study design. The covariance of p̂_r, p̂_r' has a stronger dependence on the study design since it considers two readers. In Appendix A we derive the variance and covariance appearing in Eq. (5). The single-reader variance is

\mathrm{var}(\hat{p}_r|N_{g|r}) = \langle \hat{p}_r^2|N_{g|r}\rangle - \langle \hat{p}_r\rangle^2   (6)

= \frac{1}{N_{g|r}} M_1 + \frac{N_{g|r}-1}{N_{g|r}} M_4 - M_8.   (7)

Notice that the first two terms of the expression above are second moments weighted by coefficients that sum to one and the last term is a negative mean squared. In this way, the expression fits our mental picture that a variance can be decomposed into a second moment minus a mean squared. The covariance for the general study design simplifies for the fully crossed and doctor–patient study designs. The covariance for the fully crossed study design is Eq. (A13) minus the mean squared, or

\mathrm{cov}(\hat{p}_r, \hat{p}_{r'}) = \frac{1}{N_g} M_5 - \frac{1}{N_g} M_8,   (8)

whereas for the doctor–patient study design, the covariance is zero (readers are independent and read different cases).

2. Special Cases and Random Study Designs
The vector of coefficients for a fixed study design is made up of complicated sums that simplify for the study designs considered in this paper (see Table 1). If we allow the study design to be random (with some distribution), we get the variance of P̂ by averaging the coefficients of the fixed study design over the distribution of study designs. This is possible because we assume the design and success matrices are independent. When averaged over the distribution of study designs, the variance is no longer dependent, or conditional, on a fixed study, and the subscript |D should be dropped.

Table 1. List of the Coefficients Needed to Appropriately Weight the Success Moments to Determine the Variance of P̂ for a Fixed Study Design^a

                                                  Doctor–Patient
       Fully Crossed                              w_r = 1/N_γ              w_r = N_g|r/N_g                  General
c_1 =  1/(N_γ N_g)                                1/N_g                    1/N_g                            Σ_r w_r²/N_g|r
c_4 =  (N_g − 1)/(N_γ N_g)                        (N_g − N_γ)/(N_γ N_g)    Σ_r (N_g|r² − N_g|r)/N_g²        Σ_r w_r²(N_g|r − 1)/N_g|r
c_5 =  (N_γ − 1)/(N_γ N_g)                        0                        0                                0
c_8 =  [(N_g − 1)(N_γ − 1) − N_γ N_g]/(N_γ N_g)   −1/N_γ                   −Σ_r (N_g|r/N_g)²                Σ_r Σ_{r'≠r} w_r w_r' − 1

^a Each column corresponds to a different study design and set of reader weights. Coefficients 2, 3, 6, and 7 are all zero. Sums over r run from 1 to N_γ.
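Table 1 translates directly into code; the following sketch (function and variable names are ours) builds the coefficient vector for the fully crossed column and for the general doctor–patient column, and combines it with a moment vector as in V_|D = c^t M:

```python
import numpy as np

def coeffs_fully_crossed(n_readers, n_cases):
    """Fully crossed column of Table 1: c = (c1, c4, c5, c8)."""
    c1 = 1.0 / (n_readers * n_cases)
    c4 = (n_cases - 1.0) / (n_readers * n_cases)
    c5 = (n_readers - 1.0) / (n_readers * n_cases)
    c8 = ((n_cases - 1.0) * (n_readers - 1.0) - n_readers * n_cases) \
         / (n_readers * n_cases)
    return np.array([c1, c4, c5, c8])

def coeffs_doctor_patient(w, n_cases_per_reader):
    """General column of Table 1: arbitrary weights, non-overlapping case samples."""
    w = np.asarray(w, float)
    n = np.asarray(n_cases_per_reader, float)
    c1 = np.sum(w**2 / n)
    c4 = np.sum(w**2 * (n - 1.0) / n)
    c5 = 0.0
    c8 = -np.sum(w**2)   # equals sum_{r' != r} w_r w_r' - 1 when the weights sum to one
    return np.array([c1, c4, c5, c8])

def variance(c, M):
    """Scalar product of coefficients and moments (M1, M4, M5, M8)."""
    return float(c @ np.asarray(M, float))
```

Note that in each column the four coefficients sum to zero, so the variance vanishes when all four moments are equal, as for a deterministic success outcome.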


C. Variance Estimates
1. Fixed Study Design
Expressing the fixed-study-design variance of P̂ as a linear combination of moments, as described in the previous section, leads to the unbiased moment estimator that we present here,

\hat{V}_{|D} = c^t \hat{M},   (9)

where the vector of coefficients c is the same as before. The estimates of the moments are derived by replacing the expected values defining the moments [Eqs. (A3)–(A6)] with sums over the readers and cases. The estimates are as follows:

\hat{M}_1 = \sum_{r=1}^{N_\gamma} w_r \sum_{i=1}^{N_g} \frac{d_{ir}^2 s_{ir}^2}{\sum_{i^*=1}^{N_g} d_{i^*r}^2},   (10)

\hat{M}_4 = \sum_{r=1}^{N_\gamma} w_r \sum_{i=1}^{N_g} \frac{d_{ir} s_{ir}}{\sum_{i^*=1}^{N_g} d_{i^*r}} \frac{\sum_{i' \neq i} d_{i'r} s_{i'r}}{\sum_{i'^* \neq i} d_{i'^*r}},   (11)

\hat{M}_5 = \sum_{r=1}^{N_\gamma} w_r \sum_{r' \neq r} \frac{w_{r'}}{1 - w_r} \frac{\sum_{i=1}^{N_g} d_{ir} d_{ir'} s_{ir} s_{ir'}}{\sum_{i^*=1}^{N_g} d_{i^*r} d_{i^*r'}},   (12)

\hat{M}_8 = \sum_{r=1}^{N_\gamma} w_r \sum_{i=1}^{N_g} \frac{d_{ir} s_{ir}}{\sum_{i^*=1}^{N_g} d_{i^*r}} \sum_{r' \neq r} \frac{w_{r'}}{1 - w_r} \frac{\sum_{i' \neq i} d_{i'r'} s_{i'r'}}{\sum_{i'^* \neq i} d_{i'^*r'}}.   (13)
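The following sketch is a direct transcription of Eqs. (10)–(13) (function and variable names are ours; the loops are kept explicit for clarity rather than speed):

```python
import numpy as np

def mrmc_moments(D, S, w):
    """Unbiased moment estimates of Eqs. (10)-(13).
    D, S: Ng x Nr design and success matrices (0/1); w: reader weights summing to one."""
    D = np.asarray(D, float)
    S = np.asarray(S, float)
    Ng, Nr = D.shape
    ds = D * S                     # observed successes; zero where a case was not read
    n = D.sum(axis=0)              # cases read per reader, N_{g|r}
    # Eq. (10): d^2 = d and s^2 = s for binary data
    M1 = float(np.sum(w * ds.sum(axis=0) / n))
    M4 = 0.0                       # Eq. (11): same reader, two distinct cases
    for r in range(Nr):
        if n[r] < 2:
            continue               # guard for degenerate designs
        tot = ds[:, r].sum()
        M4 += w[r] * np.sum(ds[:, r] / n[r] * (tot - ds[:, r]) / (n[r] - 1.0))
    M5 = 0.0                       # Eq. (12): two distinct readers, same case
    M8 = 0.0                       # Eq. (13): distinct readers, distinct cases
    for r in range(Nr):
        for rp in range(Nr):
            if rp == r:
                continue
            denom5 = (D[:, r] * D[:, rp]).sum()
            if denom5 > 0:         # non-overlapping reader pairs contribute zero
                num5 = (ds[:, r] * ds[:, rp]).sum()
                M5 += w[r] * w[rp] / (1.0 - w[r]) * num5 / denom5
            tot_p = ds[:, rp].sum()
            for i in range(Ng):
                if D[i, r] == 0:
                    continue
                denom8 = n[rp] - D[i, rp]
                if denom8 > 0:
                    M8 += (w[r] * ds[i, r] / n[r] * w[rp] / (1.0 - w[r])
                           * (tot_p - ds[i, rp]) / denom8)
    return M1, M4, M5, M8
```

Combining these estimates with the Table 1 coefficients gives the estimator V̂_|D = c^t M̂ of Eq. (9).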

The weights for each pair of observations are analogous to the weights used for the average performance: Each case (or pair of cases) is given equal weight for each reader, and the readers are given the same (relative) weights as before. In theory, the weights could be different from those for P̂; however, there does not seem to be a good reason to make them different. As for P̂ and V_|D, we add the subscript γ or g to V̂_|D when necessary to indicate whether the weights equally weigh each reader or each reading.

In situations where two readers r, r' have nonoverlapping case samples, the denominator in M̂_5 can be zero. But at the same time, the numerator will be zero as well. In these situations the r, r' contribution to M̂_5 is taken to be zero. Consequently, for the doctor–patient study design, where readers never read the same cases, M̂_5 is entirely zero.

When replacing the expected values with sums for estimation, there are two things to remember: avoid biases and count the number of samples that are being summed. The elements of the design matrix are an easy way to count the number of samples that are being summed. Biases creep in when we replace a squared average with a squared sum. To avoid the bias, replace the squared average with two sums and do not include the index of the first sum in the second sum. For example, the estimate of M5 squares the average over readers for a fixed case. When replacing this squared average with sums over r and r', we do not let r' equal r. We also normalize the weights in the sum over r' so that they sum to one. The result can be shown to be unbiased with standard algebraic and probabilistic manipulations.

Unfortunately, our moment-based MRMC variance estimate is not necessarily positive. It is a linear combination of sums of squares, where one coefficient, c_8, is negative. The possibility of negative estimates is an unfortunate consequence of estimating variances with sums of squares and too few samples. Bayesian and maximum-likelihood estimates could avoid the unfortunate negative estimates, but that approach is beyond the scope of the nonparametric treatment of this paper.

2. Random Study Design
The only change needed to account for random study designs is to replace c with an estimate of ⟨c⟩. One estimate of ⟨c⟩ is just the observed c itself, which would not be an actual change of the fixed-study-design variance estimator. Other estimates of ⟨c⟩ would require priors on the distribution of possible study designs. For this manuscript, we shall investigate the fixed-study-design estimator and consider other estimators at a later date.

3. Naïve Estimates
As a basis for comparison, we consider the two naïve estimates described in the Introduction. Neither accounts for the MRMC nature of the data, but both have been used in the literature. The first estimate essentially assumes that all the readings are iid, indirectly assuming that readers all have the same skill and are reading different cases. Given this assumption, the success outcomes are all independent Bernoulli trials with the same probability of success, and the variance of the reader-averaged PC is estimated as

\hat{V}_{\mathrm{naive}\_g} = \frac{1}{N_{\mathrm{total}}(N_{\mathrm{total}} - 1)} \sum_{r=1}^{N_\gamma} \sum_{i=1}^{N_g} d_{ir}(s_{ir} - \hat{P}_g)^2,   (14)

where N_total denotes the total number of readings, that is, d_ir summed over all readers and cases. The second estimate uses the sample variance of the reader-specific PCs:

\hat{V}_{\mathrm{naive}\_\gamma} = \frac{1}{N_\gamma(N_\gamma - 1)} \sum_{r=1}^{N_\gamma} (\hat{p}_r - \hat{P}_\gamma)^2.   (15)

This statistic essentially assumes that the PCs are not noisy, that they represent the true reader skill (infinite testers). This statistic also assumes that when two readers read the same case, the outcomes are independent, ignoring the correlation induced by the case.

D. Simulation
1. Model
We shall utilize the Monte Carlo (MC) simulation scheme developed by Roe and Metz [6] to investigate the variance estimates presented above in a 2AFC experiment. This simulation scheme assumes that a reader generates two


scores (t_0ir, t_1ir) for each case, where a case represents a signal-absent and signal-present pair of alternatives. If the score of the signal-absent alternative is lower than the score of the signal-present alternative, the success outcome for the case is one; otherwise, it is zero:

s_{ir} = s(t_{1ir} - t_{0ir}) = \begin{cases} 1 & \text{if } t_{1ir} - t_{0ir} > 0 \\ 0 & \text{if } t_{1ir} - t_{0ir} < 0 \end{cases}.   (16)

The model for the scores is a sum of Gaussian random variables:

t_{0ir} = 0 + [R]_{0r} + [C]_{0i} + [RC]_{0ir},   (17)

t_{1ir} = \mu_t + [R]_{1r} + [C]_{1i} + [RC]_{1ir}.   (18)

Here, t_0ir and t_1ir are the rth reader's scores for the signal-absent and signal-present alternatives of the ith case. Except for μ_t, which indicates the separation between the two score distributions, the terms in t_0ir, t_1ir are independent zero-mean Gaussian random variables that we refer to as the reader effect (σ_R²), the case effect (σ_C²), and the reader/case interaction effect (σ_RC²). We shall follow the convenient constraint used by Roe and Metz on the sum of the variances of the random effects such that

\sigma_R^2 + \sigma_C^2 + \sigma_{RC}^2 = 1.   (19)
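A sketch of one draw from this score model, Eqs. (16)–(18), assuming NumPy's random generator (function and argument names are ours):

```python
import numpy as np

def simulate_success_matrix(n_readers, n_cases, mu_t, var_R, var_C, var_RC, seed=None):
    """Draw one 2AFC success matrix under the Roe-Metz score model."""
    rng = np.random.default_rng(seed)
    # independent zero-mean Gaussian effects for the signal-absent (0) and
    # signal-present (1) alternatives
    R = rng.normal(0.0, np.sqrt(var_R), size=(2, 1, n_readers))          # reader
    C = rng.normal(0.0, np.sqrt(var_C), size=(2, n_cases, 1))            # case
    RC = rng.normal(0.0, np.sqrt(var_RC), size=(2, n_cases, n_readers))  # interaction
    t0 = 0.0 + R[0] + C[0] + RC[0]      # Eq. (17)
    t1 = mu_t + R[1] + C[1] + RC[1]     # Eq. (18)
    return (t1 - t0 > 0).astype(int)    # Eq. (16)
```

Repeated calls to this function, each followed by the moment estimator above, are the skeleton of the MC study described next.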

With such a simple description for the scores, we can characterize the distribution of the PC to second order. First, the rth reader's skill averaged over all cases is

p_r = \langle \hat{p}_r | r \rangle = \langle s(t_{1ir} - t_{0ir}) | r \rangle,   (20)

where s(t_1ir − t_0ir) equals one for t_1ir > t_0ir, and zero otherwise. Since we have fixed the reader, the remaining randomness in t_1ir − t_0ir is the sum of two case terms and two reader/case terms. Since all these terms are independent, t_1ir − t_0ir given [R]_1r − [R]_0r is a Gaussian random variable with mean μ_t + [R]_1r − [R]_0r and variance 2σ_C² + 2σ_RC². Therefore,

p_r = \Phi\left(\frac{\mu_t + [R]_{1r} - [R]_{0r}}{\sqrt{2\sigma_C^2 + 2\sigma_{RC}^2}}\right),

where Φ is the cumulative distribution function (cdf) of the standard normal. Furthermore, since the only randomness in this last expression comes from the currently fixed reader effects [R]_1r − [R]_0r, the cdf of reader skill is given by

\Pr(p_r \le \tau) = \int_{-\infty}^{\tau} \frac{\exp(-x^2/4\sigma_R^2)}{\sqrt{4\pi\sigma_R^2}}\, \Phi\left(\frac{\mu_t + x}{\sqrt{2\sigma_C^2 + 2\sigma_{RC}^2}}\right) dx.   (21)

Unfortunately, we were unable to find a closed-form solution for this cdf. So, to find the average reader skill, we have two options. The first option is to (numerically) calculate the average over the two independent reader components [Eq. (21)], letting τ go to infinity. The second option starts over, eliminating the condition on r in Eq. (20). Noticing that t_1ir − t_0ir is simply a Gaussian with variance two centered on μ_t,

M_0 = \langle p_r \rangle = \langle s(t_{1ir} - t_{0ir}) \rangle = \Phi(\mu_t/\sqrt{2}),   (22)

as intended by Roe and Metz [6]. Please note that for population quantities, M1 equals M0 (because s² = s) and M8 equals M0². This leaves M4 and M5 as the remaining second-order moments unaccounted for in this problem. Without a familiar probability density function (pdf) for p_r, the only option we found for calculating these moments is through numerical integration. The integral expressions for M4 and M5 are

M_4 = \int_{-\infty}^{\infty} \frac{\exp(-x^2/4\sigma_R^2)}{\sqrt{4\pi\sigma_R^2}} \left[\Phi\left(\frac{\mu_t + x}{\sqrt{2\sigma_C^2 + 2\sigma_{RC}^2}}\right)\right]^2 dx,   (23)

M_5 = \int_{-\infty}^{\infty} \frac{\exp(-x^2/4\sigma_C^2)}{\sqrt{4\pi\sigma_C^2}} \left[\Phi\left(\frac{\mu_t + x}{\sqrt{2\sigma_R^2 + 2\sigma_{RC}^2}}\right)\right]^2 dx.   (24)
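A minimal sketch of these moment integrals (Eqs. (22)–(24)) using standard quadrature, assuming SciPy is available (function names are ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def population_moments(mu_t, var_R, var_C, var_RC):
    """Evaluate M0 (Eq. 22), M4 (Eq. 23), and M5 (Eq. 24) numerically."""
    M0 = norm.cdf(mu_t / np.sqrt(2.0))

    def gauss(x, var):   # N(0, 2*var) density of a difference of two effects
        return np.exp(-x**2 / (4.0 * var)) / np.sqrt(4.0 * np.pi * var)

    M4, _ = quad(lambda x: gauss(x, var_R)
                 * norm.cdf((mu_t + x) / np.sqrt(2*var_C + 2*var_RC))**2,
                 -np.inf, np.inf)
    M5, _ = quad(lambda x: gauss(x, var_C)
                 * norm.cdf((mu_t + x) / np.sqrt(2*var_R + 2*var_RC))**2,
                 -np.inf, np.inf)
    return M0, M4, M5
```

As a sanity check, both M4 and M5 must lie between M0² and M0 (Jensen's inequality and p_r ≤ 1).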

Numerical calculation of these one-dimensional integrals is a readily tractable task.

2. Simulation Configurations
The relevant parameters for the simulation are listed in Table 2. We vary all the simulation parameters in a factorial design, yielding 3 × 3 × 3 × 3 × 3 × 3 = 729 total configurations. For each of these, we run 10,000 MC trials. Compared to the simulation parameters of Roe and Metz [6], we consider a broader range of reader variance (σ_R²) for the scores, especially on the high end. The range they considered was 1%–10% of the total; our range is 5%–83%.

Another factor that we investigate is how the cases are distributed among the readers. We investigate six study designs with the expected number of cases read by each

Table 2. Parameters Investigated in the Simulations According to a Factorial Design

Experimental design
  Performance:                PC = [0.96, 0.86, 0.70]
  No. of readers:             N_γ = [3, 5, 10]
  Mean no. of trials/cases:   N̄_g|r = [12, 51, 102]

Relative components of variance on the scores
  Reader:       σ_R² = [0.05, 0.10, 0.50]
  Case:         σ_C² = [0.05, 0.10, 0.50]
  Interaction:  σ_RC² = [0.05, 0.10, 0.50]


reader given in Table 2. Table 3 exemplifies the study designs with five readers and an average of 102 cases read by each reader. The first four of the study designs listed are doctor–patient study designs, the next is fully crossed, and the last has a unique hybrid structure that is neither fully crossed nor doctor–patient.

The first doctor–patient study design is flat; every reader reads the same number of cases. For the Poisson doctor–patient study design, the number of cases each reader reads is five cases plus a Poisson random variable with mean N̄_g|r − 5. For the uniform distributions, the number is selected from the interval [5, 2N̄_g|r − 5] for the broad distribution or [0.5 N̄_g|r, 1.5 N̄_g|r] for the moderate distribution. These distributions force a minimum of five readings per reader.

The final study design we consider is motivated by an observer study conducted by investigators at the National Cancer Institute. The observer study used a subset of images from the atypical squamous cells of undetermined significance (ASCUS) low-grade squamous intraepithelial lesion (LSIL) triage study known as ALTS [7,8]. In that study a small subset of the cases were read by all the study colposcopists. The remaining cases were each read by three readers. Here we have a data set of N_γ(N̄_g|r − 3)/3 cases. Each reader reads the first three cases of this data set; the remaining cases are each read by three randomly selected readers. The curious size of the data set is chosen so that the total number of readings for this study design is N_γ × N̄_g|r, the same total expected for the other study designs. We shall refer to this study design as the hybrid study design.

Finally, we consider both weighting methods mentioned above: equally weighing readers and equally weighing readings.
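The random case loads described above can be sketched as follows (a hypothetical helper with our own names; the interval endpoints follow the formulas in the text rather than the rounded labels in Table 3):

```python
import numpy as np

def case_loads(design, n_readers, nbar, seed=None):
    """Per-reader case loads for the doctor-patient study designs."""
    rng = np.random.default_rng(seed)
    if design == "flat":
        return np.full(n_readers, nbar)
    if design == "poisson":        # 5 plus a Poisson with mean nbar - 5
        return 5 + rng.poisson(nbar - 5, size=n_readers)
    if design == "broad":          # uniform on [5, 2*nbar - 5]
        return rng.integers(5, 2 * nbar - 4, size=n_readers)
    if design == "moderate":       # uniform on [0.5*nbar, 1.5*nbar]
        return rng.integers(nbar // 2, (3 * nbar) // 2 + 1, size=n_readers)
    raise ValueError(design)
```

Each scheme has expected load N̄_g|r = nbar, so the designs differ only in how unevenly the readings are spread across readers.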

3. SIMULATION RESULTS AND DISCUSSION
In what follows, we compare our variance estimators to the truth, the population quantities. Our population quantities are calculated from the integral expressions in Eqs. (22)–(24) and the MC averages for the coefficients. So the truth still has an element of uncertainty in it; the expected values of the nonlinear coefficients are intractable. To verify the integral expressions, we compare each population variance (from integration) to the sample variances of 10,000 independent MC performance estimates. A separate point is given for each of the 729 simulation configurations, 6 types of study designs, and both ways to weigh individual reader PCs in Fig. 2. Across all these simulation configurations, which cover a broad range of variances, the maximum difference found was 6% and the mean was −0.1%.

Expected variance. Before we assess our estimates, it is worthwhile to show the variances expected from all the experiments. Figure 3 shows the population variances for all the high-PC (0.96) simulation configurations compared to the expected values (from MC averaging) of the naïve variance estimates. The expected values of our moment estimators are unbiased and thus equal the population variances. At the bottom of each column of plots, the x axis is labeled according to the size of the simulated experiment. The 27 different components-of-variance configurations are then explored within each experiment size according to the reader component of variance (σ_R²). This sorting shows that the reader component of variance has a strong impact on the expected variance of the experiment. The size of the experiment also affects the experimental variance, though to a lesser degree. Additionally, the simulation configurations for lower PCs (0.86, 0.70; not shown) are quite similar to those given in Fig. 3 except that they are shifted upward. This behavior mimics the binomial variance, which increases with decreasing performance.

Interestingly, the overall scale across the different study designs is relatively constant for each experiment size. Recall that each study design has the same expected number of readings given the same experiment size. However, we can see that different study designs behave differently across different components-of-variance configurations. Regarding the impact of reader weights, V_γ = var(P̂_γ) lies on top of V_g = var(P̂_g) in all the plots except for the broad uniform doctor–patient study design (Fig. 3(d)). In that plot, V_g can be ±30% that of V_γ (notice some dots peeking out from behind the solid curve). What this means is that the variance of the reader-averaged PC does not depend on the reader weights except when the reader case loads are very different.

Finally, in each plot the naïve estimates bracket the true MRMC variances: ⟨V̂_naive_γ⟩ is biased high (the dotted

Table 3. One Example of the Six Distributions of Cases for Five Readers Reading 102 Cases on Average

                                 No. of Independent Cases per Reader
Study Design                     R1    R2    R3    R4    R5
Flat                             102   102   102   102   102
Poisson                          93    100   89    106   98
Moderate uniform [52, 152]       85    142   120   81    123
Broad uniform [12, 192]          141   25    171   118   74
Fully crossed                    Readers read same 102 cases
Hybrid                           Please refer to the text
Fig. 2. Population variances calculated using the integral expressions for the moments compared to those estimated from MC. A separate point is given for each of the 729 simulation configurations, 6 types of study designs, and both ways to weigh individual reader PCs.


Fig. 3. Population variances for all the high-PC (0.96) simulation configurations compared to the expected values (from MC averaging) of the naïve variance estimates. The expected values of our moment estimators are unbiased and thus equal the population variances. The naïve estimates bracket the true MRMC variances: V̂_naive_γ is biased high (dotted curve) and V̂_naive_g is biased low (dashed curve). In all but panel D, V_g lies on top of V_γ (notice some dots peeking out from behind the solid curve).

curve upper bound) and ⟨V̂_naive_g⟩ is biased low (the dashed curve lower bound). In the plots, V̂_naive_γ can be nine times the true variance, whereas V̂_naive_g can be as little as 2% of the true variance.

Root-mean-square error. Here we assess the variance estimators with the relative root-mean-square error (RRMSE), or

\mathrm{RRMSE} = \frac{1}{V}\left[\left(\langle \hat{V} \rangle - V\right)^2 + \mathrm{var}(\hat{V})\right]^{1/2},   (25)

where the first term in the brackets is the squared bias of the variance estimate and the second term is the variance of the variance estimate; the factor in front of the brackets scales the RMSE to the truth. Thus, the scale of the RRMSE can be interpreted as the total error given as a fraction of what we are trying to estimate. Of course, since our moment estimator is unbiased, its RRMSE can also be interpreted as just the standard deviation of our estimator relative to what is being estimated.

Figure 4 plots the RRMSE (×100%) for the fully crossed study design: Plot A shows the high PC (0.96), and Plot B shows the low PC (0.70). As for the previous plots, the x axis is labeled according to the size of the simulated experiment, while the different variance configurations are explored within each experiment size, sorted by the reader component of variance (σ_R²). Recall that for the fully crossed study design, equally weighing readers is

the same as equally weighing readings. Consequently, our MRMC variance estimators of P̂_γ and P̂_g are also equal: V̂_γ = V̂_g.

We first point out that at high PC (Fig. 4(a)) and with only three readers, the RRMSE of our MRMC estimators runs above 100% (solid curve). Three readers are not enough to do the MRMC variance estimation, and as the reader component of variance increases, the estimator gets even noisier. In this regime, the naïve estimator V̂_naive_g appears to be performing fairly well (dashed curve), that is, until we recall how biased it is (Fig. 3(e)). The bias of V̂_naive_γ, on the other hand, is driving the RRMSE to extreme values (dotted curve).

As the size of the experiment grows, the RRMSE of our MRMC estimator decreases, while that of V̂_naive_g does

not. That is to say, our MRMC estimator improves with more data, while the naïve estimator cannot adapt to the overdispersive nature of the data. Nonetheless, even with ten readers, each reading 102 cases on average, our MRMC estimator has too much error when the PC and reader variability are high. When the PC is lower (Fig. 4(b)), the estimation problem can be done with reasonable precision and accurracy. When there are ten readers and 50 cases in the experiment, the RRMSE of our MRMC estimator ranges between 20% and 40%. For the broad uniform study design (Figs. 5(a) and 5(b)), the overall story is similar, but we now see a differ-

Gallas et al.

Vol. 24, No. 12 / December 2007 / J. Opt. Soc. Am. A

Fig. 4.

RRMSE for the fully crossed study design: A, high PC (0.96); B, low PC (0.70).

ence from the reader weights. In experiments with little data and high PC, the error estimating Vg (dashed-dotted curve) is significantly larger than that estimating V␥ (solid curve). However, for the experiments with adequate readers (ten) and moderate PC, where the RRMSE ranges between 30% and 60%, the difference in errors becomes negligible. Finally, the RRMSE stories for the other study designs are similar to either that of the fully crossed or the broad uniform study designs. The hybrid study design, with its additional case correlations from cases being read by at least three readers, mimics the fully crossed study design. The other doctor–patient study designs mimic the broad uniform doctor–patient study design, although the differˆ and V ˆ are not as proences between the RRMSEs for V ␥ g nounced. In summary, the reader weights do not play a significant role in the total variance of the average PC except when the case loads are very different, as in the broad

Fig. 5.

B77

uniform study design. Additionally, it takes about ten readers and a moderate PC to reasonably estimate the MRMC variance. In this regime the error estimating V␥ and Vg is about the same. Finally, our MRMC estimator improves as more data are collected and performance is moderate; it is a consistent estimator. In contrast, the naïve estimators are not consistent; they do not get closer to the truth with more data.

4. CONCLUSIONS AND FUTURE WORK We have presented a framework for estimating the variance of a binary-outcome experiment that appropriately accounts for readers and cases as random effects. This framework is based on the larger one developed for estimating the MRMC variance of AUC [4,5] obtained according to a fully crossed study design. The MRMC variance of AUC has eight fundamental second-order moments of the success outcomes, whereas for the binary-outcome experi-

RRMSE for the broad uniform study design: A, high PC (0.96); B, low PC (0.70).

B78

J. Opt. Soc. Am. A / Vol. 24, No. 12 / December 2007

ment there are only four fundamental moments. We have also generalized the framework to accommodate any MRMC random or fixed study design. A fully crossed study design is not required, though we have highlighted it and another special study design, the doctor–patient study design. In addition to quantifying the uncertainty of the MRMC experiment conducted, the framework provided can be used to consider other study designs. For example, a small pilot study can be used to estimate the moments of the success outcomes. Then a larger pivotal study can be considered by simply changing the study design matrix, which will change the coefficients cគ . This larger pivotal study does not even need to be of the same type as the pilot study, as long as the appropriate moments have been estimated. We have examined our estimator with the MC simulation scheme developed by Roe and Metz.[6] This simulation was originally developed to investigate the Dorfman– Berbaum–Metz (DBM) linear-random-effects (components-of-variance) model of AUC [1] and has since served as a testbed for assessing other MRMC approaches [9–11]. Within our framework, we have also been able to derive integral expressions for numerically calculating the fundamental moments of the success outcomes for the Roe and Metz simulation. Extending these results to the eight fundamental moments of the MRMC variance of AUC is available upon request from the author and is being drafted for publication. This result ties off a loose end that has been present since the simulation model was developed. For a short discussion showing how the success moments are related to the components of variance, see Appendix B. The variance estimates presented are useful for the visual perception investigator performing clinical studies or human psychophysics experiments, as well as for the investigator developing models of the human or ideal observer. 
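The pilot-to-pivotal projection described above amounts to recomputing the design-dependent coefficients for a new design matrix while reusing the pilot estimates of the four moments. A minimal sketch (array layout and function names are ours; the coefficient formulas are Eqs. (A7)–(A10) of Appendix A, and the variance combines them as c1·M1 + c4·M4 + c5·M5 + (c8 − 1)·M8):

```python
import numpy as np

def mrmc_var_coefficients(D, w):
    """Coefficients c1, c4, c5, c8 of Eqs. (A7)-(A10) for a design
    matrix D (readers x cases; D[r, i] = 1 if reader r reads case i)
    and reader weights w summing to one."""
    D = np.asarray(D, dtype=float)
    w = np.asarray(w, dtype=float)
    n_r = D.sum(axis=1)                    # N_{g|r}: cases read by reader r
    c1 = np.sum(w**2 / n_r)
    c4 = np.sum(w**2) - c1
    # overlap[r, r'] = sum_i d_{ir} d_{ir'}, the cases shared by r and r'
    overlap = D @ D.T
    scaled = np.outer(w / n_r, w / n_r) * overlap
    c5 = np.sum(scaled) - c1               # drop the r' = r diagonal (= c1)
    c8 = (np.sum(w)**2 - np.sum(w**2)) - c5
    return c1, c4, c5, c8

def projected_variance(D, w, M1, M4, M5, M8):
    """Variance of the weighted average PC from the four fundamental
    moments (estimated, e.g., from a pilot study)."""
    c1, c4, c5, c8 = mrmc_var_coefficients(D, w)
    return c1 * M1 + c4 * M4 + c5 * M5 + (c8 - 1.0) * M8
```

For a fully crossed design (D all ones, equal weights) these coefficients reduce the variance to Eq. (B11) of Appendix B; for a doctor–patient design the case overlap between readers is zero, so c5 vanishes.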
For the latter, the utility comes to bear when the model observer is estimated from a finite set of training cases. If another set of cases of the same size were obtained, another estimate of the same model observer could be obtained. These two model-observer estimates can be thought of as samples from a population of readers. In this setting, an MRMC performance experiment can be run in which we generate a sample of readers (trained on independent sets of cases) and a sample of testing cases (cases independent of those used to train any observer). Performing an MRMC variance analysis on this experiment allows the investigator to account for the variability from training the model with a finite set of training samples and from testing the model with a finite set of test cases. Such an accounting is essential to model development and is starting to be appreciated in the field of computer-aided diagnosis and detection of disease [12].

One direction for future work in this area is to estimate MRMC covariances. The method we presented in this paper generalizes easily to estimating covariances when the readers and cases are paired across two reading conditions or modalities. Simply replace the success matrix with a difference of success matrices and proceed as described for the single-modality MRMC variance analysis. These covariances can be used to quantify the statistical


difference between the performance of a set of readers reading the same cases in two modalities, or the difference between two observer models.

Another direction for future work is to carry the general study design concepts over to AUC [5]. ROC scores are pooled across readers just as success outcomes are pooled here, and the subsequent variance analysis and hypothesis tests do not typically account for the fact that scores from several readers reading different cases are not identically distributed. For AUC, however, not only is the variance analysis wrong, but the pooled AUC itself can be quite different from the average reader AUC [13], especially when the readers use the ROC score axis differently.
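The covariance trick described earlier (replace the success matrix with a difference of paired success matrices and proceed as in the single-modality analysis) can be sketched for the fully crossed case. The U-statistic moment estimators below follow the approach of [4]; the function names and array conventions are ours:

```python
import numpy as np

def moments_fully_crossed(s):
    """Unbiased U-statistic estimates of the four fundamental moments
    M1, M4, M5, M8 from a fully crossed reader-by-case score matrix.
    The entries may be binary successes or, for the covariance trick,
    the difference of two paired success matrices."""
    s = np.asarray(s, dtype=float)
    nr, nc = s.shape
    row, col = s.sum(axis=1), s.sum(axis=0)
    tot, ss = s.sum(), np.sum(s * s)
    M1 = ss / (nr * nc)                                    # same reader, same case
    M4 = (np.sum(row**2) - ss) / (nr * nc * (nc - 1))      # same reader, diff cases
    M5 = (np.sum(col**2) - ss) / (nc * nr * (nr - 1))      # same case, diff readers
    cross = tot**2 - np.sum(row**2) - np.sum(col**2) + ss  # diff reader, diff case
    M8 = cross / (nr * (nr - 1) * nc * (nc - 1))
    return M1, M4, M5, M8

def var_of_avg_diff(sA, sB):
    """Variance of the reader-averaged difference in PC between two
    paired modalities, combining the moments as in Eq. (B11)."""
    nr, nc = np.asarray(sA).shape
    M1, M4, M5, M8 = moments_fully_crossed(np.asarray(sA) - np.asarray(sB))
    return (M4 - M8) / nr + (M5 - M8) / nc + (M1 - M4 - M5 + M8) / (nr * nc)
```

Note that the binary-only identity M8 = M1² no longer holds for the difference matrix (its entries take values −1, 0, 1), but the moment machinery itself does not rely on it.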

APPENDIX A: SECOND MOMENT, FIXED STUDY DESIGN

Here we assume that the design matrix and weights are fixed, and we calculate the second moment of P̂. It is

\[ \langle \hat P^2 \mid D \rangle = \Bigg\langle \Bigg( \sum_{r=1}^{N_\gamma} \frac{w_r}{N_{g|r}} \sum_{i=1}^{N_g} d_{ir} s_{ir} \Bigg)^{\!2} \Bigg\rangle. \tag{A1} \]

The squared sum over readers and cases is a quadruple sum that we separate into four parts:

\[ \begin{aligned} \langle \hat P^2 \mid D \rangle ={}& \sum_{r=1}^{N_\gamma} \frac{w_r^2}{N_{g|r}^2} \sum_{i=1}^{N_g} d_{ir}^2 \langle s_{ir}^2 \rangle + \sum_{r=1}^{N_\gamma} \frac{w_r^2}{N_{g|r}^2} \sum_{i=1}^{N_g} \sum_{i' \ne i}^{N_g} d_{ir} d_{i'r} \langle s_{ir} s_{i'r} \rangle \\ &+ \sum_{r=1}^{N_\gamma} \sum_{r' \ne r}^{N_\gamma} \frac{w_r w_{r'}}{N_{g|r} N_{g|r'}} \sum_{i=1}^{N_g} d_{ir} d_{ir'} \langle s_{ir} s_{ir'} \rangle + \sum_{r=1}^{N_\gamma} \sum_{r' \ne r}^{N_\gamma} \frac{w_r w_{r'}}{N_{g|r} N_{g|r'}} \sum_{i=1}^{N_g} \sum_{i' \ne i}^{N_g} d_{ir} d_{i'r'} \langle s_{ir} s_{i'r'} \rangle. \end{aligned} \tag{A2} \]

Since the readers and cases are iid, the moments in each line of the expression above do not depend on r, r′ or i, i′, which we define to coincide with notation previously derived for the empirical AUC [4,5]:

\[ M_1 = \langle s_{ir}^2 \rangle = \langle s(g,\gamma)^2 \rangle, \tag{A3} \]
\[ M_4 = \langle s_{ir} s_{i'r} \rangle = \langle \langle s(g,\gamma) \mid \gamma \rangle^2 \rangle, \tag{A4} \]
\[ M_5 = \langle s_{ir} s_{ir'} \rangle = \langle \langle s(g,\gamma) \mid g \rangle^2 \rangle, \tag{A5} \]
\[ M_8 = \langle s_{ir} s_{i'r'} \rangle = \langle s(g,\gamma) \rangle^2. \tag{A6} \]

It is worth pointing out that, for the binary-outcome problem, d_ir² = d_ir and s_ir² = s_ir. Therefore, M_8 = M_1².

Given that the moments in Eq. (A2) are independent of the readers r, r′ and the cases i, i′, we can see that the second moment is simply four moments weighted by four coefficients. The variance utilizes the same four coefficients, while subtracting 1 from the last coefficient to account for subtracting the mean squared from ⟨P̂²|D⟩. Therefore, after some algebraic manipulations, the coefficients are

\[ c_1 = \sum_{r=1}^{N_\gamma} \frac{w_r^2}{N_{g|r}}, \tag{A7} \]
\[ c_4 = \sum_{r=1}^{N_\gamma} w_r^2 - \sum_{r=1}^{N_\gamma} \frac{w_r^2}{N_{g|r}}, \tag{A8} \]
\[ c_5 = \sum_{r=1}^{N_\gamma} \sum_{r' \ne r}^{N_\gamma} \frac{w_r w_{r'}}{N_{g|r} N_{g|r'}} \sum_{i=1}^{N_g} d_{ir} d_{ir'}, \tag{A9} \]
\[ c_8 = \sum_{r=1}^{N_\gamma} \sum_{r' \ne r}^{N_\gamma} w_r w_{r'} - \sum_{r=1}^{N_\gamma} \sum_{r' \ne r}^{N_\gamma} \frac{w_r w_{r'}}{N_{g|r} N_{g|r'}} \sum_{i=1}^{N_g} d_{ir} d_{ir'}. \tag{A10} \]

Note that Eqs. (A9) and (A10) can be rewritten so that the sums over r′ do not need to skip the rth term, making the computer implementation more efficient. The general expression in Eq. (A2) simplifies for the study designs considered in this paper (see Table 1). For the fully crossed study design, d_ir always equals one, so sums over all i equal N_g and sums over i′ ≠ i equal N_g − 1. For doctor–patient study designs, readers never read the same cases, so the sum over i of d_ir d_ir′ always equals zero, and the sum over i and i′ ≠ i of d_ir d_i′r′ equals N_{g|r} N_{g|r′}.

It is also handy to derive the expected value of

\[ \hat p_r \hat p_{r'} = \sum_{i=1}^{N_g} \frac{d_{ir} s_{ir}}{N_{g|r}} \frac{d_{ir'} s_{ir'}}{N_{g|r'}} + \sum_{i=1}^{N_g} \sum_{i' \ne i}^{N_g} \frac{d_{ir} s_{ir}}{N_{g|r}} \frac{d_{i'r'} s_{i'r'}}{N_{g|r'}}. \tag{A11} \]

When the readers are the same, r′ = r,

\[ \langle \hat p_r^2 \rangle = \frac{1}{N_{g|r}} M_1 + \frac{N_{g|r}-1}{N_{g|r}} M_4, \tag{A12} \]

regardless of the study design. When the readers are different, r′ ≠ r, the expected value depends on the study design. For the fully crossed (FC) and doctor–patient (Dr–Pt) study designs,

\[ \langle \hat p_r \hat p_{r'} \mid \mathrm{FC} \rangle = \frac{1}{N_g} M_5 + \frac{N_g-1}{N_g} M_8, \tag{A13} \]
\[ \langle \hat p_r \hat p_{r'} \mid \mathrm{Dr\text{–}Pt} \rangle = M_8. \tag{A14} \]

APPENDIX B: COMPONENTS OF VARIANCE

In this section we relate our moment decomposition of the variance given in Eq. (4) to a components-of-variance (CofV) decomposition [1,2,14]. We begin by considering the distribution of reader skill; some readers are better than others. The skill of a reader is the success outcome for a given reader γ averaged over all cases in the population, or

\[ p_\gamma = \langle s(g,\gamma) \mid \gamma \rangle. \tag{B1} \]

We shall denote the mean and variance of this distribution, respectively, as

\[ \mu_\gamma = \langle p_\gamma \rangle = \langle \langle s(g,\gamma) \mid \gamma \rangle \rangle = M_0, \tag{B2} \]
\[ \sigma_\gamma^2 = \langle \langle s(g,\gamma) \mid \gamma \rangle^2 \rangle - \langle s(g,\gamma) \rangle^2 = M_4 - M_8. \tag{B3} \]

These two quantities arise naturally from the variance obtained when estimating the reader-specific performance of a random reader γ reading a random set of N_{g|γ} cases. Specifically,

\[ \operatorname{var}(\hat p_\gamma \mid N_{g|\gamma}) = \operatorname{var}(\langle \hat p_\gamma \mid \gamma, N_{g|\gamma} \rangle) + \langle \operatorname{var}(\hat p_\gamma \mid \gamma, N_{g|\gamma}) \rangle \tag{B4} \]
\[ = \operatorname{var}(p_\gamma) + \Bigg\langle \frac{p_\gamma (1-p_\gamma)}{N_{g|\gamma}} \Bigg\rangle \tag{B5} \]
\[ = \sigma_\gamma^2 + \frac{\mu_\gamma (1-\mu_\gamma) - \sigma_\gamma^2}{N_{g|\gamma}}. \tag{B6} \]

Likewise, we consider the distribution of case difficulty. The case difficulty is the success outcome for a given case g averaged over all readers in the population, or

\[ p_g = \langle s(g,\gamma) \mid g \rangle. \tag{B7} \]

As with reader skill, we have a mean and variance of the case difficulty,

\[ \mu_g = \langle p_g \rangle = \langle \langle s(g,\gamma) \mid g \rangle \rangle = M_0, \tag{B8} \]
\[ \sigma_g^2 = \langle \langle s(g,\gamma) \mid g \rangle^2 \rangle - \langle s(g,\gamma) \rangle^2 = M_5 - M_8. \tag{B9} \]

Instead of the development above, the DBM model starts by decomposing the performance into three random effects:

\[ \hat p_{G\gamma} = \bar\beta + \beta_\gamma + \beta_G + \beta_{G\gamma}, \tag{B10} \]

where G denotes a set of cases; β̄ denotes the average performance; β_γ is a random effect accounting for reader skill; β_G is a random effect accounting for the difficulty of the case set; and β_{Gγ} quantifies two random effects, a possible reader–case interaction and reader jitter. The interaction and reader-jitter effects are inseparable if there are no repeated readings. All the random effects are assumed to be independent zero-mean Gaussian random variables. The corresponding reader CofV is identical to σ_γ², and the case CofV equals σ_g² scaled per case set, or σ_g²/N_{g|r}.

At first, the variance of the interaction term is not obvious. The reason is that the variance of the interaction term depends on the study design: it depends on how the readers and cases are sampled and combined in the summary performance statistic. We can figure out the variance of the interaction term by starting with the total variance and organizing it according to reciprocal powers of N_γ and N_g, much as is done in the work of Barrett et al. [15,16]. For the fully crossed study design, we have

\[ \operatorname{var}(\hat P \mid \mathrm{FC}) = \frac{M_4 - M_8}{N_\gamma} + \frac{M_5 - M_8}{N_g} + \frac{M_1 - M_4 - M_5 + M_8}{N_\gamma N_g}. \tag{B11} \]

The first term is σ_γ²/N_γ, the second term is σ_g²/N_g, and we define the third term to be the variance of the interaction term (divided by N_γ N_g). For the flat doctor–patient study design,

\[ \operatorname{var}(\hat P \mid \mathrm{flat\ Dr\text{–}Pt}) = \frac{M_4 - M_8}{N_\gamma} + \frac{M_1 - M_4}{N_\gamma N_{g|r}}. \tag{B12} \]

Here we find a reader CofV (the numerator of the first term), but we do not find a case CofV. Instead, we see that the second term is divided by the number of readers and the number of cases read by each reader. Thus we define the numerator of the second term as the interaction term.
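The components-of-variance bookkeeping in Eqs. (B11) and (B12) is simple arithmetic on the four moments. A minimal sketch (function and variable names are ours):

```python
def components_fully_crossed(M1, M4, M5, M8, n_readers, n_cases):
    """The three terms of var(P|FC) in Eq. (B11): reader, case, and
    reader-case interaction contributions to the total variance."""
    var_reader = (M4 - M8) / n_readers                # sigma_gamma^2 / N_gamma
    var_case = (M5 - M8) / n_cases                    # sigma_g^2 / N_g
    var_inter = (M1 - M4 - M5 + M8) / (n_readers * n_cases)
    return var_reader, var_case, var_inter

def var_flat_doctor_patient(M1, M4, M8, n_readers, n_cases_per_reader):
    """Eq. (B12): a reader CofV appears, but no case CofV; the second
    term plays the role of the interaction term."""
    return (M4 - M8) / n_readers + (M1 - M4) / (n_readers * n_cases_per_reader)
```

Summing the three fully crossed terms reproduces the total variance obtained from the coefficient form of the estimator, which is a convenient self-check when implementing either decomposition.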

REFERENCES

1. D. D. Dorfman, K. S. Berbaum, and C. E. Metz, "Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method," Invest. Radiol. 27, 723–731 (1992).
2. S. V. Beiden, R. F. Wagner, and G. Campbell, "Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis," Acad. Radiol. 7, 341–349 (2000).
3. N. A. Obuchowski, S. V. Beiden, K. S. Berbaum, S. L. Hillis, H. Ishwaran, H. H. Song, and R. F. Wagner, "Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods," Acad. Radiol. 11, 980–995 (2004).
4. B. D. Gallas, "One-shot estimate of MRMC variance: AUC," Acad. Radiol. 13, 353–362 (2006).
5. B. D. Gallas and D. G. Brown, "Reader studies for validation of CAD systems," submitted to Neural Networks.
6. C. A. Roe and C. E. Metz, "Dorfman–Berbaum–Metz method for statistical analysis of multireader, multimodality receiver operating characteristic (ROC) data: validation with computer simulation," Acad. Radiol. 4, 298–303 (1997).
7. M. Schiffman and M. E. Adrianza, "ASCUS-LSIL triage study: design, methods and characteristics of trial participants," Acta Cytol. 44, 726–742 (2000).
8. J. Jeronimo, L. S. Massad, and M. Schiffman, "Visual appearance of the uterine cervix: correlation with human papillomavirus detection and type," Am. J. Obstet. Gynecol. 197, 47.e1–47.e8 (2007).
9. S. L. Hillis and K. S. Berbaum, "Monte Carlo validation of the Dorfman–Berbaum–Metz method using normalized pseudovalues and less data-based model simplification," Acad. Radiol. 12, 1534–1541 (2005).
10. S. L. Hillis, N. A. Obuchowski, K. M. Schartz, and K. S. Berbaum, "A comparison of the Dorfman–Berbaum–Metz and Obuchowski–Rockette methods for receiver operating characteristic (ROC) data," Stat. Med. 24, 1579–1607 (2005).
11. X. Song and X.-H. Zhou, "A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data," Biostatistics 6, 303–312 (2005).
12. W. A. Yousef, R. F. Wagner, and M. H. Loew, "Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach," IEEE Trans. Pattern Anal. Mach. Intell. 28, 1809–1817 (2006).
13. M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction (Oxford U. Press, 2003).
14. C. A. Roe and C. E. Metz, "Variance-component modeling in the analysis of receiver operating characteristic (ROC) index estimates," Acad. Radiol. 4, 587–600 (1997).
15. H. H. Barrett, M. A. Kupinski, and E. Clarkson, "Probabilistic foundations of the MRMC method," Proc. SPIE 5749, 21–31 (2005).
16. E. Clarkson, M. A. Kupinski, and H. H. Barrett, "A probabilistic model for the MRMC method. Part 1. Theoretical development," Acad. Radiol. 13, 1410–1421 (2006).
