COMMENTARY

Statistics Commentary Series
Commentary #7—Statistical Inference: The Basics

David L. Streiner, PhD, CPsych

Assume that we have done a study comparing 2 drugs in their ability to control anxiety. At the end, we find that the 50 people taking drug A improved an average of 10 points on an anxiety scale and the 50 on drug B improved an average of 5 points. Why can't we just hang up our lab coats, submit our findings to some journal, and go out for a celebratory drink? Why do we have to waste our time doing things such as t tests, analyses of variance, and other arcane procedures? In other words, why do we need statistics (or statisticians, for that matter)?

The first reason seems so obvious that it hardly occurs to people: everyone is different. We are different not only in terms of age, height, weight, and other observable factors but also in terms of how we react to things. If every person placed on drug A improved by 10 points and each person taking drug B improved by 5 points, we wouldn't need statistics; we would simply enjoy our celebratory drink and publish our results. But the fact that people on drug A improved an average of 10 points hides the other fact that some people improved more than that and some less; indeed, it is quite likely that some people who took drug B may have improved more than some other people who took drug A. Statistics allows us to determine if the signal (how much the groups differ from each other) is greater than the noise (how much variability there is within each of the groups).

The other reason we need statistics is that, at one level, we are not concerned with the 100 people in our study. What we are concerned about is the degree to which we can generalize our findings to all people with anxiety; that is, what does our sample of 100 tell us about the population of all people who suffer from anxiety? This is the realm of inferential statistics. (The other realm, that of descriptive statistics, allows us—as the name implies—to describe the samples, using numbers and graphs.)

To begin with, let us move from measuring anxiety to systolic blood pressure (SBP), because we know what its mean and standard deviation (SD) are in the population (about 130 and 20 mm Hg, respectively), and it is relatively normally distributed, which is what we need for our purposes. (Very briefly, the SD is an index of how widely or narrowly the scores cluster around the mean.) Now imagine that a granting agency has been foolish enough to give us money to measure the SBP in the entire population, where the population is defined as adults in North America. If we were to plot these millions of values, we would find a normal curve with a mean of 130 and an SD of 20, as we would expect. On the basis of the properties of the normal curve, we know that most of the values will cluster around the mean. Roughly two thirds of the people would have values between +1 SD and −1 SD, or between 110 and 150. Similarly, slightly over 95% of the values will be between −2 and +2 SDs, or between 90 and 170. The more the SBPs deviate from the mean, the less likely they will occur, but they will occur (within the physiologically possible range).

Next, assume we draw a random sample of, say, 1000 individuals and plot their scores. Yet again, we would get a normal curve with the same mean and SD, and again, the same provisos will hold—most of the people will have blood pressures (BPs) near the mean; the more deviant the values, the lower their probability, but BPs far from the mean will appear a certain proportion of the time.

The third step involves drawing a sample of 100 cases and finding its mean.
We now repeat this step 1000 or so times and plot, not the scores of the individual people, but rather the means of these samples. What would we get? Again, it would be a normal curve, but this time with a mean of 130 (the population mean). However, the width of the curve would be considerably less. Let's think this through. When we measured individuals, a BP of 90 or less would be rare (occurring about 2.3% of the time) but well within the realm of possibility. If we measured 2 people and took their mean, an extreme value for 1 person would likely be counterbalanced by a more normal value for the other. The larger the group, the greater the chances that any extreme values will have little influence and the group mean will be closer to the population mean. By the time we get to a sample size of 100, most of the group means will be within a very narrow range.
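For readers who would like to see these steps concretely, here is a minimal simulation sketch in Python (using NumPy; the values of 130 and 20 mm Hg and the sample size of 100 are taken from the text, while the 1000 repetitions and the random seed are arbitrary). It draws repeated samples of 100 from the SBP population and shows that the sample means cluster far more tightly around 130 than the individual values do.

import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
pop_mean, pop_sd = 130.0, 20.0          # SBP mean and SD given in the text

# Steps 1 and 2: individual SBPs are spread widely around the mean.
individuals = rng.normal(pop_mean, pop_sd, size=1000)
print("SD of individual SBPs:", round(individuals.std(ddof=1), 1))    # roughly 20

# Step 3: draw 1000 samples of n = 100 and keep only each sample's mean.
sample_means = np.array([rng.normal(pop_mean, pop_sd, size=100).mean()
                         for _ in range(1000)])
print("Mean of the sample means:", round(sample_means.mean(), 1))     # roughly 130
print("SD of the sample means:", round(sample_means.std(ddof=1), 2))  # roughly 2

The last number, roughly 2, is the standard error of the mean discussed below: about one tenth of the SD of the individual values.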

From the Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton; and Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada. Reprints: David L. Streiner, PhD, CPsych, St Joseph's Healthcare, Mountain Campus, 100 W 5th St, Hamilton, Ontario, Canada L8N 3K7 (e‐mail: [email protected]). Copyright © 2015 Wolters Kluwer Health, Inc. All rights reserved. ISSN: 0271-0749 DOI: 10.1097/JCP.0000000000000283


TABLE 1. The 4 Possible Outcomes of a Study

                                           "Truth"
  Study Results                  Intervention Works            Intervention Does Not Work
  Intervention Worked            (Cell A) ✓                    (Cell B) Type I error (α)
  Intervention Did Not Work      (Cell C) Type II error (β)    (Cell D) ✓

More precisely, the SD of the mean scores will be SD/√(sample size), which, in this case, is 20/√100 = 20/10 = 2. The name of the SD of mean scores is the standard error of the mean (SEM), and we'll return to it in forthcoming commentaries. The important point is that, just as with individual scores, most group means will be near the population mean; the more the mean of a group deviates from the mean of the population, the less likely it is to arise, but it will occur.

Now let's go one last step (I can hear the cheers arising already). We will draw 2 random samples from the population, get their means, and then subtract the second from the first. If we do this 1000 times and plot the difference scores, what will we get? Yet again, it will be a normal curve, except that this time the mean will be 0 because, on average, we do not expect the groups to differ from each other. However, the normal curve implies that there will be instances where the group means do differ—sometimes the first group will have a higher mean, and sometimes the second. As before, the more the difference score deviates from zero—that is, the further out in the left or right tail of the distribution—the less likely its occurrence, but it will occur.

This fact lies at the heart of statistics. If we did a study in which we drew 2 samples from the population, gave 1 an experimental intervention and gave a placebo to the other, we would find a difference between the means. The question then arises, "Is that difference due to the fact that we treated the 2 groups differently, or did it arise by chance?" That is, is the difference real, or did we have the bad luck to get 2 samples that were out in the tails of the distribution, as in the last step of our example? The definitive answer is, we will never know for sure. The best we can do is to determine the probability that the difference was due to chance. If this probability is low enough, we can say that the result was more likely due to the intervention than to the play of chance. More specifically, we take a fairly conservative position and say that if the probability that the difference arose by chance is greater than 1 in 20 (ie, 5%), that is too high, and we would conclude that the experiment did not show any effect of the drug. This, then, is the basis for the magical P < 0.05 we so fervently wish for in any of our studies.
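To make the chance-only scenario concrete, here is a minimal sketch (again Python with NumPy; the group size of 50 echoes the anxiety example that opened this commentary and is purely illustrative, as are the seed and the 8 mm Hg cutoff). It repeatedly draws 2 groups from the same SBP population, with no intervention at all, and records the difference between their means.

import numpy as np

rng = np.random.default_rng(1)                    # arbitrary seed
pop_mean, pop_sd, n_per_group = 130.0, 20.0, 50   # group size is illustrative

# Draw 2 groups from the SAME population 1000 times; neither group is treated,
# so any difference between their means is due to chance alone.
diffs = np.array([rng.normal(pop_mean, pop_sd, n_per_group).mean()
                  - rng.normal(pop_mean, pop_sd, n_per_group).mean()
                  for _ in range(1000)])
print("Mean difference:", round(diffs.mean(), 2))             # close to 0
print("SD of the differences:", round(diffs.std(ddof=1), 2))  # roughly 4
# Large chance differences do occur, just rarely:
print("Proportion of |difference| > 8 mm Hg:", np.mean(np.abs(diffs) > 8))  # roughly 5%

An observed difference is judged against this kind of null distribution: values far out in either tail are possible when only chance is at work, but they are improbable.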

Before we go on, a word about the value of 0.05. These days, it has assumed the role of a sacred talisman: 0.051 and the study was an abject failure, 0.049 and it was a resounding success. But even Sir Ronald Fisher, the grandfather of most of our statistics, said:

No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.1

That is, would we lose faith in the efficacy of antipsychotic medications if 1 study reported a P level of 0.06? Likely not, because of the hundreds of previous studies showing that they work.


Conversely, would we believe in extrasensory perception if 1 study reported positive results with a P level of 0.04? In fact, 1 article reported a number of such results,2 and a journal has been publishing in the area since 1937, yet still few people believe in it. In the words of Carl Sagan, "Extraordinary claims require extraordinary proof," and Rosnow and Rosenthal3 said, "Surely God loves the .06 nearly as much as the .05." So bear in mind that 0.05 is just a convention, but one that, at least for now, we're stuck with.

To return to the main story, though, remember that we are dealing with probabilities and not certainties, no matter what P level we adopt. That is, if the results of the study say that P is less than 0.05, there is still a possibility, albeit small, that the results could have been due to chance. At the same time, if P was greater than 0.05, it is possible that the intervention actually had an effect. In other words, there are 4 possible outcomes, summarized in Table 1. The columns reflect what "truth" is: the intervention works or it doesn't. We never know that; all we have are the results of the study, represented by the rows—either it showed a positive effect of the intervention (ie, P was less than 0.05) or it did not (P was greater than 0.05).

In cell A, the study was successful, and the truth of the matter is that the drug worked, so we came to the correct decision. Similarly, in cell D, our conclusions were again correct, in that the truth is that the intervention does not work and our results were negative. So far so good, but unfortunately, there are still 2 cells remaining. In cell B, we conclude that there was a difference between the groups, but in fact the intervention is useless; we have committed what is called a type I (or α-type) error. How often does this occur? By definition, if we adopted the 0.05 level for P, it will happen 5% of the time. That is, if we had a completely useless drug but tested it in 100 studies, about 5 of them would show significant results. (Some cynics have postulated that it is only these 5% of studies that are published. This is probably too harsh, but it does contain a morsel of truth; positive findings are far more likely to be submitted to journals than negative ones4 and are more likely to be published.5) The opposite situation occurs in cell C: the intervention actually works, but the results of our study are negative. This is a type II (or β-type) error. In well-designed studies, the sample size is chosen so that this occurs 15% to 20% of the time.

Before we leave this topic, there is one more term we should introduce: power. We have talked about the probabilities of coming to incorrect decisions but not that of coming to a correct one. The probability of correctly concluding that an effect was present when in fact it was (ie, cell A) is called the power of the study or of the statistical test and is defined as (1 − β). In a later commentary, we shall discuss how to determine sample size and power, as well as why we use different values for α and β.
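Both of these error rates can be checked by simulation. The sketch below (Python, using NumPy and SciPy's independent-samples t test; the group size of 50, the 10-point true effect, the SD of 20, and the number of simulated studies are illustrative assumptions, not values from the commentary) runs many "studies" under 2 scenarios: a useless drug, where every significant result is a type I error, and a truly effective drug, where the proportion of significant results estimates the power.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)               # arbitrary seed
n_per_group, sd, n_studies = 50, 20.0, 2000  # illustrative values

def proportion_significant(true_difference):
    """Run repeated 2-group studies and return the share with P < 0.05."""
    significant = 0
    for _ in range(n_studies):
        drug = rng.normal(true_difference, sd, n_per_group)    # treated group
        placebo = rng.normal(0.0, sd, n_per_group)             # control group
        if stats.ttest_ind(drug, placebo).pvalue < 0.05:
            significant += 1
    return significant / n_studies

# Useless drug: the truth is "no effect", so every P < 0.05 is a type I error (cell B).
print("Type I error rate:", proportion_significant(0.0))   # close to 0.05

# Effective drug (a true 10-point difference): P < 0.05 is the correct decision (cell A).
print("Power:", proportion_significant(10.0))               # proportion of correct positives

With these illustrative numbers, the power falls short of the 80% to 85% implied by a β of 15% to 20%; choosing a sample size large enough to avoid that is precisely the calculation mentioned above.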


The takeaway messages from this commentary are the following: (1) the conclusions of any study are probabilities that the results are correct and are not definitive; (2) if the results are positive, there is still a chance that there was no effect; and (3) conversely, if the results were negative, there is a finite probability that we missed a real effect. This is why the statisticians' motto is that "Statistics means you never have to say you're certain."

AUTHOR DISCLOSURE INFORMATION
The author declares no conflicts of interest.

REFERENCES
1. Fisher RA. Statistical Methods and Scientific Inference. Edinburgh, Scotland: Oliver & Boyd; 1956.


2. Bem DJ. Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. J Pers Soc Psychol. 2011;100:407–425.
3. Rosnow RL, Rosenthal R. Statistical procedures and the justification of knowledge in psychological science. Am Psychol. 1989;44:1276–1284.
4. Dickersin K, Chan S, Chalmers TC, et al. Publication bias and clinical trials. Control Clin Trials. 1987;8:343–353.
5. Begg CB, Berlin JA. Publication bias: a problem in interpreting medical data. J R Stat Soc. 1988;151:419–463.
