Comment on the Button et al. (2013) neuroscience ‘power-failure’ article in NRN

Statistical Spidey knows the score.


An article was published in Nature Reviews Neuroscience yesterday which caused a bit of a stir among neuroscientists (or at least among neuroscientists on Twitter, anyway). The authors cleverly used meta-analytic papers to estimate the ‘true’ size of an effect, and then (using the G*Power software) calculated the power of each individual study that made up the meta-analysis, based on its sample size. Their conclusions are pretty damning for the field as a whole: an overall median power of 21%, dropping to 8% in some sub-fields. This means that out of 100 studies conducted on a genuine effect, only about 21 would be expected to actually demonstrate it.
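To make that calculation concrete, here is a minimal sketch of the same logic in Python, using statsmodels’ power routines rather than G*Power; the meta-analytic effect size and the per-group sample sizes below are made up purely for illustration, not taken from the paper:

```python
# Per-study power, taking the meta-analytic effect size as the 'true' effect.
# The effect size and group sizes here are illustrative, not from Button et al.
from statsmodels.stats.power import TTestIndPower

meta_analytic_d = 0.5                  # assumed 'true' effect size from a meta-analysis
study_group_sizes = [12, 18, 25, 40]   # hypothetical per-group ns of the primary studies

analysis = TTestIndPower()
for n in study_group_sizes:
    achieved_power = analysis.power(effect_size=meta_analytic_d, nobs1=n, alpha=0.05)
    print(f"n = {n} per group -> power = {achieved_power:.2f}")
```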

The article has been discussed and summarised at length by Ed Yong, Christian Jarrett, and by Kate Button (the study’s first author) on Suzy Gage’s Guardian blog, so I’m not going to re-hash it any further here. The original paper is actually very accessible and well-written, and I encourage interested readers to start there. It’s definitely an important contribution to the debate; however, as always, there are alternative perspectives. I generally have a problem with over-reliance on power analyses (they’re often required for grant applications and other project proposals). Prospective power analyses (i.e. those conducted before a piece of research begins, in order to tell you how many subjects you need) use an estimate of the effect size you expect to achieve – usually derived from previous work that has examined a (broadly) similar problem using (broadly) similar methods. This estimate is essentially a wild shot in the dark (especially because of some of the issues and biases discussed by Button et al., which are likely to operate in the literature), and the resulting power analysis therefore tells you (in my opinion) nothing very useful. Button et al. get around this issue by using the effect size from meta-analyses to estimate the ‘true’ effect size in a given literature area – a neat trick.
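To see why I call it a shot in the dark, here is a small sketch (again in Python with statsmodels, with guessed effect sizes chosen only for illustration) of how strongly the ‘required’ sample size depends on that initial guess:

```python
# Prospective power analysis: how many subjects per group for 80% power?
# The answer swings wildly with the guessed effect size (values are illustrative).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for guessed_d in (0.3, 0.5, 0.8):
    n_per_group = analysis.solve_power(effect_size=guessed_d, power=0.8, alpha=0.05)
    print(f"guessed d = {guessed_d}: ~{round(n_per_group)} subjects per group")
# Roughly 175, 64 and 26 per group respectively - a plausible range of guesses
# changes the 'required' sample size by a factor of almost seven.
```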

The remainder of this post deals with power issues in fMRI, since it’s my area of expertise, and it necessarily gets a bit technical. Readers who don’t have a somewhat nerdy interest in fMRI methods are advised to check out some of the more accessible summaries linked above. Braver readers – press on!

An alternative approach used in the fMRI field, and one that I’ve been following when planning projects for years, is a more empirical method. Murphy and Garavan (2004) took a large sample of 58 subjects who had completed a Go/No-Go task and analysed sub-sets of different sizes to see how the reproducibility of the results varies with sample size. They showed that reproducibility (assessed by correlating the statistical maps with the ‘gold standard’ derived from the entire dataset; Fig. 4) reaches 80% at about 24 or 25 subjects. By this criterion, many fMRI studies are underpowered.
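The logic is easy to sketch. Here is a toy version in Python – not Murphy and Garavan’s actual pipeline, and with simulated numbers standing in for real single-subject statistical maps – just to show the shape of the subsampling analysis:

```python
# Toy subsample-reproducibility analysis: correlate group maps from random
# subsamples with the 'gold standard' map from the full sample.
# All data here are simulated; a real analysis would use per-subject stat maps.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 58, 10_000

shared_pattern = rng.normal(0, 1, n_voxels)                               # common activation pattern
subject_maps = shared_pattern + rng.normal(0, 3, (n_subjects, n_voxels))  # plus subject-level noise

gold_standard = subject_maps.mean(axis=0)                                 # full-sample group map

for n in (8, 16, 24, 32, 40):
    correlations = []
    for _ in range(200):                                                  # random subsamples per size
        idx = rng.choice(n_subjects, size=n, replace=False)
        sub_map = subject_maps[idx].mean(axis=0)
        correlations.append(np.corrcoef(sub_map, gold_standard)[0, 1])
    print(f"n = {n:2d}: mean correlation with gold standard = {np.mean(correlations):.2f}")
```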

While I like this empirical approach to the issue, there are of course caveats and other things to consider. fMRI is a complex, highly technical research area, heavily influenced by the advance of technology. MRI scanners have improved significantly in the last ten years, with 32- or even 64-channel head coils becoming common, faster gradient switching, shorter TRs, higher field strengths, and better field/data stability all meaning that the signal-to-noise ratio has improved considerably. This serves to cut down one source of noise in fMRI data – intra-subject variance. The inter-subject variance of course remains what it always was, but that’s something that can’t really be mitigated, and may even be of interest in some (between-group) studies. On the analysis side, new multivariate methods are much more sensitive at detecting differences than the standard mass-univariate approach. This improvement in effective SNR means that the Murphy and Garavan (2004) estimate of 25 subjects for 80% reproducibility may be somewhat inflated, and with modern techniques one could perhaps get away with fewer.

The other issue with the Murphy and Garavan (2004) approach is that it’s not very generalisable. The Go/No-Go task is widely used and is a ‘standard’ cognitive/attentional task that activates a well-described brain network, but other tasks may produce more or less activation, in different brain regions. Signal-to-noise varies widely across the brain, and across task paradigms, with simple visual or motor experiments producing very large signal changes and complex cognitive tasks much smaller ones. Yet more factors are the experimental design (blocked or event-related stimuli), the overall number of trials/stimuli presented, and the total scanning time for each subject, all of which can vary widely.

The upshot is that there are no easy answers, and this is something I try to impress upon people at every opportunity – particularly the statisticians who read my project proposals and object to my not including power analyses. I think prospective power analyses are not only uninformative, but give a false sense of security, and for that reason should be treated with caution. Ultimately, the decision about how many subjects to test is generally highly influenced by other factors anyway (most notably time and money). You should test as many subjects as you reasonably can, and regard power analysis results as, at best, a rough guide.

About Matt Wall

I do brains. BRAINZZZZ.

Posted on April 11, 2013, in Commentary, Neuroimaging, Software. 9 Comments.

  1. Hi Matt, I can’t comment directly on the power analysis debate since I’m not a statistician and my expertise ends microseconds after the raw data leaves the MRI scanner. But there are a few points I’d like to emphasize and clarify:

    “MRI scanners have improved significantly in the last ten years, with 32- or even 64-channel head coils becoming common, faster gradient switching, shorter TRs, higher field strengths, and better field/data stability all meaning that the signal-to-noise ratio has improved considerably. This serves to cut down one source of noise in fMRI data – intra-subject variance.”

    Until someone demonstrates that this is the case I wouldn’t assume anything! Indeed, the use of higher-channel-count coils is increasing one form of motion sensitivity, not decreasing it. Also, if we consider techniques such as GRAPPA or multiband (aka simultaneous multislice) EPI, there are new spatial correlations compared to single-shot EPI and these have yet to be investigated fully, if at all. Finally, having higher SNR when the dominant N isn’t the electronic noise doesn’t necessarily help. If the noise is mostly physiologic then it’s entirely possible to be in the perverse situation where the improved scanner hardware has increased the physiologic noise, not decreased it! Again, to flog this point to death, we are building houses of cards with many assumptions and well-meaning predictions, but what we need are experimental proofs that fMRI is indeed “better” now than it was a decade ago. Prettier-looking images don’t necessarily make for better fMRI.

    “Yet more factors are the experimental design (blocked or event-related stimuli), the overall number of trials/stimuli presented, and the total scanning time for each subject, all of which can vary widely.”

    Indeed. Another common problem in fMRI is that we do extrapolate from one design to another without so much as a pause for validation. I would add to your list resting state fMRI. The field is already using essentially the same pre-processing pipeline for resting state as for task-based studies without, as far as I can tell, a whole lot of testing and validation. Just lots of untested assumptions. Thus, the same slice timing correction and motion correction algorithms will likely be applied to a set of 2D slices encoded with the new MB-EPI as for the old style plain EPI. Appropriate? You tell me!

    Until I saw a study yesterday that proposed to use resting state fMRI for pre-surgical planning (for epilepsy), my stock line was that we could get away with our paucity of validation because no lives were at stake. Soon, perhaps not. Perhaps as a field we will start to look a lot more closely at what goes into the sausages we’re making. There does seem to be a vanguard pushing for better stats; my hope is that we can stimulate an equivalent zeal for the acquisition and pre-processing.

  2. I think it is great to see an academic field engage in self-reflection and self-criticism and ask itself hard questions about the real value of its contributions. Overall this whole hullabaloo seems to me like a sign of good health for the neuropsych field.

  3. practiCalfMRI

    You beat me to it. I must commit to logging into twitter earlier in the day!

  4. Hi Matt,

    Although I of course agree that prospective power analyses should be regarded as not more precise than “educated guesses,” I vehemently disagree with your strong statement that prospective power analyses are “uninformative” and “not very useful.” I have two broad counter-points for you to consider:

    1. The untutored intuitions of most researchers in psychology and neuroscience about what kind of sample sizes would be required to have a reasonably well-powered study are *wildly, fantastically* bad. Consider for example a simple comparison between two independent groups. For some reason there is a pervasive intuition that a reasonably well-powered experiment for detecting a true difference in group means should have about 25 or 30 subjects in each of the two groups. (It is really not clear to me where these numbers come from, but there seems to be a surprising level of agreement among researchers in the field that something in this range is probably sufficient.) However, an elementary power analysis, assuming the canonical “medium” effect size (more on that issue below) of Cohen’s d = 0.5, reveals that such a study with n = 30 in each group has a power of only 0.47… worse than a fair coin flip! It turns out that for an effect size of d = 0.5, achieving a reasonably high level of power of, say, 0.8, would require 64 subjects per group, or 128 subjects in total. (Both figures are reproduced in the short code sketch at the end of this comment.)

    Of course, just as you suggest, this exact number (n = 128) should not be taken too terribly seriously. After all, it could very plausibly be that the effect size is somewhat larger or smaller than d = 0.5, and for that matter, maybe you think 0.8 is not quite right for your desired level of power. So there is some implicit uncertainty here. But for any reasonable tweaking of the input parameters that you try, you will virtually always find in this two-group case that the number of subjects per group necessary to reach your desired power level is not what you think… typically, it is *greater* than you think. We have known about these faulty intuitions at least since Tversky & Kahneman’s “Belief in the Law of Small Numbers” paper, and all evidence indicates that they are as prevalent today as ever. The take-home point here is: if you think you know the expected power of your study without needing to do a power analysis… you don’t.

    2. Students often object that “but I don’t *know* what the effect size is for the effect I’m studying in this experiment… if I knew the effect size already, I wouldn’t need to do the experiment!” No, you don’t know in advance what the true effect size is… obviously. But we *do* have pretty good base rate information about what the effect size is likely to be for almost any particular future study, based on decades of meta-analyses that look at all corners of the research literature at various different levels of granularity.

    For one particularly nice example: if all that you are willing to say for sure about the effect you are studying is that it broadly falls under the rubric of “social psychology,” well, Richard, Bond, & Stokes-Zoota (2003) inform us that the average effect size reported across all of social psychology for essentially the entire history of the field is (after conversion) about d = 0.43. So even in the absence of ANY other information you might have about the study you are planning to run, you can suppose that the expected value of your effect size is equal to the historical average, that is, d = 0.43. Richard et al. (2003) go further in helpfully reporting average effect size estimates for each of a large variety of sub-fields of social psychology. So if you can identify which sub-field best characterizes the experimental effect that you are about to study, you can get a slightly more precise educated guess about your expected effect size. And if you *further* know that a lot of people in this particular sub-field have spent a lot of time studying effects that look very much like the one you are about to study, then there will often be at least one even more specific meta-analysis out there in the literature for you to find, to obtain an even better educated guess about your expected effect size. And so on.

    It turns out that this kind of average effect size estimation has been done a LOT – it pretty much falls out of any meta-analysis – and the original, canonical levels suggested somewhat informally by Cohen for “small,” “medium,” and “large” effect sizes turn out to actually correspond pretty closely with values that we would today characterize as small, medium, or large based on a more systematic examination of typical effect sizes found in typical meta-analyses.

    I have a lot more I could say about power and why the pervasive neglect to properly consider it is a Very Bad Thing, but this note is way too long already. So in closing I’ll finally say: when researchers are left to design and conduct experiments using nothing more than their untutored intuition to determine how large experiments should be, what we end up with is (a) a vast proliferation of small, under-powered studies, which leads to (b) incredibly large “file drawers” of research efforts that were essentially doomed to fail from the start, and (c) a scientific literature chock-full of effects that are notoriously hard to replicate because their magnitudes have been substantially overestimated by requiring them to jump over the high statistical hurdle imposed by low power (see Gelman on the “statistical significance filter” and “Type M errors,” and add in the effects of low statistical power). A culture of routine power analysis, even just very basic and unrefined power analysis, would be a good step toward putting us back on the right track at least with respect to the above problems, and as I have outlined above, there is really nothing so difficult or mysterious about it.
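    [Editorial aside: for anyone who wants to check the figures quoted in this comment, here is a minimal Python sketch using statsmodels’ two-sample power routines and the standard r-to-d conversion. The numbers come from the comment itself; r ≈ .21 is assumed as the pre-conversion mean correlation that yields the d ≈ 0.43 mentioned above.]

    ```python
    # Checking the power figures and the effect size conversion quoted above.
    from math import sqrt
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Two independent groups, d = 0.5, n = 30 per group: power is only ~0.47.
    print(analysis.power(effect_size=0.5, nobs1=30, alpha=0.05))

    # Per-group n needed for 80% power at d = 0.5: ~64 (128 subjects in total).
    print(analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05))

    # Converting an assumed mean correlation of r ~ .21 to Cohen's d:
    # d = 2r / sqrt(1 - r^2), which gives ~0.43.
    r = 0.21
    print(2 * r / sqrt(1 - r**2))
    ```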

    • Dear Jake,

      Thanks so much for setting out your thoughts in detail here – I really appreciate it. I still think there are particular issues with power analyses for fMRI studies, but I’m totally on board with pretty much everything you say here – it is important stuff, and perhaps I was a little too flippant about it in the original post.

      The idea about taking effect size estimates from meta-analyses is an excellent one and, to be honest, it’s not something I’d really thought about before I read the Button et al. paper. It’s such an obviously good idea that I had one of those forehead-slapping ‘doh!’ moments.

      There could be an interesting little paper in there somewhere on experimenters’ intuitions about effect sizes, and how wrong they are – maybe it’s been done already, though.

      Thanks again Jake, really valuable comments.

      M.

