Posted by Matt Wall
An article published in Nature Reviews Neuroscience yesterday caused a bit of a stir among neuroscientists (or at least among neuroscientists on Twitter). The authors cleverly used meta-analytic papers to estimate the ‘true’ size of an effect, and then (using the G*Power software) calculated the power of each individual study that made up the meta-analysis, based on its sample size. Their conclusions are pretty damning for the field as a whole: an overall power of 21%, dropping to 8% in some sub-fields. This means that out of 100 studies conducted into a genuine effect, only 21 would be expected to actually demonstrate it.
The article has been discussed and summarised at length by Ed Yong, Christian Jarrett, and by Kate Button (the study’s first author) on Suzy Gage’s Guardian blog, so I’m not going to re-hash it here. The original paper is actually very accessible and well-written, and I encourage interested readers to start there. It’s definitely an important contribution to the debate; however (as always), there are alternative perspectives. I generally have a problem with over-reliance on power analyses (they’re often required for grant applications and other project proposals). Prospective power analyses (i.e. those carried out before a piece of research is conducted, in order to tell you how many subjects you need) use an estimate of the effect size you expect to achieve, usually derived from previous work that has examined a (broadly) similar problem using (broadly) similar methods. This estimate is essentially a wild shot in the dark (especially because of some of the issues and biases discussed by Button et al. that are likely to operate in the literature), and the resulting power analysis therefore tells you (in my opinion) nothing very useful. Button et al. get around this issue by using the effect size from meta-analyses to estimate the ‘true’ effect size in a given literature area, which is a neat trick.
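To make the sensitivity of a prospective power analysis concrete, here is a minimal sketch in Python. It is not the G*Power calculation used by Button et al.; it assumes a simple two-group comparison and uses the textbook normal approximation, so the numbers are slightly more optimistic than an exact t-test calculation would give. The function names and the example effect sizes are my own illustrative choices.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    d is the assumed standardised effect size (Cohen's d).
    Uses a normal approximation, not the exact noncentral t.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate subjects per group needed to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


if __name__ == "__main__":
    # Hypothetical effect-size guesses: the answer swings wildly between them.
    for d in (0.3, 0.5, 0.8):
        print(f"assumed d = {d}: about {n_per_group_for_power(d)} subjects per group")
```

Because the required sample size scales with 1/d², a modest error in the guessed effect size changes the recommended number of subjects several-fold, which is the reason I find the output hard to take seriously when the input is a guess.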
The remainder of this post deals with power issues in fMRI, since it’s my area of expertise, and necessarily gets a bit technical. Readers who don’t have a somewhat nerdy interest in fMRI methods are advised to check out some of the more accessible summaries linked to above. Braver readers – press on!
An alternative approach used in the fMRI field, and one that I’ve followed for years when planning projects, is more empirical. Murphy and Garavan (2004) took a large sample of 58 subjects who had completed a Go/No-Go task and analysed sub-sets of different sizes to see how reproducible the results were at each sample size. They showed that reproducibility (assessed by correlation of the statistical maps with the ‘gold standard’ of the entire dataset; Fig. 4) reaches 80% at about 24 or 25 subjects. By this criterion, many fMRI studies are underpowered.
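The logic of this kind of sub-sampling analysis can be sketched in a few lines of Python. This is not Murphy and Garavan's actual pipeline: the data are simulated, and the number of voxels, the effect size, and the proportion of 'active' voxels are all arbitrary assumptions chosen purely for illustration.

```python
import random
from math import sqrt


def t_map(data):
    """One-sample t statistic per voxel; data is a list of per-subject voxel lists."""
    n = len(data)
    out = []
    for voxel in zip(*data):  # iterate over voxels, across subjects
        m = sum(voxel) / n
        var = sum((x - m) ** 2 for x in voxel) / (n - 1)
        out.append(m / sqrt(var / n))
    return out


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)


def simulate(n_subjects=58, n_voxels=200, seed=0):
    """Simulated 'activation' data: a quarter of voxels carry a true effect (assumed)."""
    rng = random.Random(seed)
    true = [0.8 if v < n_voxels // 4 else 0.0 for v in range(n_voxels)]
    return [[rng.gauss(mu, 1.0) for mu in true] for _ in range(n_subjects)]


def reproducibility(data, subset_size, n_draws=20, seed=1):
    """Mean correlation between sub-sample t-maps and the full-sample map."""
    rng = random.Random(seed)
    gold = t_map(data)  # 'gold standard' map from all subjects
    rs = [pearson(t_map(rng.sample(data, subset_size)), gold)
          for _ in range(n_draws)]
    return sum(rs) / len(rs)


if __name__ == "__main__":
    data = simulate()
    for n in (5, 10, 20, 40):
        print(n, round(reproducibility(data, n), 2))
```

One thing worth bearing in mind with this design is that each sub-sample is drawn from the same subjects that make up the gold standard, so the correlation is bound to approach 1 as the sub-sample approaches the full sample.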
While I like this empirical approach to the issue, there are of course caveats and other things to consider. fMRI is a complex, highly technical research area, and is heavily influenced by the advance of technology. MRI scanners have improved significantly in the last ten years: 32 or even 64-channel head-coils have become common, and faster gradient switching, shorter TRs, higher field strengths, and better field/data stability all mean that signal-to-noise has improved considerably. This serves to cut down one source of noise in fMRI data: intra-subject variance. The inter-subject variance of course remains the same as it always was, but that’s something that can’t really be mitigated, and may even be of interest in some (between-group) studies. On the analysis side, new multivariate methods are much more sensitive to differences than the standard mass-univariate approach. This improvement in effective SNR means that the Murphy and Garavan (2004) estimate of 25 subjects for 80% reproducibility may be somewhat inflated, and with modern techniques one could perhaps get away with fewer.
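The trade-off between the two sources of variance can be illustrated with the standard two-level (subjects and trials) variance formula. The sketch below is a simplification that treats each subject's estimate as an average over independent trials; the function name and all of the numbers are hypothetical, not taken from any real scanner or dataset.

```python
from math import sqrt


def group_se(sd_between, sd_within, n_subjects, n_trials):
    """Standard error of a group-mean effect under a simple two-level model.

    sd_between: inter-subject SD of the true effect
    sd_within:  intra-subject (measurement) SD of a single trial
    """
    return sqrt((sd_between ** 2 + sd_within ** 2 / n_trials) / n_subjects)


if __name__ == "__main__":
    # Hypothetical values: 25 subjects, 16 trials each.
    older = group_se(sd_between=1.0, sd_within=2.0, n_subjects=25, n_trials=16)
    newer = group_se(sd_between=1.0, sd_within=1.0, n_subjects=25, n_trials=16)
    floor = group_se(sd_between=1.0, sd_within=0.0, n_subjects=25, n_trials=16)
    print(round(older, 3), round(newer, 3), round(floor, 3))
```

Under these assumptions, halving the intra-subject noise does shrink the standard error, but the inter-subject term sets a floor that only more subjects can lower.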
The other issue with the Murphy and Garavan (2004) approach is that it’s not very generalisable. The Go/No-Go task is widely used and is a ‘standard’ cognitive/attentional task that activates a well-described brain network, but other tasks may produce more or less activation, in different brain regions. Signal-to-noise varies widely across the brain and across task paradigms, with simple visual or motor experiments producing very large signal changes and complex cognitive tasks producing smaller ones. Other factors include the experimental design (blocked or event-related), the overall number of trials/stimuli presented, and the total scanning time for each subject, all of which can vary widely.
The upshot is that there are no easy answers, and this is something I try to impress upon people at every opportunity, particularly the statisticians who read my project proposals and object to me not including power analyses. I think prospective power analyses are not only uninformative but also give a false sense of security, and for that reason should be treated with caution. Ultimately, the decision about how many subjects to test is generally heavily influenced by other factors anyway (most notably time and money). You should test as many subjects as you reasonably can, and regard power analysis results as, at best, a rough guide.