Posted by Matt Wall
A recent paper (Gronenschild et al., 2012) has caused a modicum of concern amongst neuroimaging researchers. The paper documents a set of results based on analysis of anatomical MRI images using a popular free software tool called FreeSurfer, and essentially reports that there are (sometimes quite substantial) differences in the results that it produces, depending on the exact version of the software used, and whether the analyses were carried out on a Mac (running OS X) or a Hewlett Packard PC (running Linux). In fact, even the exact version of OS X on the Mac systems was shown to be important in replicating results precisely.
The fact that results differ from one version of FreeSurfer to another is perhaps not so surprising – after all, we expect that newer versions of software should be ‘improved’ in important ways; otherwise, what would be the point in releasing them? However, the fact that results differ between operating systems is a little more worrying – in theory any operating system capable of running the software should produce the same result. The authors’ recommendations are that 1) researchers should not switch from one version/operating system/platform to another in the middle of a research project, and 2) when reporting results, the software version numbers and the workstation/OS used should all be documented. This seems broadly sensible.
It also got me thinking about neuroimaging software more generally. People don’t tend to do detailed evaluations of software of the kind reported by Gronenschild et al. (2012). As an enthusiastic user of several fMRI-related packages (I’m currently using SPM, FSL and BrainVoyager, all on different projects) I’ve often wondered what the real differences are between them, in terms of the results they produce. Given how many people around the world use brain imaging software, you might think that some detailed evaluations would be floating around, but in fact there are very few.
I think there are several reasons for this:
1. It’s (perhaps understandably) regarded as a waste of time. After all, we (meaning researchers who use this software) are generally more interested in how the brain works than in how software works. Neuroimaging is difficult and time-consuming and we all need to publish papers to survive – it makes more sense to spend our time on ‘real’ brain-related research.
2. Most people have one (or at most two) pieces of software that they like to use for neuroimaging, and they stick with it; I’m somewhat unusual in this respect. The fact that most people use just one package more-or-less exclusively means there’s a dearth of people who actually have the skills necessary to do cross-evaluation of packages. Again, this is understandable – why take the time to learn a new system, if you’re happy with the one you’re using?
3. The differences between the packages make precise comparison of end-results difficult. Even though all the packages use an application of the General Linear Model for basic analysis, other differences in pre-processing conceivably play a role. For instance, FSL handles the spatial transformation of functional data somewhat differently to other packages.
Having said that, there have been a few papers which have tried to do this kind of evaluation. Two examples are here (on motion correction) and here (on segmentation). Another somewhat instructive paper is this one, which summarises the results of a functional-imaging analysis contest held as part of the Human Brain Mapping meeting in Toronto in 2005; developers of popular neuroimaging software were all given the same set of data and asked to analyse it as best they could. Interesting stuff, but as the contestants all used somewhat different methods to get the most out of the data, it’s hard to draw direct comparisons.
If there’s a moral to this story, it’s that (as the recent Gronenschild et al. paper demonstrates) we need to pay close attention to this kind of thing. As responsible researchers we cannot simply assume our results will be replicable with different hardware and software. Detailed reporting of not just the analysis procedures, but also the tools used to achieve the results, seems a simple and robust way of at least acknowledging the issue and enabling more precise replication. Actually solving the issues involved is a substantially more difficult problem, and may be a job for future generations of researchers and developers.
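As a rough illustration of what that kind of reporting could look like in practice, here is a minimal Python sketch that collects the operating system, hardware and interpreter details of the machine it runs on. It is not taken from the Gronenschild et al. paper: the function name `describe_environment` and the `tool_versions` argument are my own inventions for the example, and the FreeSurfer version string is a placeholder that you would fill in by hand from your own installation.

```python
import json
import platform


def describe_environment(tool_versions=None):
    """Collect OS, hardware and interpreter details for a methods section.

    tool_versions is a hypothetical, hand-filled mapping of package name
    to version string; nothing here queries the packages themselves.
    """
    return {
        "os": platform.system(),            # e.g. "Linux" or "Darwin"
        "os_release": platform.release(),   # kernel / OS release string
        "os_version": platform.version(),   # detailed build information
        "machine": platform.machine(),      # hardware architecture
        "python": platform.python_version(),
        "tools": dict(tool_versions or {}),
    }


if __name__ == "__main__":
    # "x.y.z" is a placeholder, not a real or recommended version number.
    record = describe_environment({"FreeSurfer": "x.y.z"})
    print(json.dumps(record, indent=2, sort_keys=True))
```

Saving the printed record alongside the results of each analysis would at least make it obvious if the platform changed part-way through a project.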
My previous posts on comparisons of different fMRI software: Here, here and here.
Neuroskeptic has also written a short piece on the recent paper mentioned above.