Friday, December 28, 2012

which labels to permute?

Which labels should be permuted for a permutation test of a single person's classification accuracy? A quick look found examples of MVPA methods papers using all three possibilities: relabel training set only (Pereira 2011), relabel testing set only (Al-Rawi 2012), or relabel both training and testing sets (Stelzer 2012). Note that I'm considering class label permutations only, "Test 1" in the parlance of Ojala 2010.

This reminds me of the situation with searchlight shape: many different implementations of the same idea are possible, and we really need to be more specific when we report results: often papers don't specify the scheme they used.

Which permutation scheme is best? As usual, I doubt there is a single, all-purpose answer. I put together this little simulation to explore one part of the effect of the choice: what do the null distributions look like under each label permutation scheme? The short answer is that the null distributions look quite similar (normal and centered on chance), but there is a strong relationship between the proportion of labels permuted and accuracy when only the testing or training set labels are permuted; that relationship disappears when both are permuted.

simulation

These simulations use a simple mvpa-like dataset for one person, two classes (chance is 0.5), 10 examples of each class in each run, two runs, and full balance (same number of trials of each class in each run, no missings). I made the data by sampling from a normal distribution for each class, standard deviation 1, mean 0.15 for one class and -0.15 for the other, 50 voxels. I classified with a linear svm, c=1, partitioning on the runs (so 2-fold cross-validation). I used R; email me for a copy of the code. I ran the simulation 10 times (10 different datasets), with the same dataset used for each permutation scheme.

1500 label permutations of each sort (training-only, testing-only, both) were run, chosen at random from all those possible. I coded it up such that the same relabeling was used for each of the cross-validation folds when only the training or testing data labels were permuted (e.g. the classifier was trained on the 1st run of the real data, then the permuted label scheme was applied to the 2nd run and the classifier tested; then the classifier was trained on the 2nd run of the real data, the SAME permuted label scheme applied to the 1st run, and the classifier tested). This was simply for convenience when coding, but it restricts the number of possibilities; another example of how the same idea can be implemented multiple ways.
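
To make the setup concrete, here's a rough R sketch of one dataset and the training-only permutation scheme (my reconstruction for illustration, not the original code; it assumes the e1071 package for the linear svm, and all object names are mine):

    library(e1071)   # assumed here for svm(); any linear-svm implementation would do
    set.seed(101)    # arbitrary seed, just so the sketch is reproducible

    # two classes ("a", mean 0.15; "b", mean -0.15), 10 examples each per run, 2 runs, 50 voxels
    make.run <- function(run.id, n.per.class=10, n.vox=50) {
      data.frame(class=rep(c("a","b"), each=n.per.class), run=run.id,
                 matrix(rnorm(2*n.per.class*n.vox, mean=rep(c(0.15,-0.15), each=n.per.class)),
                        nrow=2*n.per.class, ncol=n.vox))
    }
    dset <- rbind(make.run(1), make.run(2))
    dset$class <- factor(dset$class)

    # 2-fold cross-validation, partitioning on the runs; optionally permute the training labels,
    # applying the SAME relabeling in both folds (as described above)
    classify <- function(dat, permute.train=FALSE) {
      perm.labels <- sample(dat$class[dat$run == 1])    # one random relabeling of a run's 20 labels
      mean(sapply(1:2, function(test.run) {
        train <- subset(dat, run != test.run); test <- subset(dat, run == test.run)
        if (permute.train) train$class <- perm.labels   # training-set-only permutation
        fit <- svm(class ~ ., data=train[,-2], kernel="linear", cost=1)   # drop the run column
        mean(predict(fit, test) == test$class)
      }))
    }
    classify(dset)                                                      # accuracy with the true labeling
    null.dist <- replicate(1500, classify(dset, permute.train=TRUE))    # training-only null distribution

The testing-only and permute-both schemes are analogous, differing only in whether the permuted labels are assigned to the testing set, the training set, or both.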

This is a moderately difficult classification: the average accuracy of the true-labeled data (i.e. not permuted) was 0.695, ranging from 0.55 to 0.775 over the 10 repetitions. The accuracy of each dataset is given in the plot titles, and by a reddish dotted line.

For each permutation I recorded both the accuracy and the proportion of labels matching between that permutation and the real labels. When both training and testing labels are permuted this is an average over the two cross-validation folds. I plotted the classification accuracy of each permutation against the proportion of labels matching in the permutation and calculated the correlation. Histograms of each variable appear along the axes. These graphs are complicated, but enlarge if you click on them.

Training set labels only permuted:

Testing set labels only permuted:

both Training and Testing set labels permuted:
  

Here are density plots of all 30 null distributions, overplotted. These are the curves that appear along the y-axis in the above graphs.

observations

There is a strong linear relationship between the number of labels changed in a permutation and its accuracy when either the training or testing set labels alone are shuffled: the more the relabeling resembles the true data labels, the better the accuracy. When all labels are permuted there isn't much of a relationship.

Despite the strong correlation, the null distributions resulting from each permutation scheme are quite similar (density plot overlap graph). This makes sense, since the relabelings are chosen at random, so relabelings quite similar and quite dissimilar to the true labeling are included. The null distributions would be skewed if the labels for the permutations were not chosen at random (e.g. centered above chance if only mostly-matching relabelings were used).

comments

Intuitively, I prefer the permute-both scheme: more permutations are possible, and the strong correlation is absent. But since the resulting null distributions are so similar, I can't say that permuting just one set of labels or the other is really worse, much less invalid. This is quite a simplified simulation; I think it would be prudent to check null distributions and relabeling schemes in actual use, since non-random label samplings may turn up.

references

  • Al-Rawi, M.S., Cunha, J.P.S., 2012. On using permutation tests to estimate the classification significance of functional magnetic resonance imaging data. Neurocomputing 82, 224-233. http://dx.doi.org/10.1016/j.neucom.2011.11.007
  • Ojala, M., Garriga, G.C., 2010. Permutation tests for studying classifier performance. Journal of Machine Learning Research 11, 1833-1863.
  • Pereira, F., Botvinick, M., 2011. Information mapping with pattern classifiers: A comparative study. Neuroimage 56, 476-496.
  • Stelzer, J., Chen, Y., Turner, R., 2013. Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. Neuroimage 65, 69-82.

Friday, December 7, 2012

Try R tutorial

R can have a steep learning curve if you're new to programming. Try R is a set of online tutorials that are visually attractive and light-hearted, but cover the basics. Worth a "try"!

Thursday, November 29, 2012

surface or volume searchlighting ... not a mixture

I saw a report with an analysis that used volumetric searchlights (spheres) within a grey matter mask (like the one at the right, though theirs was somewhat more dilated). The cortical thickness in the mask was usually around 10 voxels, and they ran a 3-voxel radius searchlight.

I do not think this is a good idea.

My short general suggestion is to do either volumetric searchlighting of the whole brain (or a large 3D subset of it - like the frontal lobes), or do surface-based searchlighting (e.g. Chen, Oosterhof), but not mix the two.

A fundamental problem is that a fairly small proportion of the searchlights will be fully in the brain using a volumetric grey matter mask.
In this little cartoon the yellow circles represent the spherical searchlights, and the two grey lines the extent of the grey matter mask. Many searchlights do not fully fall into the mask; the edges will be sampled differently than the center.

This cartoon tries to convey the same idea: only the strip in the middle of the mask is mapped by searchlights completely in the brain. If informative voxels are evenly distributed in the grey matter strip, searchlights in the middle (fully in the brain) could be more likely to be significant (depending on the exact classifier, information distribution, etc.), a distorted impression.

I've heard it suggested that it's better to use a grey matter mask because that's where the information should be. I don't think that's a good motivation. For one thing, running the searchlight analysis on the whole brain can serve as a control: if the "most informative areas" turn out to be the ventricles, something went wrong. For another, spatial normalization is not perfect. Depending on how things are run (searchlight on spatially-normalized brains or not, etc), there is likely to be some blurring of white and grey matter in the functional images. Letting the searchlights span both may capture more of the information actually present.

One final point. Volumetric searchlighting will put non-strip-wise-adjacent areas (areas touching across a sulcus) into the same searchlight. This might be ok, given the spatial uncertainties accepted in some fMRI. But if you want to avoid this, surface-based methods are the way to go, not a volumetric grey matter mask.

Monday, November 19, 2012

postdoc position: Susanne Quadflieg's lab

Susanne Quadflieg is looking for a postdoc. The position is fMRI and social psychology research, not MVPA methods, but might interest some of you.

permutation testing: feedback

MS Al-Rawi sent some feedback about my last post and kindly agreed that I could post it here. His comments/questions are in quote blocks, my responses in regular type.

"1- The title: 'groups of people'. Are you trying to perform permutation testing for a group of subjects (people), so, is it 'groups of people', or group analysis?"

I meant group analysis (aka 2nd level analysis): "is this effect present in these subjects?" The goal is to generate a significance level for the group-level statistic, such as the accuracy averaged across subjects.

"2- You also wrote; 'Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person'. So, how do you get this set of  'p unique relabelings', or you don't have to find them, you just do find p times relabelings?"
Sorry for the confusion; I was struggling a bit with terminology. What I mean is that there is a single set of label permutations which are used for every person, rather than generating unique label permutations for everyone. The set of label permutations is created in the 'usual' way. If the number of labels is small all possible permutations should be used, but when there are too many for that to be practical a random subset can be generated.

For example, suppose there are two classes (a and b) and three examples of each class. If the true labeling (i.e. row order of the data table) is aaabbb, possible permutations include aabbba, ababab, etc.
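
For example, in R the full set for this toy case can be listed by choosing which three of the six positions get an "a" (a small illustration; the names are mine):

    a.positions <- combn(6, 3)   # 20 columns, each a choice of positions for the three "a"s
    relabelings <- t(apply(a.positions, 2, function(idx) { x <- rep("b", 6); x[idx] <- "a"; x }))
    nrow(relabelings)            # 20 possible relabelings (aaabbb itself is among them)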

"3- 'applied such that there is a linking between the permutations in each person'. I cannot figure out what that is? I can tell that it gives a biased distribution, either all yielding best accuracy(s) that are closer to the real one, or all yielding closer to chance level accuracy."
What I meant was that, if all subjects have the same number and type of examples, the same (single) permutation scheme can be used for everyone. To continue the example, we would calculate the accuracy with the aabbba relabeling in every subject, then the accuracy with the ababab relabeling, etc.

I guess this gives a biased distribution in the sense that fewer relabelings are included ... When a random subset of the permutations has to be used (because there are too many to calculate them all), under scheme 2 you could generate a separate set of permutations for each person (e.g. aabbba might be used with subject 3 but not subjects 1, 2, or 4). The single set of permutations used for everyone is not biased (assuming it was generated properly), but does sample a smaller number of relabelings than if you generate unique relabelings for each person.

"4- Since each person/subject (data) has its own classifier (or a correlated set of classifiers due to using k-fold cross validation), is it legitimate to take the mean of each of the 'p unique relabelings' (as you showed in scheme 1)?"

This is one of the reasons why I made the post: to ask for opinions on its legitimacy! It feels more fair to me to take the across-subjects mean when the same labeling is used for everyone than averaging values from different labelings. But a feeling isn't proof, and it's possible that scheme 1 is too stringent.

"5- The formula; ((number of permutation means) > (real mean)) / (r + 1)
I think using ((number of permutation means) > (real mean) +1 ) / (r + 1) would prevent getting p=0 value when the classification accuracy is highly above change. We shouldn't be getting 0 in any case, but that is most probably to happen because the number of permutations is limited by the computational power, and the randomization labeling might not be perfectly tailored enough to generate an idealistic distribution (e.g. at one permutation should capture the original labeling or one that is highly close to it and thus give high classification accuracy)."
Yes, I think this is right: if the true labeling is better than all 1000 permutations we want the resulting p-value to come out as 0.001, not 0.
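
In code, that convention is just one line (a sketch, with hypothetical vector names):

    # null.accs: the permutation means; real.acc: the mean accuracy with the true labeling
    p.value <- (sum(null.accs > real.acc) + 1) / (length(null.accs) + 1)   # 1/1001, not 0, if unbeaten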

Tuesday, November 13, 2012

permutation testing: groups of people

Here I describe three ways to do a group permutation test. The terminology and phrasing is my own invention; I'd love to hear other terms for these ideas.

First, the dataset.

I have data for n people. Assume there are no missings, the data is balanced (equal numbers of both cases), and that the number of examples of each case is the same in all people (e.g. 10 "a" and 10 "b" in each of three runs for every person).
For each person there is a single "real accuracy": the classification of their data with the true labeling. We can average these n values for the real group average accuracy.

Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person. In other words suppose relabeling 1 = abaabba ... aabbb. The "perm 1" box for each person is the accuracy obtained when relabeling 1 is used for that person's data; permutations of the same number for each person use the same relabeling.

Next, we generate the null distribution.

Here are the three schemes for doing a group permutation test. In the first, the linking between permutations described above is maintained: p group means are calculated, one for each of the p unique relabelings. I've used this scheme before.

Alternatively, we could disregard the linking between permutations for each person, selecting one accuracy from each person to go into each group mean. Now we are not restricted to p group means; we can generate as many as we wish (well, restricted by the number of subjects and permutations, but this is usually a very, very large number). In my little sketch I use the idea of a basket: for each of the r group means we pick one permutation accuracy from each person at random, then calculate the average. We sample with replacement: some permutations could be in multiple baskets, others in none. This is the scheme used in (part of) Stelzer, Chen, & Turner, 2012, for example.

Or, we could fill our baskets more randomly, ignoring the separation into subjects: for each of the r group means we draw n permutation values, disregarding the subject identities. This is closer to the common idea of bootstrapping, but I don't know of any examples of using this with neuroimaging data.
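
To make the three schemes concrete, here's a compact R sketch, assuming the per-person permutation accuracies have already been collected into a matrix (all object names are hypothetical):

    # perm.accs: n x p matrix; perm.accs[i,j] = accuracy for subject i under relabeling j
    n.subj <- nrow(perm.accs); p <- ncol(perm.accs); r <- 10000
    # scheme 1 ("stripes"): one group mean per relabeling, keeping the linking across people
    null1 <- colMeans(perm.accs)
    # scheme 2 ("baskets"): r group means, each averaging one randomly-drawn accuracy per subject
    null2 <- replicate(r, mean(perm.accs[cbind(1:n.subj, sample(p, n.subj, replace=TRUE))]))
    # scheme 3: r group means, each averaging n accuracies drawn ignoring the subject identities
    null3 <- replicate(r, mean(sample(perm.accs, n.subj, replace=TRUE)))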

Finally, the null distribution.

Once we have the collection of p or r group means we can generate the null distribution, and find the rank (and thus the p-value) of the real accuracy, as usual in permutation testing. To be complete, I usually calculate the p-value as ((number of permutation means) > (real mean)) / (r  + 1). I actually don't think this exact equation is universal; it's also possible to use greater-than-or-equal-to for the rank, or r for the denominator.

Thoughts.

I feel like the first scheme ("stripes") is the best, when it is possible. Feelings aren't proof. But it feels cleaner to use the same relabeling scheme in each person: when we don't, we are putting a source of variability into the null distribution (the label scheme used for each person) that isn't in the real data, and I always prefer to have the null distribution as similar to the real data as possible (in terms of structure and sources of variance).

A massive problem with the first scheme is that it is only possible when the data is very well-behaved, so that the same relabeling scheme can be applied to all subjects. As soon as even one person has some missing data, the entire scheme breaks. I've used the second scheme ("balanced baskets") in this case, usually by keeping the "striping" whenever possible, then bootstrapping the subjects with missing data (this is sort of a hybrid of the first two schemes: I end up with p group means, but the striping isn't perfect).

A small change to the diagram gets us from the second to the third scheme (losing the subject identities). This of course steps the null distribution even further from the real data, but that is not necessarily a bad thing. Stelzer, Chen, & Turner (2012) describe their scheme 2 as a "fixed-effects analysis" (line 388 of the proof). Would scheme 3 be like a random-effects analysis?

So, which is best? Should the first scheme be preferred when it is possible? Or should we always use the second? Or something else? I'll post some actual examples (eventually).

Wednesday, October 31, 2012

needles and haystacks: information mapping quirks

I really like this image and analogy for describing some of the distortions that can arise from searchlight analysis: a very small informative area ("the needle") can turn into a large informative area in the information map ("the haystack"), but the reverse is also possible: a large informative area can turn into a small area in the information map ("haystack in the needle").

I copied this image from the poster Matthew Cieslak, Shivakumar Viswanathan, and Scott T. Grafton presented last year at SfN (poster 626.16, Fitting and Overfitting in Searchlights, SfN2011). The current article covers some of the same issues as the poster, providing a mathematical foundation and detailed explanation.

They step through several proofs of information map properties, using reasonable assumptions. One result I'll highlight here is that the information map's representation of a fixed-size informative area will grow as searchlight radius increases (my phrasing, not theirs). Note that this (and the entire paper) describes the single-subject, not the group, level of analysis.

This fundamental 'growing' property is responsible for many of the strange things that can appear in searchlight maps, such as the edge effects I posted about here. As Viswanathan et al. point out in the paper, it also means that interpreting the number of voxels found significant in a searchlight analysis is fraught with danger: it is affected by many factors other than the amount and location of informative voxels. They also show that just 430 properly-spaced informative voxels can cause the entire brain to be marked as informative in the information map, using just 8 mm radius searchlights (not particularly large in the literature).

I recommend taking a look at this paper if you generate or interpret information maps via searchlight analysis, particularly if you have a mathematical bent. It nicely complements diagram- and description-based explanations of searchlight analysis (including, hopefully soon, my own). It certainly does not include all the aspects of information mapping, but provides a solid foundation for those it does include.


Shivakumar Viswanathan, Matthew Cieslak, & Scott T. Grafton (2012). On the geometric structure of fMRI searchlight-based information maps. arXiv: 1210.6317v1

Thursday, October 25, 2012

permuting searchlight maps: Stelzer

Now to the proposals in Stelzer, not just their searchlight shape!

This is a dense methodological paper, laying out a way (and rationale) to carry out permutation tests for group-level classifier-based searchlight analysis (linear svm). This is certainly a needed topic; as pointed out in the article, the assumptions behind t-tests are certainly violated in searchlight analysis, and using the binomial is also problematic (they suggest that it is too lenient, which strikes me as plausible).

Here's my interpretation of what they propose (a rough R sketch of steps 2 to 4 follows the list):
  1. Generate 100 permuted searchlight maps for each person. You could think of all the possible label (i.e. class, stimulus type, whatever you're classifying) rearrangements as forming a very large pool. Pick 100 different rearrangements for each person and do the searchlight analysis with this rearrangement. (The permuted searchlight analysis must be done exactly as the real one was - same cross-validation scheme, etc.)
  2. Generate 100,000 averaged group searchlight maps. Each group map is made by picking one permuted map from each person (out of the 100 made for each person in step 1) and averaging the values voxel-wise. In other words, stratified sampling with replacement.
  3. Do a permutation test at each voxel, calculating the accuracy corresponding to a p = 0.001 threshold. In other words, at each voxel you record the 100th biggest accuracy after sorting the 100,000 accuracies generated in step 2. (100/100000 = 0.001)
  4. Threshold the 100,000 permuted group maps and the one real-labeled group map using the voxel-wise thresholds calculated in step 3. Now the group maps are binary (pass the threshold or not).
  5. Apply a clustering algorithm to all the group maps. They clustered voxels only if they shared a face. I don't think they used a minimum cluster size, but rather called un-connected voxels clusters of size 1 voxel. (This isn't really clear to me.)
  6. Count the number of clusters by size in each of the 100,000 permuted maps and 1 real map. (this gives counts like 10 clusters with 30 voxels in map #2004, etc.)
  7. Generate the significance of the real map's clusters using the counts made in step 6. I think they calculated the significance for each cluster size separately then did FDR, but it's not obvious to me ("Cluster-size statistics" section towards end of "Materials and Methods").
  8. Done! The voxels passing step 7 are significant at the cluster level, corrected for multiple comparisons (Figure 3F of paper). The step 4 threshold map can be used for uncorrected p-values (Figure 3E of paper).
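
Here's a rough R sketch of how I picture steps 2 to 4 (my reading, not the authors' code; the object names are hypothetical, and in practice you would work voxel-by-voxel or in chunks rather than holding 100,000 full maps in memory):

    # perm.maps: array of single-subject permuted accuracy maps, subject x permutation x voxel
    n.subj <- dim(perm.maps)[1]; n.perm <- dim(perm.maps)[2]; n.vox <- dim(perm.maps)[3]
    n.group <- 100000
    # step 2: group maps, each the voxel-wise mean of one randomly-drawn permuted map per subject
    group.maps <- t(replicate(n.group, {
      picks <- sample(n.perm, n.subj, replace=TRUE)
      rowMeans(sapply(1:n.subj, function(s) perm.maps[s, picks[s], ]))
    }))
    # step 3: voxel-wise accuracy cutoff at p = 0.001 (the 100th-largest of the 100,000 group values)
    vox.cutoff <- apply(group.maps, 2, function(x) sort(x, decreasing=TRUE)[n.group/1000])
    # step 4: binarize the real group map (and each permuted group map) at these cutoffs
    real.binary <- real.group.map > vox.cutoff   # real.group.map: voxel-wise mean of true-label maps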

Most of this strikes me as quite reasonable. I've actually previously implemented almost this exact procedure (minus the cluster thresholding) on a searchlight dataset (not linear svms).

The part that makes me twitch the most is step 2: turning the 100 maps for each person into 100,000 group-average maps. I've been wanting to post about this anyway in the context of my ROI-based permutation testing example. But in brief, what makes me uncomfortable is the way 100 maps turn into 100,000. Why not just calculate 5 for each person? With the 12 subjects they had in some of the examples, 5 permutations per person already allow 5^12 (roughly 244 million) possible group averages, far more than 100,000. Somehow 100 for each person feels more properly random than 5 for each person, but how many are really needed to properly estimate the variation? I will expand on this more (and give a few alternatives), hopefully somewhat soon.

The other thing that makes me wonder is the leniency. They show (e.g. Figure 11) that many more voxels are called significant with their method than with a t-test, claiming that this is closer to the truth. This relates to my concern about how to combine over subjects: using 100,000 group maps allows very small p-values. But if the 100,000 aren't as variable as they should be, the p-values will be inflated.


Stelzer, J., Chen, Y., & Turner, R. (2012). Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. NeuroImage. DOI: 10.1016/j.neuroimage.2012.09.063



UPDATE (30 October): We discussed this paper in a journal club and a coworker explained that the authors do explain the choice of 100 permutations per person in Figure 8 and the section "Undersampling of the permutation space". They made a dataset with one searchlight and many examples (80, 120, 160), then varied the number of permutations they calculated for each individual (10, 100, 1000, 10,000). They then made 100,000 group "maps" as before (my step 2), drawing from each group of single-subject permutations. Figure 8 shows the resulting histograms: the curves for 100, 1000, and 10,000 individual permutations are quite similar, which they use as rationale for running 100 permutations for each person (my step 1).

I agree that this is a reasonable way to choose a number of each-person permutations, but I'm still not entirely comfortable with the way different permutation maps are combined. I'll explain and show this more in a separate post.

searchlight shapes: Stelzer

This is the first of what will likely be a series of posts on a paper in press at NeuroImage:

Stelzer, J., et al., Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA). NeuroImage (2012), http://dx.doi.org/10.1016/j.neuroimage.2012.09.063

There is a lot in this paper, touching some of my favorite topics (permutation testing, using the binomial, searchlight analysis, Malin's 'random' searchlights).
But in this post I'll just highlight the searchlight shapes used in the paper. They're given in this sentence: "The searchlight volumes to these diameters were 19 (D=3), 57 (D=5), 171 (D=7), 365 (D=9), and 691 (D=11) voxels, respectively." The authors don't list the software they used; I suspect it was custom matlab code.

Here I'll translate the first few sizes to match the convention I used in the other searchlight shape posts:
diameter 3 (radius 1): 18 surrounding voxels. This looks like my 'edges or faces touch' searchlight.
diameter 5 (radius 2): 56 surrounding voxels. This has more voxels than the 'default' searchlight, but less than my two-voxel radius searchlight. Squinting at Figure 1 in the text, I came up with the shape below.


Here's the searchlight from Figure 1, and my blown-up version for a two-voxel radius searchlight.
It looks like they added the plus signs to the outer edges of a three-by-three cube. This doesn't follow any of my iterative rules, but perhaps would result from fitting a particular sphere-type rule.

Tuesday, October 23, 2012

many options: frightening results

If you need a good scare this Halloween season, I suggest reading On the plurality of (methodological) worlds: estimating the analytic flexibility of fMRI experiments.

This is not an MVPA paper, but I have no doubt its conclusions are just as relevant to MVPA. Joshua Carp took one fMRI dataset and constructed no fewer than 34,560 significance maps (i.e. mass-univariate group statistical maps), derived from 6,912 analysis pipelines (e.g. smoothing or not, slice-time correction or not) and various statistical choices (e.g. FDR or RFT corrections).

The scary part is that these various analysis choices are all reasonable and in use, but produced different results. To quote from his conclusion, "While some research outcomes were relatively stable across analysis pipelines, others varied widely from one pipeline to another. Given the extent of this variability, a motivated researcher determined to find significant activation in practically any brain region will very likely succeed–as will another researcher determined to find null results in the same region."

For just one highlight, Figure 4 shows the peak voxels identified in the 6,912 pipelines, color-coded by the number of pipelines in which each was peak. The color bar maxes at 526: no voxel was peak in more than 526 of the 6,912 maps. But the peaks are not distributed randomly: they're grouped in anatomically sensible ways (which is good).

This particular map reinforces my bias towards ROI-based analyses: should we really be interpreting tiny blobs or coordinate locations when they can be so susceptible to being shifted by reasonable analysis choices?

I am reminded of Simmons et al.'s recommendations for describing results. We must be more disciplined and stringent about the sensitivity of our results to somewhat arbitrary choices, and more forgiving of less-than-perfect results when reviewing.

I certainly don't think that these results indicate that we should all give up, abandoning all fMRI analysis. But we should be even more skeptical about our results. Do they only appear in one 'magic' pipeline? Or do they more-or-less hold over perturbations in thresholds and processing?

Carp, J. (2012). On the Plurality of (Methodological) Worlds: Estimating the Analytic Flexibility of fMRI Experiments. Frontiers in Neuroscience, 6. DOI: 10.3389/fnins.2012.00149

Ah, I see that Neuroskeptic also commented on this paper.

Monday, October 22, 2012

Thursday, October 18, 2012

some SfN reflections

I'm now back from SfN and sorting through my notes. It was great to see familiar people again, and to meet some people I only knew from email and articles.

I thought I'd share here a few of my MVPA-related impressions, in no particular order. These are of course personal and no claim to be representative; please share if you found something different or think I missed a trend.
  • RSA and searchlight analysis are quite popular. I can't remember seeing an analysis using only ROI-based MVPA. I saw several analyses combining searchlight and RSA (e.g. searching the brain for spheres with a particular RSA pattern).
  • Linear classifiers (mostly svms) and Pearson correlation are very widely used. I saw a few nonlinear svms, but not many. Some posters included diagrams illustrating a linear svm, while others simply mentioned using MVPA, with the use of a linear svm being implicit.
  • Feature selection ("voxel picking") is a major concern. Multiple people mentioned having no real idea which methods should be considered, much less knowing a principled, a priori way to choose a method for any particular analysis. This concern probably feeds into the popularity of searchlight methods.
  • I saw quite a few efforts to relate (e.g. correlate) classification results with behavioral results and/or subject characteristics.
  • Multiple studies did feature selection by choosing the X most active voxels (as determined by a univariate test on the BOLD), within the whole brain or particular regions.
I'll share some detailed thoughts and explanations of a few of the methods I ran across; hopefully sooner rather than later ... and I never did hear what the conference logo was supposed to be. It certainly was ubiquitous, though.

Monday, October 8, 2012

me and MVPA at SfN

I'll be at SfN this weekend/next week. Contact me if you'd like to chat about anything fMRI analysis related. I also have a poster: 390.06, Monday morning; stop on by.

Some of us using MVPA are having an informal gathering Tuesday (16 October), 6:30 pm. Contact me for the details if you'd like to join.

By the way, anyone know what this year's SfN logo is supposed to be? I find it oddly disturbing.

Monday, October 1, 2012

searchlight shapes: BrainVoyager

Rainer Goebel kindly provided a description and images of the searchlight creation used in BrainVoyager:

"BrainVoyager uses the "sphere" approach (as in our original PNAS paper Kriegeskorte, Goebel, Bandettini 2006), i.e. voxels are considered in a cube neighborhood defined by the radius and in this neighborhood only those voxels are included in the "sphere" that have an Euclidean distance from the center of less than (or equal to) the radius of the sphere. From your blog, I think the resulting shapes in BrainVoyager are the same as for pyMVPA.

Note, however, that in BrainVoyager the radius is a float value (not an integer) and this allows to create "spheres" where the center layer has a single element on each side at cardinal axes (e.g. with radius 1, 2, 3, 4... voxels, see snapshot below) but also "compact" spheres as you seem to have used by setting the radius, e.g. to 1.6, 1.8, 2.6, 2.8, 3.6...). "

At right is an image Rainer generated showing  radius 1.0 and 2.0 searchlights created in BrainVoyager.

I am intrigued by Rainer's comment that using non-integer radii will make more "compact" spheres; the possibility of non-integer radii underscores the need to be explicit in describing searchlight shape in methods sections. It appears that pyMVPA requires integer radii, but the Princeton MVPA Toolbox does not.

Tuesday, September 25, 2012

random searchlight averaging

I just finished reading several papers by Malin Björnsdotter describing her randomly averaged searchlight method and am very excited to try it.

In brief, instead of doing an exhaustive searchlight analysis (a searchlight centered on every voxel), searchlights are positioned randomly, such that each voxel is in just one searchlight. The parcellation is repeated, so that each voxel is included in several different searchlights (it might be in the center of one, touching the center of another, in the edge of a third, etc.). You then create the final information map by assigning each voxel the average of all the searchlights it's a part of - not just the single searchlight it was the center of as in normal searchlight analysis.
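
In R terms the averaging might look something like this (a toy sketch with hypothetical objects; the searchlight accuracies themselves come from whatever classification code is normally used):

    # searchlight.id[[k]][v]: which searchlight voxel v was assigned to in parcellation k
    # searchlight.acc[[k]][s]: the accuracy of searchlight s in parcellation k
    n.parcellations <- length(searchlight.id)
    # each voxel gets the mean accuracy of every searchlight it fell into, across the parcellations
    avg.map <- rowMeans(sapply(1:n.parcellations, function(k) searchlight.acc[[k]][searchlight.id[[k]]]))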

Malin introduces the technique as a way to speed searchlight analysis: far fewer searchlights need to be calculated to achieve good results. That's certainly a benefit, as searchlight analyses can take a ridiculous amount of time, even distributed across a cluster.

But I am particularly keen on the idea of assigning each voxel the average accuracy of all the searchlights it's a part of, rather than the single centered searchlight accuracy. In this and other papers Malin shows that the Monte Carlo version of searchlight analysis produces more accurate maps than standard searchlighting, speculating that the averaging reduces some of the spurious results. I find this quite plausible: a single searchlight may be quite accurate at random; the chance high accuracy will be "diluted" by averaging with the overlapping searchlights. Averaging won't eliminate all distortions in the results of course, but might help mitigate some of the extremes.

Malin points out that averaged information maps can be constructed after doing an exhaustive searchlight mapping, not just a Monte Carlo one. I will certainly be calculating some of these with my datasets in the near future.


Björnsdotter M, Rylander K, & Wessberg J (2011). A Monte Carlo method for locally multivariate brain mapping. NeuroImage, 56 (2), 508-16. PMID: 20674749

Monday, September 24, 2012

searchlight shapes: searchmight

The searchmight toolbox was developed by Francisco Pereira and the Botvinick Lab and introduced in Information mapping with pattern classifiers. The toolbox is specialized for searchlight analysis, streamlining and speeding the process of trying different classification algorithms on the same datasets.

Francisco said that, by default, the toolbox uses cubical searchlights (other shapes can be manually defined). In the image of different one-voxel radius searchlights, this is the one that I labeled "edge, face, or corner": 26 surrounding voxels for a one-voxel radius searchlight. Larger radii are also cubes.


Pereira, Francisco, & Botvinick, Matthew (2011). Information mapping with pattern classifiers: A comparative study. NeuroImage. DOI: 10.1016/j.neuroimage.2010.05.026

Wednesday, September 19, 2012

pyMVPA searchlight shape: solved

Michael Hanke posted further description of the pyMVPA searchlight: it is the same as the Princeton MVPA toolbox searchlight I described earlier.

As Michael points out, it is possible to define other searchlight shapes in pyMVPA; this is the default for a sphere, not the sole option.

searchlight shapes: a collection

The topic of searchlight shapes is more complex than I'd originally expected. I'd like to compile a collection of what people are using. I added a "searchlight shapes" label to the relevant posts: clicking that label should bring up all of the entries.

I have contacted some people directly in the hopes of obtaining sufficient details to add their searchlight shapes to the collection. If you have details of a searchlight not included, please send them and I'll add it (or arrange for you to post them). My initial goal is to describe the main (i.e. default) spherical searchlights produced in the most widely used code/programs, but I also plan to include other types of searchlights, like the random ones created by Malin Björnsdotter.

Tuesday, September 18, 2012

permutation testing: one person only

Here are the results and specifics of the permutation testing for one (real) person from the dataset I introduced in the previous post.

Combining the first six runs (the first training set) this person has 19 usable examples of class a and 31 of class b. In the last six runs (the first testing set) there are 22 usable examples of class a and 18 of class b. The data is thus quite out of balance (many more b in the training set and many more a in the testing). Balancing the dataset requires selecting 19 of the class b examples from the training set, and 18 of class a from the testing set. I did this selection randomly 10 times, which I will call "splits".

The permutation test's aim is to evaluate the classification significance: is the accuracy by which we classified the stimuli as a or b meaningful? I do this by permuting the class labels in the dataset, so in any particular permutation some class a examples are still labeled a, while others are labeled b. But in this case, even though it's just one person, we have ten datasets: the splits.

I handle this by precalculating the label permutations, then running the permutation test on each of the splits separately and averaging the results. To make it more concrete, this person has (after balancing) 19 examples of each class in the training set. I sort the rows so that they are grouped by class: the real labeling is 19 a followed by 19 b. I store 1000 random label rearrangements, each containing 19 a and 19 b. For example, here's the new labeling for permutation #1:  "a" "b" "b" "a" "b" "a" "b" "a" "a" "a" "b" "b" "b" "a" "a" "b" "a" "b" "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "a" "b" "a" "a" "b" "b" "b" "b" "a" "b". I replace the real class labels with this set, then run the usual classification code - for each of the 10 splits.

When everything is done I end up with 1000 permuted-labels * 10 splits accuracies. I then make a set of 1000 accuracies by averaging the accuracy obtained with each label permutation on each split. In other words, I take the average of the accuracies I got by using the abbababaaa... labeling on each of my 10 "split" datasets. I then compute the significance by taking the rank of the real accuracy over the number of permutations. In this case, that's 2/1001 = 0.002.
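
In R terms, the combining step is just an average across the splits before ranking (a sketch, with hypothetical names for the stored results):

    # acc: 1000 x 10 matrix; acc[p, s] = accuracy of label permutation p on split dataset s
    # real.acc: length-10 vector; accuracy of the true labeling on each split
    null.dist <- rowMeans(acc)        # one averaged accuracy per relabeling
    real.mean <- mean(real.acc)
    p.value <- (sum(null.dist > real.mean) + 1) / (length(null.dist) + 1)   # 2/1001 for this person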

Here's how it actually looks. These are histograms of the null distributions obtained on each of the split datasets separately, plus the averaged one. The vertical blue lines give the accuracy of the true labeling on each split, while the vertical light-grey line is at 0.5 (chance). The first graph is overplotting density plot versions of the 10 splits' histograms.


This method obviously requires a lot of computations (I often use a cluster). But I like that it enables comparing apples to apples: I'm comparing the accuracy I got by averaging 10 numbers to a distribution of accuracies obtained by averaging 10 numbers. Since the same set of splits and relabelings is used in all cases, the averaging is meaningful.

In this case, as often, the averaged distribution has less variance than the individual splits. This is not surprising, of course. It's also clear that the true-labeling accuracies vary more than the permutation distributions for each split, which is probably also not surprising (since these are permutation distributions), but not as clean as might be hoped. I do like that the true accuracy is at the right of all of the distributions; I consider this amount of variability pretty normal. I'd be worried if the distributions varied a great deal, or if the true-label accuracies are more variable (e.g. at chance on some splits, > 0.7 on others).

permutation testing: example dataset

I currently have a dataset that while mostly straightforward to analyze, has some complications that arise frequently. In this post I'll sketch some critical aspects to hopefully clarify the permutation testing issues.
  • The analysis is within-subjects (each person classified alone) with anatomical ROIs. The ROIs are reasonably small and independently defined; we will not do additional feature selection.
  • The subjects completed 12 scanning runs, with the stimuli reasonably-separated (time-wise), and randomly ordered.
  • There are various types of stimuli, but we are only interested in certain two-way classifications. 
  • We only want to include examples where the subject gave the correct response.
  • We will do linear support vector machines, c=1.
As frequently happens, there are not as many examples as I might prefer. The stimuli types were randomized and balanced across the experiment, but not such that each run was guaranteed to have the same number of examples of the types we're classifying. After removing examples the subjects answered incorrectly, the imbalance is even worse: not all stimulus types are present in every run for every person.

There are various ways to set up the cross-validation. Partitioning on the runs is not an option, since some runs don't have all the stimulus types. My goal was to find a partitioning scheme that is as simple as possible, but separates the training and testing sets in time while getting the number of examples of each class in the training and testing sets as close as possible.

I calculated the number of training and testing examples for several schemes, such as leave-four-runs-out and leave-three-runs-out, settling on leave-six-runs-out. Specifically, I use the first six runs as the training set and the last six runs (in presentation order) as the testing set, then the reverse. Using this scheme, the smallest number of examples in a partition for any person is 6; not great, but I guessed workable. The largest minimum for any person is 25; most people have in the teens.

Even with leave-six-runs-out partitioning the number of examples of the two classes in the training and testing set is usually unequal, sometimes strikingly so. My usual work-around is to subset the larger class to equalize the number of examples in each case (e.g. if there are 12 of class a and 14 of class b in the training set, take out two of the class b). There are of course many, many ways to do the subsetting. My usual practice is to randomly generate 10 of these subsettings, then average the resulting accuracies (e.g. calculate the accuracy after taking out the 1st and 3rd class b examples in the training set, then after taking out the 4th and 13th examples, etc.).
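
Here's a small R sketch of that balancing step (illustration only; train.data and its class column are hypothetical names):

    # keep a random subset of the larger class so both classes have the same number of examples
    balance <- function(dat) {       # dat: one partition's examples, with a 'class' column
      n.min <- min(table(dat$class))
      do.call(rbind, lapply(split(dat, dat$class), function(d) d[sample(nrow(d), n.min), ]))
    }
    splits <- lapply(1:10, function(i) balance(train.data))   # 10 random balanced versions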

Monday, September 17, 2012

permutation testing: opening thoughts

This is the first of what will probably be a continuing series of musing, meandering posts about permutation testing for group fMRI MVPA.

I'm a big fan of permutation testing, especially for MVPA: our questions, analyses, sample sizes, distributions, etc. are so far from what was envisioned for parametric stats that using a "standard" parametric test "out of the box" gives me pause (or, in practice, lots of assumption-checking). I like that with permutation tests it's possible to design a test that gets at exactly the hypothesis we're asking, while keeping the relevant variance structures (e.g. how many runs, type of classifier, number of examples in each cross-validation fold) intact.

But, the exact way two different people design a permutation test for any particular question will probably be different. And this gets worse in group analyses.

I'll be writing a series of posts illustrating some of the common ways permutation tests are done, both single subject and group analyses. My intent is that these posts will highlight and explain how to carry out what I think are the most useful and common methods, but also share my perception of where some of the open questions lie. As always, please feel free to comment, especially pointing out where your practice/observation/experience/conclusions differ from mine.


Wednesday, September 12, 2012

more searchlights: the Princeton mvpa toolbox

MS Al-Rawi kindly sent me coordinates of the searchlights created by the Princeton MVPA toolbox, which I made into this image. As before, the center voxel is black, the one-voxel-radius searchlight red, the two-voxel-radius searchlight purple, and the three-voxel-radius searchlight blue.

The number of voxels in each is:
1-voxel radius: 6 surrounding voxels (7 total)
2-voxel radius: 32 surrounding voxels (33 total)
3-voxel radius: 122 surrounding (123 total)
4-voxel radius: 256 surrounding (257 total)

From a quick look at the code generating the coordinates, it appears to "draw" a sphere around the center voxel, then retrieve the voxels falling within the sphere. (anyone disagree with this characterization?)
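
Here's a little R sketch of that strategy (my own illustration, not the toolbox code); it reproduces the counts listed above:

    # keep every voxel offset whose Euclidean distance from the center is at most the radius
    sphere.offsets <- function(radius) {
      r <- ceiling(radius)
      g <- expand.grid(dx=-r:r, dy=-r:r, dz=-r:r)       # all offsets in the bounding cube
      g[sqrt(g$dx^2 + g$dy^2 + g$dz^2) <= radius, ]     # then cut the cube down to the sphere
    }
    sapply(1:4, function(r) nrow(sphere.offsets(r)) - 1)   # surrounding voxels: 6 32 122 256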

This "draw a sphere first" strategy explains why I couldn't come up with this number of voxels for a three-voxel-radius searchlight if the shape follows additive rules (e.g. "add voxels that share a face with the next-size-smaller radius"). It's another example of how much it can vary when different people actually implement a description: to me, it was more natural to think of searchlights of different radii as growing iteratively, a rather different solution than the one used in the Princeton MVPA toolbox.

Tuesday, September 11, 2012

my searchlight, and some (final?) thoughts

Since I've been thinking about searchlight shapes, here's a diagram of the two-voxel radius searchlight I'm currently using in a few projects. As in the previous post, the center voxel is black, the one-voxel radius surrounds red, and the two-voxel radius surrounds purple.

This two-voxel radius searchlight has 13 + 21 + 24 + 21 + 13 = 92 voxels in the surround, which makes it larger than the three-voxel radius faces-must-touch searchlight (at 86 surrounding voxels).

So how should we grow searchlights: edge-face-corner, edge-face, faces-only? Beats me. I can imagine comparing the results of a few different shapes in a particular dataset, but I doubt there's a single shape that is always best. But the searchlight shape should be added to our list of details to report when describing a searchlight analysis.

PS - I use R code to generate lists of adjacent voxels for running searchlight analyses; contact me if you'd like a copy/details.

pyMVPA searchlight shapes

Michael Hanke listed the number of voxels for different searchlight radii in pyMVPA. He also wrote that pyMVPA considers neighboring voxels to be those that share two edges (i.e., a face).

Here's a diagram of following the faces-touching rule. The center voxel is black, the one-voxel radius surrounds red, the two-voxel radius surrounds purple, and the three-voxel radius surrounds blue.
This gives 1 + 9 + 21 + 24 + 21 + 9 + 1 = 86 surrounding voxels for the three-voxel radius.

But Michael wrote that a three-voxel radius searchlight in pyMVPA has 123 voxels (counting the center). If so, at left is how I could come by that count: 13 + 28 + 40 + 28 + 13 = 122.

This shape is not quite symmetrical: there are four voxels to the left, right, front, and back of the center voxel but only two above and below.

Anyone know if this is what pyMVPA does?

UPDATE: Neither of these sketches is correct, see pyMVPA searchlight: solved.

Monday, September 10, 2012

searchlight shapes

Exactly how the searchlight is shaped is one of those details that is usually not mentioned in searchlight analysis papers but is not obvious, mostly because voxels are cubes: a "sphere" of cubes can be built in several different ways.

Nikolaus Kriegeskorte (the father of searchlight analysis) uses the illustration on the left.

Expanded, I think this would look like the picture at left. There are 32 voxels surrounding the center; the center voxel in black and surrounding in blue.


But there are other ways of thinking of a searchlight. Here are three ways of defining a one-voxel radius searchlight.

Personally, I implemented the middle version (edges and faces, not corners, 18 voxels) as a one-voxel radius searchlight in my code. Larger radii are calculated iteratively (so for a two-voxel radius, combine the one-voxel radius searchlights defined by treating all 18 surrounding voxels present in the one-voxel radius searchlight as centers). I don't think one version of defining surrounds is necessarily better than others, but we do need to specify the voxels we are including.
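
For comparison, here's a simplified R sketch of this iterative growth under the edges-and-faces rule (an illustration, not my actual analysis code); it gives 18 surrounding voxels at a one-voxel radius and 92 at a two-voxel radius, matching the counts in the post above:

    neighbors <- as.matrix(expand.grid(dx=-1:1, dy=-1:1, dz=-1:1))
    neighbors <- neighbors[rowSums(neighbors != 0) %in% 1:2, ]   # the 18 face- or edge-touching offsets
    grow <- function(offsets) {    # one iteration: add the 18-neighborhood of every voxel so far
      grown <- do.call(rbind, lapply(1:nrow(offsets), function(i) sweep(neighbors, 2, offsets[i, ], "+")))
      out <- unique(rbind(offsets, grown))
      out[rowSums(out != 0) > 0, , drop=FALSE]                   # drop the center voxel itself
    }
    one.rad <- grow(matrix(0, nrow=1, ncol=3))   # 18 surrounding voxels
    two.rad <- grow(one.rad)                     # 92 surrounding voxels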


update 11 September: I changed the caption of the searchlight shape images for clarity.

Friday, August 17, 2012

what MVPA detects

MVPA is not a universal solution for fMRI; it's always necessary to carefully think about what types of activity changes you want to detect.

For example, suppose these are timecourses for two people completing a task with two types of trials. It's clear that in both people there is a very strong effect of the trials: for person #1 the BOLD goes up during trial type A and down during trial type B; the reverse is true for person #2.


"Standard" MVPA (e.g. linear svm) will detect both of these patterns equally well: there is a consistent difference between trial types A and B in both people. In addition, the difference in direction is usually not reflected in the analysis: often only each subject's accuracy is taken to the second level.

This can be a feature, or a bug, depending on the hypotheses: If you want to identify regions with consistent differences in activation in each person, regardless of what those differences are, it's a feature. If you want to identify regions with a particular sort of difference in activation, it can be a bug.

My suggestion is usually that if what you want to detect is a difference in the amount of BOLD (e.g. "there will be more BOLD in this brain region during this condition compared to that condition") then it's probably best to look at some sort of mass-univariate/GLM/SPM analysis. But if you want to detect consistent differences regardless of the amount or direction of BOLD change (e.g. "the BOLD in this brain region will be different in my conditions"), then MVPA is more suitable.

Note also that a linear svm is perfectly happy to detect areas in which adjacent voxels have opposite changes in BOLD - the two timecourses above can be within the same ROI yet be detected quite well as an informative area. As before, this can be a feature or a bug. So, again, if you want to detect consistent regional differences in the overall amount of BOLD, you probably don't want to use "standard" MVPA.
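
A tiny toy example of that last point (my own illustration, assuming the e1071 package; not from any real dataset): two voxels with opposite responses are classified very well by a linear svm, even though their average shows little difference between the trial types.

    library(e1071)          # assumed for svm()
    set.seed(42)
    n <- 40                 # 40 examples of each trial type
    dat <- data.frame(type = factor(rep(c("A","B"), each=n)),
                      vox1 = c(rnorm(n, 1), rnorm(n, -1)),    # up for A, down for B
                      vox2 = c(rnorm(n, -1), rnorm(n, 1)))    # down for A, up for B
    train.rows <- c(1:20, 41:60)    # half of each type for training, the rest for testing
    fit <- svm(type ~ vox1 + vox2, data=dat[train.rows, ], kernel="linear", cost=1)
    mean(predict(fit, dat[-train.rows, ]) == dat$type[-train.rows])   # well above chance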

Thursday, August 2, 2012

Nipype: comments from Yaroslav

Yaroslav Halchenko was kind enough to provide some detailed feedback on my previous post, which he allowed me to excerpt here.
"a decent understanding of both nipype and the programs it's calling (from fsl or whatever)."
Yaroslav: yes and no. As with any software -- decent understanding is desired but not strictly necessary. nipype comes with "pre-cooked" workflows for some standard analyses so you would just need to tune the script/parameters up to input your data/design.
"as most useful to a skilled programmer wanting to run several versions of an analysis" 
Yaroslav: That is one of the cases. I could argue that the most useful is an efficient utilization of computing resources. i.e running many computations in parallel (e.g. per each functional run/ subject etc). 

also -- noone runs their analysis ones with all parameters decided a priory and not 'tuning them up' -- that is also where nipype comes useful since it would be smart to recompute only what needs to be recomputed. Thus possibly helping to avoid human errors.

 as for a "skilled programmer" -- well... somewhat -- scripting nipype may be could be easier but altogether it is not that bad actually. Just as with any scripting/hacking with practice comes (greater) efficiency. 

Jo: I saw mentions of using it in parallel. Our computer support people had a devil of a time (and basically failed) with python on our shared systems in the past. But we're also having problems with matlab/spm scripts tying up way too much of the systems and not getting properly managed. So that might be worth a closer look for us.
"It's often necessary to check initial output before moving to the next step in the pipeline; these errors would need to be caught somehow (go back and check intermediate files?)" 
Yaroslav: you could code your "pipeline" incrementally indeed and check all the intermediate results... or just run it once and then check them and rerun pipeline if you would need to change any of the parameters. I do not see it being much different from what you would get with other software -- once again boiling down to how particular/anal you want to be about checking your data/results. Also you could code/assemble pieces which would provide you with reports of the specific processing steps so you could later on glance over them rapidly.
"Nipype will need to be updated any time one of the underlying programs is updated." 
Yaroslav: again -- yes and no -- nipype (interfaces) would need to be updated only if corresponding programs' command line interface gets changed in some non-backward compatible fashion. Most of the software suites avoid such evil actions; but yes -- nipype would need to be (manually) updated to accommodate new cmdline options, and at times remove some if they were removed from the original software.

Jo: And I would think the defaults. It'd be troublesome if, say, a default Nipype motion-correction routine used a different motion-correction algorithm specification than the GUI version.


Yaroslav: well -- they are trying to match the defaults and as long as it doesn't change and you do not mix GUI/non-gui -- how do you know that GUI's is different/better.

Jo: I don't know that the GUI options are better. But practically speaking, many people (at least with spm) use the default options for most choices, so I was thinking we'd need to confirm that the options are the same running it through the GUI or Nipype, just for consistency.
"I've had terrible troubles before with left/right image flipping and shifting centerpoints when changing from one software package to another (and that's not even thinking about changing from one atlas to another). Trying to make packages that have different standard spaces play nicely together strikes me as very non-trivial..."
Yaroslav:  RIGHT ON -- that is indeed one of the problems I have heard about (mixing AFNI and FSL IIRC) and to say the truth I am not sure how far nipype could help assuring correct orientation... if that is possible at all -- it would be a good question to nipype folks I guess. BUT many tools do play nicely together without doing all kinds of crazy flips, so, although it is a valid concern, it is manageable (run pipeline and verify correct orientation by inspecting results through the steps) and might be limited to only a limited set of abusers.

"... Second, would there be a benefit to using Nipype if the pipeline only includes one program (e.g. why Nipype instead of spm8 batching)?"
Yaroslav:   well -- not everyone is using spm8 and those tools might lack back processing/dependencies tracking etc... and even with spm8 -- you might like eventually to add few additional steps provided by other toolkits and would be ready to do so with ease (given you also check for possible flip).

Monday, July 30, 2012

Nipype impressions

I was asked to take a look at Nipype (great logo, Arno!). Here are some of my impressions; I haven't tried coding or installing it. If you have, I'd be curious to hear how it's gone and if you think any of this is off the mark.

Nipype aims to connect different fMRI software packages (spm, fsl, etc) so that you can program a reproducible set of steps. Your preprocessing could mix programs (e.g. smooth in one program, realign in another), and it should be easier to switch a step from one program to another (e.g. smooth in fsl instead of spm) with relatively few changes to the program.

I like that Nipype makes it possible to reproduce and share analysis steps; the current procedures are far too messy and error-prone. Nipype looks to require a fair bit of programming skill (python), however, as you're coding at a step above the underlying routines/programs: to do it properly you'd need a decent understanding of both nipype and the programs it's calling (from fsl or whatever).

Nipype strikes me as most useful to a skilled programmer wanting to run several versions of an analysis (e.g. with different motion-correction algorithms) or the same analysis many, many times (lots of subjects). It makes it more straightforward to make an established set of analysis steps (such as preprocessing) available to other people. It's hard to tell without trying it out, but it looks like a nipype program could be set up to be run by someone that doesn't know the underlying routines (e.g. an RA running the same preprocessing steps on a new group of subjects).

I have concerns about missed errors, specifying options, and stability. It's often necessary to check the initial output before moving to the next step in the pipeline; such problems would need to be caught somehow (go back and check intermediate files?). Most routines have many options (e.g. FWHM, interpolation algorithm); does Nipype set the default values? Making every single option visible through Nipype seems daunting, which feeds my concern about stability: Nipype will need to be updated any time one of the underlying programs is updated. And installation would likely be non-trivial (all the necessary dependencies and paths), unless it's run in NeuroDebian.

A few final thoughts: First, I've had terrible troubles before with left/right image flipping and shifting centerpoints when changing from one software package to another (and that's not even thinking about changing from one atlas to another). Trying to make packages that have different standard spaces play nicely together strikes me as very non-trivial (I don't mean code-wise, I mean keeping track of alignment). Second, would there be a benefit to using Nipype if the pipeline only includes one program (e.g. why Nipype instead of spm8 batching)? It seems in this case it might just be adding extra complication: the need to learn the program options, plus the Nipype code.

So will I recommend we try Nipype right now? No. But I'll keep an eye on it and am certainly willing to revise my opinion.


Gorgolewski K, Burns CD, Madison C, Clark D, Halchenko YO, Waskom ML, & Ghosh SS (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Frontiers in Neuroinformatics, 5. PMID: 21897815

Thursday, July 12, 2012

but replications are possible

Considering the posts on all the reasons that fMRI findings can be dodgy and replications unlikely, I'll share some recent experiences I've had in which proper replications succeeded.

I can't go into any detail (yet), but this image shows a few slices with the results from the first dataset in red and the second in blue. We definitely tweaked the analysis of the first dataset a bit; I did as much of the tweaking as possible with a control classification rather than the real one, but I can't claim that all the choices were made on a priori principles. I would claim, though, that the choices were made without any data peeking for the second dataset: the blue blobs are the very first set of across-subjects results, using exactly the analysis we settled on for the first dataset. And I'm extra-pleased by these results because they are properly independent: different scanner, different people, even different stimuli (but the same task and experimental questions, of course).

Within the last week I've actually had another case of replicating results in properly independent datasets; a completely different set of analyses than the blobs above. It's an anatomical ROI-based analysis, so I don't have pretty pictures to show, but the tables are very lovely.

So do not give up hope: replications are possible. But we must be very vigilant in our methods: we are all easy to fool.

Monday, July 9, 2012

many, many options (continued)

Thanks for the comments on the previous post; here are a few thoughts and responses.

optimizing

Jonas Kaplan mentioned optimizing the methodology on a pilot subject or two, who are then not included in the main analysis. That can be a great solution, especially if the pilot subjects are collected ahead of time and need to be collected regardless.

Something that's worked for me is figuring out the analysis using a positive control classification, then applying it to the real classification: for example, classifying button presses (did the person push the first-finger or second-finger response button?). Which button was pressed is of course not of interest, but if I can't classify button pressing using motor voxels, I know something is not right, and I can use the button-pressing accuracy as a sort of yardstick (if button presses are classified at around 80% accuracy and the task I actually care about at 75%, that's probably a strong signal).
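For concreteness, here's a minimal sketch of the yardstick idea. It's written in Python/scikit-learn rather than the code I actually use, and the data matrices, labels, and voxel counts are all made-up placeholders:

# Positive-control ("yardstick") classification: if button presses can't be
# decoded from motor voxels, something is likely wrong upstream (labels,
# preprocessing, example extraction), whatever the task of interest shows.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_examples, n_voxels = 40, 50
runs = np.repeat([1, 2], n_examples // 2)      # two runs, partition on the runs

# hypothetical example-by-voxel matrices and labels for the two classifications
motor_data = rng.normal(size=(n_examples, n_voxels))   # motor-cortex voxels
task_data = rng.normal(size=(n_examples, n_voxels))    # voxels for the real question
button_labels = np.tile([0, 1], n_examples // 2)       # first vs. second finger
task_labels = np.tile([0, 1], n_examples // 2)         # condition of interest

clf = SVC(kernel='linear', C=1)                # the "default" linear svm
cv = LeaveOneGroupOut()                        # 2-fold, leave-one-run-out

control_acc = cross_val_score(clf, motor_data, button_labels,
                              groups=runs, cv=cv).mean()
task_acc = cross_val_score(clf, task_data, task_labels,
                           groups=runs, cv=cv).mean()

# the control accuracy is the yardstick: roughly 0.8 for button presses with
# around 0.75 for the task of interest would suggest a strong task signal
print(round(control_acc, 2), round(task_acc, 2))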

Neither strategy (testing out the methodology on different subjects or on a control classification) is perfect (the pilots could be quite different from the other subjects, the control classification could have a radically different signal than the classification of interest, etc.), but either is better than nothing, or than trying out many different analyses on the real data.

how big an impact?

An anonymous commenter mentioned that they've often found a variety of methodological choices to produce similar results; sometimes a change will increase accuracy but also variability, resulting in similar conclusions/significance.

My experience is similar: sometimes changing a factor has a surprisingly small impact on the results. Classifier type comes to mind immediately; I've tried both linear svms and random forests (very different classifiers) on the same datasets with very similar results. Changing the number of cross-validation folds doesn't always make a big difference either, though changing schemes entirely can (e.g. switching from partitioning on the runs to ignoring the runs when partitioning).
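Here's a little sketch of the kind of check I mean (again Python/scikit-learn with made-up data, not my actual code): the same examples and the same run-wise cross-validation, classified once with a linear svm and once with a random forest.

# Does classifier choice matter for this dataset? Same examples, same
# partitioning, two very different classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
data = rng.normal(size=(40, 50))       # 40 examples, 50 voxels (hypothetical)
labels = np.tile([0, 1], 20)
runs = np.repeat([1, 2], 20)           # partition on the runs
cv = LeaveOneGroupOut()

for name, clf in [('linear svm', SVC(kernel='linear', C=1)),
                  ('random forest', RandomForestClassifier(n_estimators=500,
                                                           random_state=0))]:
    acc = cross_val_score(clf, data, labels, groups=runs, cv=cv).mean()
    print(name, round(acc, 2))

# a large gap between the two would be a reason to look more closely at the
# dataset and the analysis choices, not just a license to report the better number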

I've found that some choices have made very large differences in particular datasets; enough to change the interpretation completely.
  • temporal compression. I've seen massive differences in results when compressing to one example per run instead of one example per block; much more accurate with more averaging. I interpret this as the increase in signal/noise (from averaging more volumes together) outweighing the decrease in the number of datapoints (fewer examples with more compression).
  • scaling and detrending. Some datasets, particularly when classifying across subjects, have very different results depending on which scaling technique was used (i.e. how the data are normalized: across volumes within a single voxel, or across voxels within a single image).
  • balancing. This one's a bit different; I mean choosing a subset of examples if necessary to ensure that equal numbers of each class are sent to the classifier. For example, suppose we've compressed to one example per block, and are partitioning on the runs, but are only including blocks the person answered correctly. We might then have different numbers of examples in each class in each cross-validation fold. I usually handle this by subsetting the bigger class at random, so some of its examples are left out. Since there are many ways to choose which examples to leave out, I usually do ten different random subsets and average the results; a minimal sketch of this subset-and-average approach follows this list. Sometimes which examples are included can make a very large difference in accuracy. In these cases I often look at the processing and analysis stream to see if there's a way to make the results more stable (such as by increasing the number of examples in the training set).
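Here's that subset-and-average sketch for a single training fold (Python/scikit-learn, with hypothetical example counts; not my actual code):

# Balancing a training fold by randomly subsetting the bigger class, then
# averaging accuracy over ten different random subsets.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# hypothetical training fold: 12 correct-response blocks of class 0, 9 of class 1
train_data = rng.normal(size=(21, 50))
train_labels = np.array([0] * 12 + [1] * 9)
test_data = rng.normal(size=(20, 50))          # held-out run, already balanced
test_labels = np.tile([0, 1], 10)

n_small = (train_labels == 1).sum()            # size of the smaller class
accuracies = []
for _ in range(10):                            # ten different random subsets
    keep0 = rng.choice(np.where(train_labels == 0)[0], size=n_small, replace=False)
    keep1 = np.where(train_labels == 1)[0]
    keep = np.concatenate([keep0, keep1])      # now equal numbers of each class
    clf = SVC(kernel='linear', C=1).fit(train_data[keep], train_labels[keep])
    accuracies.append(clf.score(test_data, test_labels))

# report the average over the random balancings; a wide spread is a hint that
# the results are unstable and more training examples are needed
print(round(np.mean(accuracies), 2), round(np.std(accuracies), 2))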

Monday, July 2, 2012

many, many options

Let me join the chorus of people singing the praises of the recent article "False-positive psychology". If you haven't read the paper yet, go read it, especially if you don't particularly like statistics. One of the problems they highlight is "researcher degrees of freedom." In the case of MVPA I'd summarize this as having so many ways to do an analysis that, if you know what sort of result you'd like, you can "play around" a bit until you find an analysis that yields it. This isn't cheating in the sense of making up data, but it's more insidious: how do you determine which analysis you should perform, and when to accept the results you found?

Neuroskeptic has a great post on this topic, listing some of the researcher degrees of freedom in analyzing a hypothetical fMRI experiment:
"Let's assume a very simple fMRI experiment. The task is a facial emotion visual response. Volunteers are shown 30 second blocks of Neutral, Fearful and Happy faces during a standard functional EPI scanning. We also collect a standard structural MRI as required to analyze that data."
What are some of the options for analyzing this with MVPA? This is not an exhaustive list by any stretch, just the first few that came to mind (a small tally after these lists shows how quickly the combinations multiply).
temporal compression
  • Average the volumes to one per block. Which volumes to include in the average (i.e. to account for the hemodynamic lag)?
  • Create parameter estimate images (PEIs) (i.e. fit a linear model and do MVPA on the beta weights), one per block. The linear model could be canonical or individualized.
  • Average the volumes to one per run. Calculate the averages from the block files or all at once from the raw images.
  • Create one PEI for each run.
  • Analyze individual volumes (first volume in each block, second volume in each block, etc).
classifier
  • the "default": linear svm, c=1.
  • a linear svm, but fit the c.
  • a nonlinear svm (which type?).
  • a different classifier (random forest, naive bayes, ....).
  • correlation-based
  • linear discriminants (multiple options)
cross-validation
  • on the runs
  • on a combination of runs (first two runs out, next two out, etc)
  • ignoring the runs (ten-fold, leave-three-examples-out, etc)
  • on the subjects (leave-one-subject-out)
  • on the runs, but including multiple subjects
voxels
  • whole-brain
  • ROI (anatomical, functional, hybrid)
  • searchlight (which radius? which shape? how to combine across subjects?)
  • resize the voxels?
  • scale (normalize) the data? (across voxels within an example, across examples?). Center, normalize the variance, take out linear trends, take out nonlinear trends?
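Even this incomplete list multiplies quickly. Here's the tiny tally I mentioned above, counting only the options written out in the lists and ignoring all the sub-choices hiding inside each bullet (which radius, which nonlinear svm, which volumes to average, ...):

# rough count of the analysis combinations implied by the lists above
from math import prod

options = {'temporal compression': 5,
           'classifier': 6,
           'cross-validation': 5,
           'voxels': 5}
print(prod(options.values()))    # 5 * 6 * 5 * 5 = 750 distinct analyses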
Simmons, Nelson, and Simonsohn argue that we (as authors and reviewers) need to be clear about why we chose particular combinations, and how sensitive the results were to those choices. For example, there is not always a clear choice as to the best cross-validation scheme to use in a particular analysis; several may be equally valid (leave-three-out, leave-four-out, leave-five-out, ...). If you only try one scheme and report it, that's fine. But if you try multiple, then only report the one that "worked", you've really increased the odds of finding a false positive. We need to be honest about how stable the results are to these types of (sometimes arbitrary) analytical choices.



Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-66. PMID: 22006061