Friday, August 30, 2013

What is a support vector machine?

This is one of the most readable, non-mathematical introductions to SVMs I've ever run across in the literature. Noble uses microarray data for the example, but replace "measured mRNA expression level" with "BOLD activation" and the situation is pretty much the same as in MVPA.

He includes a description of what it means to have a soft (or hard) margin; I'll add that the conventional c=1 used with linear classifiers for MVPA corresponds to a fairly soft margin: many training points are allowed to be on the 'wrong' side of the decision boundary.
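For readers who like to see the knob itself: below is a minimal sketch (Python/scikit-learn, whose SVC wraps libsvm) of how the cost parameter controls margin softness. The toy data and the particular C values are my own invention for illustration, not from any real analysis.

# minimal sketch: how the SVM cost parameter C sets margin softness.
# scikit-learn's SVC wraps libsvm, so C plays the same role as the c=1
# usually passed to linear SVMs in MVPA. Toy data for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 20) + 0.5,    # class A: 50 "examples", 20 "voxels"
               rng.randn(50, 20) - 0.5])   # class B
y = np.repeat([0, 1], 50)

for C in [0.01, 1, 100]:                   # softer ... harder margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # a soft margin lets many training points sit inside or beyond the margin
    # (they become support vectors); a large C forces a harder margin.
    print(f"C = {C}: {clf.n_support_.sum()} support vectors")

With overlapping classes like these, the support-vector count typically shrinks as C grows; a small C (soft margin) tolerates many training points on the 'wrong' side of the margin, which is the situation described above for the conventional c=1.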


William S. Noble (2006). What is a support vector machine? Nature Biotechnology, 24, 1565-1567. DOI: 10.1038/nbt1206-1565

Thursday, August 22, 2013

Zhang 2013: response from Jiaxiang Zhang

Jiaxiang Zhang kindly responded to the questions and comments in my previous post, and gave permission to excerpt and summarize our conversation here. Quotes from Jiaxiang are in courier font.

ROI creation

Quite a bit of my commentary was about the ROI creation process. Jiaxiang said that my outline was essentially correct, and provided some additional details and clarification. To make it a bit more readable, here's a new version of the "ROI creation" section from the previous post, answering my questions and adding in his corrections and responses.

  1. Do a mass-univariate analysis, finding voxels with significantly different activity during the chosen and specified contexts. (Two-sided, not "greater activity during the chosen than the specified context" as in my original version.)
  2. Threshold the group map from #1 at p < 0.001 uncorrected, cluster size 35 voxels.
    "The main univariate analysis (Figure 3A and Table 1) used a FWE-corrected threshold (p<0.05). For ROI definition, a common uncorrected voxelwise threshold (p<0.001) was used, which is more liberal for subsequent MVPA. The cluster extent had very limited effects here, because all the ROIs were much larger than the extent threshold."
  3. Use Freesurfer to make cortical surface maps of the group ROIs, then transform back into each person's 3d space.
    "We defined the grey-matter ROIs on a cortical surface (in Freesurfer), and then transformed back to the native space in NIfTI format (which can be loaded into the Princeton MVPA toolbox)."
    I asked if they'd had better results using Freesurfer than SPM for the spatial transformation, or if there was some other reason for their procedure.
    "This is an interesting issue. I have not directly compared the two, but I would guess the two methods will generate similar results."
  4. Feature selection: pick the 120 voxels from each person, ROI, and training set with the largest "difference between conditions" from a one-way ANOVA. Concretely, this ANOVA was of the three rules (motion, color, size), performed in each person and voxel separately. I asked why they chose 120 voxels (rather than all voxels or some other subset), and if they'd tried other subset sizes.
    "We have not tried to use all ROI voxels, because ROIs have different sizes across regions and subjects (e.g., V1). I think it is more appropriate (debatable) to fix the data dimensions. A fixed number of voxels after the feature selection for each participant and each ROI was used, to enable comparisons across regions and participants. 120 voxels were chosen because it was (roughly) the lower limit of all 190 individual ROIs' sizes (19 participants * 10 ROIs). 2 out of the 190 individual ROIs were smaller than 120 voxels (108 and 113 voxels, respectively), for which we used all voxels in that region."
  5. Normalize each example across the 120 voxels (for each person, ROI, and training set). (A code sketch of steps 4 and 5 follows this list.)
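To make steps 4 and 5 concrete, here's a minimal sketch of the fold-wise ANOVA feature selection and per-example normalization in Python (numpy/scipy). The array names, shapes, and the select_and_scale() function are my own invention; the actual analysis was done in the Princeton MVPA toolbox, so treat this as a sketch of the logic, not their code.

# minimal sketch of steps 4 and 5: pick the 120 voxels with the largest
# one-way ANOVA F statistic (across the three rules) in the TRAINING set only,
# then z-score each example across those voxels. Hypothetical array names.
import numpy as np
from scipy.stats import f_oneway

n_keep = 120

def select_and_scale(train_X, train_rules, test_X):
    # train_X: (n_train_examples, n_roi_voxels); train_rules: rule label per example (numpy arrays)
    F = np.array([f_oneway(*(train_X[train_rules == r, v] for r in np.unique(train_rules)))[0]
                  for v in range(train_X.shape[1])])
    keep = np.argsort(F)[::-1][:min(n_keep, train_X.shape[1])]   # top-F voxels (all, if ROI < 120)
    def norm(X):   # normalize each example across the selected voxels
        X = X[:, keep]
        return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return norm(train_X), norm(test_X), keep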

ROI stability

Several of my previous comments were about what, for lack of a better term, I'll call "ROI stability": were (pretty much) the same 120 voxels chosen across the cross-validation folds (within each person, classification, and ROI)? Here is our email exchange:
The mean ROI size is 309 ± 134 voxels (±s.d. across ROIs and participants). I did a simple test on our data. Across ROIs and participants, 26.60% ± 19.00% of voxels were not selected in any cross-validations (i.e., “bad” voxels), 10.56% ± 5.36% of voxels were selected only in one fold of cross-validation (i.e., “unstable” voxels), and 48.07% ± 25.34% of voxels were selected in 4 or more folds of cross-validations (i.e., “more stable” voxels). Although this is not a precise calculation of stability, the selected voxels do not change dramatically between cross-validations. It'd be good to know if there are other more established methods to quantify stability.
(my reply:) That is interesting! Unfortunately, I don't know of any "established" methods to evaluate stability in ROI definition procedures like this; yours is the only one I can think of that used a fold-wise selection procedure. Did you look at the location of any of those voxel categories - e.g. are the "bad" voxels scattered around the ROIs or clustered?
I do not see consistent spatial patterns on voxel categories.
I think this idea of stability, both how to measure it and how to interpret different degrees (e.g. what proportion of the voxels should be selected in each fold for a ROI to be considered stable?), is an interesting area and worthy of more attention.
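If anyone wants to try the same sort of tabulation, the counting itself is trivial once you record which voxels survive feature selection in each fold. A minimal Python sketch, with a hypothetical boolean folds-by-voxels matrix (nothing here is from the paper's code):

# minimal sketch: tabulate how often each ROI voxel survived feature selection
# across the cross-validation folds. 'selected' is a hypothetical boolean
# array, shape (n_folds, n_roi_voxels), True where a voxel was kept.
import numpy as np

def stability_summary(selected):
    counts = selected.sum(axis=0)             # number of folds each voxel was chosen in
    return {"never selected":       np.mean(counts == 0),
            "selected in 1 fold":   np.mean(counts == 1),
            "selected in 4+ folds": np.mean(counts >= 4),
            "mean folds per voxel": counts.mean()}

# e.g., for one subject and ROI with 8 folds and 309 voxels:
# selected = np.zeros((8, 309), dtype=bool)   # ... fill in per fold ...
# print(stability_summary(selected))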

significance testing

My questions about significance testing were more general, trying to understand exactly what was done.

First, about the group-level permutation testing for the ROI analyses:
For each of the 5000 permutations (in each participant and each ROI), classes in the training and test sets were relabelled, and the permuted data was analysed in the same procedure as the observed data. The order of the relabeling in each permutation was kept the same across participants. All the permuted mean accuracies formed the null distributions and were used to estimate the statistical significance of the observed classification accuracy. 
If I understand correctly, this is similar to the "scheme 1 stripes" in http://mvpa.blogspot.com/2012/11/permutation-testing-groups-of-people.html? In other words, you didn't bootstrap for the group null distribution, but rather kept the linking across subjects. Concretely, you calculated 5000 group-level averages for the null distribution, each of the 5000 being the across-subjects average after applying the same relabeling scheme to each participant. This method of doing the group-level permutation test seems quite sensible to me (when possible), but I haven't run across many examples of it in the literature.
Yes. Your blog post is clearer on describing this method! 
How to concisely (but clearly) describe group-level permutation schemes (not to mention the impact of choosing various schemes) is another area worthy of more attention.
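To illustrate the scheme (not the authors' code), here's a minimal Python sketch of a "striped" group-level permutation test: each permutation applies the same relabeling to every participant, yields one across-subjects mean accuracy, and those means form the null distribution. The classify() function and the data arrays are hypothetical stand-ins for a whole within-subject cross-validation, and details such as relabeling within runs or within training/test sets are glossed over.

# minimal sketch of a group-level permutation test with the relabeling kept
# linked ("striped") across participants: one relabeling scheme per permutation,
# applied to every subject; each permutation gives one group-mean accuracy.
# classify() and the data arrays are hypothetical, not the paper's code.
import numpy as np

def group_permutation_p(data, labels, classify, n_perm=5000, seed=0):
    # data: list of per-subject example arrays; labels: label vector shared by all subjects
    rng = np.random.RandomState(seed)
    true_mean = np.mean([classify(d, labels) for d in data])
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(labels))                            # ONE relabeling scheme ...
        null[i] = np.mean([classify(d, labels[perm]) for d in data])   # ... applied to every subject
    p = (np.sum(null >= true_mean) + 1) / (n_perm + 1)                 # proportion of null >= observed
    return p, true_mean, null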

Next, I'd wondered how the comparisons were made between the searchlight and ROI-based results.
The peak coordinates from searchlight analysis fall within the functional ROIs in PMd, PMv and IPS. I think the similarity between the two analyses, as claimed in our paper, should not be interpreted as "equality" in terms of the blob's shapes and extents, but their approximate anatomical locations. The two types of analysis are complementary and are based on different assumptions. The searchlight assumes that information is represented in multiple adjacent voxels (as shown in searchlight "blobs"), while the ROI analysis assumes that information is distributed within a spatially bounded region. Therefore, unless a ROI is directly defined from searchlight results, significant classification accuracy in one ROI does not necessarily imply that a searchlight blob would perfectly match the spatial distribution of that ROI.

I absolutely agree that searchlight and ROI-based analyses are complementary, based on different assumptions (and detect different types of information), and shouldn't perfectly overlap. Having the 'peak' searchlight coordinate fall into particular ROIs is a reasonable way to quantify the similarity. This is another area in which there is unfortunately no standard comparison technique (that I know of); I could imagine looking at the percent overlap or something as well.
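For what it's worth, both checks are only a few lines once the searchlight map and the ROI are arrays in the same space; a minimal sketch with hypothetical numpy arrays (not anything from the paper):

# minimal sketch: two simple comparisons between a searchlight map and an ROI,
# assuming 3D arrays in the same space. Hypothetical arrays, not the paper's code.
import numpy as np

def peak_in_roi(searchlight_acc, roi_mask):
    # does the searchlight peak (highest-accuracy voxel) fall inside the ROI?
    peak = np.unravel_index(np.argmax(searchlight_acc), searchlight_acc.shape)
    return bool(roi_mask[peak])

def percent_overlap(blob_mask, roi_mask):
    # percent of the ROI covered by the thresholded searchlight blob
    return 100.0 * (blob_mask & roi_mask).sum() / roi_mask.sum()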

and the questions

I'd asked three questions at the end of the previous post; here are answers from Jiaxiang for the first two (the third was answered above).

Question 1:
The PEIs for both chosen and specified trials were used in that ROI analysis (no new PEIs were created). 
So there were twice as many examples (all the chosen and all the specified) in the pooled analysis as in either context alone.
Yes. 

Question 2:
For the cross-stage classification, chosen and specified trials at the decision stage were modeled separately (chosen and specified trials at the decision stage were averaged together in the univariate analysis and the main MVPA). 
So new PEIs were made, including just the decision stage or just the maintenance stage?

The new first-level model for cross-stage classification includes both decision and maintenance stages, as in the main model.

A fascinating study (and methods); thanks again to Jiaxiang for generously answering my questions and letting me share our correspondence here!

Friday, August 9, 2013

indexing woes: FSL and caret

Back in June I shared the story of some headaches caused by inconsistent indexing: R (oro.nifti) and MRIcroN are 1-based, while python is 0-based. I later added SPM (1-based) and afni (0-based) to the collection; now, a kind person contributed images from doing the same test in FSL and caret: they are both 0-based.

As the person who sent the images wrote (I'm assuming wryly), ".... 'proper' programming languages (python, C) start counting/indexing from 0, while maths based applications like R and Matlab start from 1."


In FSL, the voxel at 17,15,20 is number 13 (0-based, like afni and pyMVPA), with voxel 25 at 18,16,21.

And the same (0-based) convention holds in caret, as shown in the corresponding images.
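Moving coordinates between the two conventions is just an off-by-one shift; a tiny Python/nibabel sketch (the filename is hypothetical, the coordinates are the ones from the FSL example above):

# tiny sketch: the same voxel under 0-based (python, afni, FSL, caret) and
# 1-based (R, MATLAB, SPM, MRIcroN) conventions. Filename is hypothetical.
import nibabel as nib

img = nib.load("indexing_test.nii.gz").get_fdata()
ijk_0based = (17, 15, 20)                      # as shown in FSL, afni, pyMVPA
ijk_1based = tuple(i + 1 for i in ijk_0based)  # same voxel in SPM, R, MRIcroN: (18, 16, 21)
print(img[ijk_0based])                         # python itself indexes from 0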

Thursday, August 8, 2013

Zhang 2013: "Choosing the Rules: Distinct and Overlapping Frontoparietal Representations of Task Rules for Perceptual Decisions"

"Choosing the Rules: Distinct and Overlapping Frontoparietal Representations of Task Rules for Perceptual Decisions" has interesting combinations of ROI-based and searchlight analyses. I think I figured out a lot of the methods, but all of the steps and choices were not clear. Frustratingly, why some of the thresholds were set as they were (e.g. for ROI definition) was not explained. Were these the only values tried? I tend to keep an eye out for explanations of why particular choices were made, especially when not "standard", aware of the ongoing discussions of problems caused by too many experimenter degrees of freedom. Few choices can really be considered standard, but I must say I look for more of an explanation when reading something like "the top 120 voxels were selected in each ROI" than "we used a linear SVM, c=1, libsvm package".

task

As usual, I'm going to focus on analysis details rather than the theoretical motivation and interpretation, but here's a bit of introduction.

The task design is shown at right (Figure 1 from the paper). The top shows what the subject saw (and when): first, a screen with two identical (or not) task cues, then a delay, then a moving-dot stimulus, then they responded and indicated which cue (rule) they'd used to make the response.

The moving-dot stimuli were always similar, but the response the subjects were supposed to make (which button to push) depended on the rule associated with each of the three cues (respond to dot size, color, or movement direction). Additionally, two cues were always presented simultaneously, but sometimes (the "specified" context) the same cue was shown twice (in which case the participant was to perform that rule), while other times (the "chosen" context) two different cues were shown (in which case the participant was to pick which of the two rules they wanted to follow).

To summarize, we're interested in distinguishing three rule cues (motion, color, size), across two contexts (specified and chosen). The task was in eight runs, with 30 trials in each run. The cues and contexts were in random order, but fully balanced within each run (5 repeats of each of the 2 contexts * 3 cues).

preprocessing/general

  • SPM8 for motion and slice-time correction. MVPA in subject space (normalized afterwards for group-level analysis).
  • They created PEIs in SPM for the MVPA, one per trial (so 240; 8 runs * 30 trials/run), with the timing matched to the length of the delay periods. { note: "PEIs" are parameter estimate images, made by fitting an HRF-aware linear model to the voxel timecourses and then doing MVPA on the beta weights produced by the model. }
  • Classification was with linear SVMs (c=1), Princeton MVPA toolbox (and libsvm), partitioning on the runs, within-subjects (each person classified separately, then results combined across subjects to get the group-level results). (A code sketch of this setup follows this list.)
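Here's a minimal sketch of that classification setup (linear SVM, c=1, leave-one-run-out partitioning, within-subject) in Python with scikit-learn, which also uses libsvm underneath. The paper used the Princeton MVPA toolbox, so this is an illustration of the scheme with made-up array names, not their code.

# minimal sketch of within-subject, leave-one-run-out classification with a
# linear SVM (C=1). Hypothetical arrays: X is (240, n_voxels) PEIs for one
# subject and ROI, y the rule labels, runs the run number (1-8) of each PEI.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def subject_accuracy(X, y, runs):
    clf = SVC(kernel="linear", C=1.0)
    accs = cross_val_score(clf, X, y, cv=LeaveOneGroupOut(), groups=runs)
    return accs.mean()        # average accuracy over the 8 run-wise folds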

The classifications were of three main types, each of which was carried out both in ROIs and searchlights:
  1. Rule classification, using either the chosen context only, the specified context only, or both (question #1). "Representation of chosen and specified rules: multivariate analysis" results section.
  2. Rule cross-classification, in which classifiers were trained to distinguish rules with trials from one context, then tested with trials from the other context (a code sketch of this follows the list). "Generalization and consistency of rule representation across contexts" results section.
  3. Stage cross-classification, in which classifiers were trained to distinguish rules using the "maintenance stage" and tested with images from the "execution stage", each context separately. This is classifying across parts of each trial, but it is not clear to me which images were classified for this analysis (question #2 below). "Invariant rule representation during maintenance and execution stages" results section.
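To make the cross-classification idea (#2) concrete, here's a minimal sketch: train the rule classifier on all of one context's examples, test on the other's, and average the two directions. Again Python/scikit-learn with hypothetical array names, and without the run-wise partitioning details; not the authors' code.

# minimal sketch of cross-context rule classification (#2 above): train on one
# context, test on the other, average the two directions. Hypothetical arrays.
import numpy as np
from sklearn.svm import SVC

def cross_context_accuracy(X_chosen, y_chosen, X_specified, y_specified):
    accs = []
    for Xtr, ytr, Xte, yte in [(X_chosen, y_chosen, X_specified, y_specified),
                               (X_specified, y_specified, X_chosen, y_chosen)]:
        clf = SVC(kernel="linear", C=1.0).fit(Xtr, ytr)
        accs.append(clf.score(Xte, yte))     # accuracy in the held-out context
    return np.mean(accs)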

ROI creation

NOTE: see update to this section here.

I was intrigued by how they defined the ROIs (other than BA17, which was from an anatomical atlas). Here's my guess at what they did; I'm not sure I figured out all the steps.
  1. Do a mass-univariate analysis, finding voxels with greater activity during the chosen than the specified context. ("Establishing and maintaining the neural representations of rules under different contexts: univariate analysis" section)
  2. Threshold the group map from #1 at p < 0.001 uncorrected, cluster size 35 voxels. 9 blobs pass this threshold, listed in Table 1 and shown in Figure 3A. (I think; two different thresholds seem to be used when describing the ROIs.)
  3. Use Freesurfer to make cortical surface maps of the group ROIs, then transform back into each person's 3d space. (I think; is this to eliminate non-grey-matter voxels? Why back and forth from cortical-surface to volumetric space?)
  4. Feature selection: pick the 120 voxels from each person, ROI, and training set with the largest "difference between conditions" from a one-way ANOVA. It's not clear what is meant by "conditions" in this paragraph; context, rule, both?
  5. Normalize each example across the 120 voxels (for each person, ROI, and training set).
I could find no mention of the stability of these final 120 voxels, nor of what proportion of starting ROI voxels 120 is, nor of why these particular thresholds were chosen (and if any others were tried).

It is proper to do the feature selection on the training set only, but interpretability might be affected if stability is low. In this case, they did leave-one-run-out cross-validation with eight runs, making eight versions of each ROI for each person. If these eight versions are nearly identical, it makes sense to consider them the same ROI. But if 120 voxels is something like 1/10th of the starting voxels, and a different tenth is chosen in each of the eight versions, shouldn't we think of them as eight different ROIs? Unfortunately, stability is not addressed in the manuscript, as far as I can tell.

a few other comments

  • I liked that they discussed both searchlight and ROI-based results (I've argued that searchlight analyses can be more powerful when combined with ROI-type analyses), but found the combination here quite difficult to interpret. I would have liked to see direct comparisons between the two analyses: how much do the searchlight 'blobs' (such as in Figure 5B) overlap with the 9 ROIs? Each section of the results had a phrase along the lines of "The searchlight analysis largely confirmed the ROI results (Fig. 5A), with significant rule coding in the PMv, ..." I don't see how they quantified these claims of similarity, however.
  • Significance testing for the ROI-based analyses was with permutation testing, following what I'd call a fold-wise scheme with both training and testing sets relabeled. How the group-level distribution was calculated is not described.
  • Significance testing for the searchlight analyses was done by spatially normalizing and then smoothing each individual's accuracy map, then doing a voxel-wise t-test for accuracy greater than chance (see the sketch below).
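That group test boils down to a voxel-wise one-sample t-test of the (normalized, smoothed) accuracy maps against chance; here's a minimal sketch with a hypothetical subjects-by-voxels array (chance is 1/3 for the three-rule classifications).

# minimal sketch of the searchlight group test: at each voxel, a one-sample
# t-test of the subjects' accuracies against chance. 'acc_maps' is a
# hypothetical (n_subjects, n_voxels) array of normalized, smoothed accuracies.
import numpy as np
from scipy.stats import ttest_1samp

def searchlight_group_t(acc_maps, chance=1/3.0):
    t, p_two = ttest_1samp(acc_maps, popmean=chance, axis=0)
    p_one = np.where(t > 0, p_two / 2.0, 1 - p_two / 2.0)   # one-sided: accuracy > chance
    return t, p_one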

questions

  1. in the "Representation of chosen and specified rules: multivariate analysis" section it says that one analysis was after "... pooling data from the chosen and specified conditions ...". Were separate PEIs made for this analysis (i.e. fitting all trials of each rule, regardless of context), or were PEIs made for the specified and chosen trials separately, then both used for this analysis (so that there were twice as many examples in the pooled analysis as either context alone)?
  2. Were new PEIs made for the cross-stage classification, were individual volumes classified, or something else? "Therefore, we constructed fMRI patterns of rule execution from activations at the onset of the RDK stimuli, and examined whether classifiers trained on patterns from maintenance stage could discriminate patterns from execution stage."
  3. Are my guesses for the steps of the ROI creation technique correct? Why were these thresholds chosen? Any other techniques for ROI creation tried? How stable were the ROIs after feature selection?

Update (22 August 2013): Jiaxiang Zhang answered my questions and corrected a few parts of this post here.

Zhang J, Kriegeskorte N, Carlin JD, & Rowe JB (2013). Choosing the rules: distinct and overlapping frontoparietal representations of task rules for perceptual decisions. The Journal of Neuroscience, 33(29), 11852-11862. PMID: 23864675

FYI: 1st International Workshop on High Dimensional Data Mining (HDM 2013)

The call for papers for the 1st International Workshop on High Dimensional Data Mining (HDM 2013) closes on 17 August 2013. It's not directly about MVPA, but we certainly should be interested in how to work with high-dimensional data!