Thursday, August 22, 2013

Zhang 2013: response from Jiaxiang Zhang

Jiaxiang Zhang kindly responded to the questions and comments in my previous post, and gave permission to excerpt and summarize our conversation here. Quotes from Jiaxiang are set off in quotation marks.

ROI creation

Quite a bit of my commentary was about the ROI creation process. Jiaxiang said that my outline was essentially correct, and provided some additional details and clarification. To make it a bit more readable, here's a new version of the "ROI creation" section from the previous post, answering my questions and adding in his corrections and responses.

  1. Do a mass-univariate analysis, finding voxels with significantly different activity during the chosen and specified contexts. (Two-sided, not "greater activity during the chosen than the specified context" as in my original version.)
  2. Threshold the group map from #1 at p < 0.001 uncorrected, cluster size 35 voxels.
    "The main univariate analysis (Figure 3A and Table 1) used a FWE-corrected threshold (p < 0.05). For ROI definition, a common uncorrected voxelwise threshold (p < 0.001) was used, which is more liberal for subsequent MVPA. The cluster extent had very limited effects here, because all the ROIs were much larger than the extent threshold."
  3. Use Freesurfer to make cortical surface maps of the group ROIs, then transform back into each person's 3d space.
    "We defined the grey-matter ROIs on a cortical surface (in Freesurfer), and then transformed back to the native space in NIfTI format (which can be loaded into the Princeton MVPA toolbox)."
    I asked if they'd had better results using Freesurfer than SPM for the spatial transformation, or if there was some other reason for their procedure.
    "This is an interesting issue. I have not directly compared the two, but I would guess the two methods will generate similar results."
  4. Feature selection: pick the 120 voxels from each person, ROI, and training set with the largest "difference between conditions" from a one-way ANOVA. Concretely, this ANOVA was of the three rules (motion, color, size), performed in each person and voxel separately (see the code sketch after this list). I asked why they chose 120 voxels (rather than all voxels or some other subset), and if they'd tried other subset sizes.
    "We have not tried to use all ROI voxels, because ROIs have different sizes across regions and subjects (e.g., V1). I think it is more appropriate (debatable) to fix the data dimensions. A fixed number of voxels after the feature selection for each participant and each ROI was used, to enable comparisons across regions and participants. 120 voxels were chosen because it was (roughly) the lower limit of all 190 individual ROIs' sizes (19 participants * 10 ROIs). 2 out of the 190 individual ROIs were smaller than 120 voxels (108 and 113 voxels, respectively), for which we used all voxels in that region."
  5. Normalize each example across the 120 voxels (for each person, ROI, and training set).
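To make steps 4 and 5 concrete, here's a minimal sketch of how the fold-wise feature selection and example normalization could be done. This is only an illustration, not their actual code: they used the Princeton MVPA toolbox, while this sketch uses python and scikit-learn, with a leave-one-run-out split and a linear SVM standing in for whatever cross-validation scheme and classifier were actually used. The names (X, y, runs, roi_classify) are hypothetical, and all inputs are assumed to be numpy arrays.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC

def roi_classify(X, y, runs, n_keep=120):
    """Cross-validated classification for one person and one ROI.
    X: examples x voxels; y: rule labels; runs: run labels defining the
    leave-one-run-out cross-validation folds."""
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=runs):
        # step 4: keep the voxels with the largest one-way ANOVA F across
        # the rule labels, computed on the training set only
        k = min(n_keep, X.shape[1])   # use all voxels if the ROI is smaller
        selector = SelectKBest(f_classif, k=k).fit(X[train_idx], y[train_idx])
        train = selector.transform(X[train_idx])
        test = selector.transform(X[test_idx])

        # step 5: normalize each example (row) across the selected voxels
        train = (train - train.mean(axis=1, keepdims=True)) / train.std(axis=1, keepdims=True)
        test = (test - test.mean(axis=1, keepdims=True)) / test.std(axis=1, keepdims=True)

        # a linear SVM stands in for whatever classifier was actually used
        clf = LinearSVC().fit(train, y[train_idx])
        accuracies.append(clf.score(test, y[test_idx]))
    return np.mean(accuracies)

The key point the sketch tries to capture is that the voxel selection is redone inside every fold, using only the training set, so the test set never influences which 120 voxels are kept.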

ROI stability

Several of my previous comments were about what, for lack of a better term, I'll call "ROI stability": were (pretty much) the same 120 voxels chosen across the cross-validation folds within each person, classification, and ROI? Here is our email exchange:
"The mean ROI size is 309 ± 134 voxels (±s.d. across ROIs and participants). I did a simple test on our data. Across ROIs and participants, 26.60% ± 19.00% of voxels were not selected in any cross-validations (i.e., “bad” voxels), 10.56% ± 5.36% of voxels were selected only in one fold of cross-validation (i.e., “unstable” voxels), and 48.07% ± 25.34% of voxels were selected in 4 or more folds of cross-validations (i.e., “more stable” voxels). Although this is not a precise calculation of stability, the selected voxels do not change dramatically between cross-validations. It'd be good to know if there are other more established methods to quantify stability."
(my reply:) That is interesting! Unfortunately, I don't know of any "established" methods to evaluate stability in ROI definition procedures like this; yours is the only one I can think of that used a fold-wise selection procedure. Did you look at the location of any of those voxel categories - e.g. are the "bad" voxels scattered around the ROIs or clustered?
"I do not see consistent spatial patterns on voxel categories."
I think this idea of stability, both how to measure it and how to interpret different degrees (e.g. what proportion of the voxels should be selected in each fold for a ROI to be considered stable?), is an interesting area and worthy of more attention.
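For concreteness, here's a rough sketch of the sort of fold-wise tally Jiaxiang describes: count how many cross-validation folds each voxel was selected in, then compute the proportion of "bad", "unstable", and "more stable" voxels. The function and argument names are made up for illustration; it assumes the selected-voxel indices from each fold were saved.

import numpy as np

def stability_proportions(selected_per_fold, n_roi_voxels):
    """selected_per_fold: list of index arrays, one per cross-validation
    fold, giving the voxels selected in that fold for one person and ROI."""
    counts = np.zeros(n_roi_voxels, dtype=int)
    for idx in selected_per_fold:
        counts[idx] += 1
    n_folds = len(selected_per_fold)
    return {
        "bad": np.mean(counts == 0),           # never selected
        "unstable": np.mean(counts == 1),      # selected in only one fold
        "more stable": np.mean(counts >= 4),   # selected in 4 or more folds
        "always": np.mean(counts == n_folds),  # selected in every fold
    }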

significance testing

My questions about significance testing were more general, trying to understand exactly what was done.

First, about the group-level permutation testing for the ROI analyses:
"For each of the 5000 permutations (in each participant and each ROI), classes in the training and test sets were relabelled, and the permuted data was analysed in the same procedure as the observed data. The order of the relabeling in each permutation was kept the same across participants. All the permuted mean accuracies formed the null distributions and were used to estimate the statistical significance of the observed classification accuracy."
(my reply:) If I understand correctly, this is similar to the "scheme 1 stripes" in http://mvpa.blogspot.com/2012/11/permutation-testing-groups-of-people.html? In other words, you didn't bootstrap for the group null distribution, but rather kept the linking across subjects: you calculated 5000 group-level averages for the null distribution, each being the across-subjects average after applying the same relabeling scheme to every participant. This method of doing the group-level permutation test seems quite sensible to me (when possible), but I haven't run across many examples of it in the literature.
"Yes. Your blog post is clearer on describing this method!"
How to concisely (but clearly) describe group-level permutation schemes (not to mention the impact of choosing various schemes) is another area worthy of more attention.
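Here's a little sketch of how I picture the group-level calculation: since the same relabeling is applied to every participant, each permutation yields one group-mean accuracy, and the observed group mean is compared against those 5000 group means. This is not their code; it assumes the permuted accuracies have already been computed and stored in a subjects-by-permutations array, and the names are hypothetical.

import numpy as np

def group_permutation_p(obs_accs, perm_accs):
    """obs_accs: each subject's true classification accuracy (length n_subjects);
    perm_accs: n_subjects x n_permutations array in which column j was computed
    with the SAME relabeling applied to every subject."""
    observed_mean = obs_accs.mean()
    null_means = perm_accs.mean(axis=0)   # one group mean per relabeling
    # one-sided p-value: rank of the observed group mean in the null
    # distribution (+1 in numerator and denominator so p is never zero)
    return (np.sum(null_means >= observed_mean) + 1) / (null_means.size + 1)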

Next, I'd wondered how the comparisons were made between the searchlight and ROI-based results.
"The peak coordinates from searchlight analysis fall within the functional ROIs in PMd, PMv and IPS. I think the similarity between the two analyses, as claimed in our paper, should not be interpreted as “equality” in terms of the blob’s shapes and extents, but their approximate anatomical locations. The two types of analysis are complementary and are based on different assumptions. The searchlight assumes that information is represented in multiple adjacent voxels (as shown in searchlight “blobs”), while the ROI analysis assumes that information is distributed within a spatially bounded region. Therefore, unless a ROI is directly defined from searchlight results, significant classification accuracy in one ROI does not necessarily imply that a searchlight blob would perfectly match the spatial distribution of that ROI."

I absolutely agree that searchlight and ROI-based analyses are complementary, based on different assumptions (and detect different types of information), and shouldn't perfectly overlap. Having the 'peak' searchlight coordinate fall into particular ROIs is a reasonable way to quantify the similarity. This is another area in which there is unfortunately no standard comparison technique (that I know of); I could imagine looking at the percent overlap or something as well.
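Both comparisons could be quantified with something as simple as the toy functions below, given binary masks in a common space for the ROI and the thresholded searchlight blob. The mask and function names are made up for illustration; this is not from the paper.

import numpy as np

def peak_in_roi(peak_ijk, roi_mask):
    # does the searchlight peak coordinate (voxel indices) fall inside the ROI?
    i, j, k = peak_ijk
    return bool(roi_mask[i, j, k])

def percent_overlap(blob_mask, roi_mask):
    # proportion of ROI voxels that are also in the thresholded searchlight blob
    return np.logical_and(blob_mask, roi_mask).sum() / roi_mask.sum()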

and the questions

I'd asked three questions at the end of the previous post; here are answers from Jiaxiang for the first two (the third was answered above).

Question 1:
"The PEIs for both chosen and specified trials were used in that ROI analysis (no new PEIs were created)."
(my reply:) So there were twice as many examples (all the chosen plus all the specified) in the pooled analysis as in either context alone.
"Yes."

Question 2:
"For the cross-stage classification, chosen and specified trials at the decision stage were modeled separately (chosen and specified trials at the decision stage were averaged together in the univariate analysis and the main MVPA)."
(my reply:) So new PEIs were made, including just the decision stage or just the maintenance stage?

"The new first-level model for cross-stage classification includes both decision and maintenance stages, as in the main model."

A fascinating study (and methods); thanks again to Jiaxiang for generously answering my questions and letting me share our correspondence here!
