Friday, January 18, 2013

low-accuracy subjects' influence on group statistics

background

This post comes from something I saw in some of my actual searchlight results. The searchlight analysis was performed within a large anatomical area (~1500 voxels, SVM within fairly small searchlights), separately in ~15 people. In most subjects nearly all searchlights classified quite well, but in two subjects most searchlights classified quite poorly. As usual, there were a few low-accuracy searchlights in the "good" subjects (people in whom most searchlights classified well), and a few high-accuracy searchlights in the "bad" subjects (people in whom most searchlights did not classify accurately).

I made the group-level results by performing a t-test at each voxel: are the subjects' accuracies greater than chance? (I did not smooth the individual searchlight maps first.) I saw a worrisome pattern in the group maps: the peak (best t-value) areas tended to coincide with where the "bad" subjects had their best (most accurate) searchlights. This outcome is not surprising (see below), but it is not what we want: applying a strict threshold to the group map will identify those peak voxels as significant. But those particular voxels came out as peaks because the "bad" subjects happened to have accurate searchlights in those locations, not because the voxels were more accurate across subjects: the group map was overly influenced by the low-accuracy subjects.
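To make that concrete, here is a minimal sketch of the group-level test in R (not the actual analysis code), assuming a hypothetical matrix acc.tbl of searchlight accuracies, with one row per subject and one column per voxel (searchlight center), and chance of 0.5:

  # acc.tbl: hypothetical accuracy matrix, one row per subject, one column per voxel
  # one-sample t-test at each voxel: are the subjects' accuracies greater than chance (0.5)?
  group.t <- apply(acc.tbl, 2, function(x) t.test(x, mu=0.5, alternative="greater")$statistic)
  which(group.t > 8)   # voxels surviving a strict t-value threshold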

simulation

Here's a little simulation to show the effect (R code is here). The code creates 15 searchlights, each of which contains 10 voxels. There are 12 people in the dataset, 10 of whom have signal in all searchlights, and 2 of whom have signal in only the first two searchlights. It's a two-class classification, using a linear SVM, partitioning on the two "runs". Voxel values were sampled from a normal distribution, with class-dependent means for voxels with signal and the same mean for both classes for voxels without signal.
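The linked R code is the definitive version; the sketch below just shows the general scheme, with stand-in values (the means, trial counts, and SVM cost are assumptions for illustration, not necessarily what the actual code uses):

  library(e1071)   # for svm()

  n.subs <- 12; n.sl <- 15; n.vox <- 10; n.runs <- 2
  n.trials <- 20   # trials per class per run (stand-in value)

  # simulate one searchlight for one person: trials (rows) by voxels (columns),
  # plus class and run labels. Voxels with signal get class-dependent means.
  make.data <- function(has.signal) {
    shift <- ifelse(has.signal, 0.3, 0)   # stand-in effect size
    vox.a <- matrix(rnorm(n.runs * n.trials * n.vox, mean=shift), ncol=n.vox)
    vox.b <- matrix(rnorm(n.runs * n.trials * n.vox, mean=-shift), ncol=n.vox)
    data.frame(class=rep(c("a", "b"), each=n.runs * n.trials),
               run=rep(rep(1:n.runs, each=n.trials), 2),
               rbind(vox.a, vox.b))
  }

  # classify one searchlight: leave-one-run-out linear SVM, averaging the fold accuracies
  classify.sl <- function(dat) {
    fold.acc <- rep(NA, n.runs)
    for (r in 1:n.runs) {
      train <- dat[dat$run != r, ]
      test  <- dat[dat$run == r, ]
      fit <- svm(x=train[, -(1:2)], y=as.factor(train$class), kernel="linear", cost=1)
      fold.acc[r] <- mean(predict(fit, test[, -(1:2)]) == test$class)
    }
    mean(fold.acc)
  }

  # accuracy table: 12 subjects in rows, 15 searchlights in columns.
  # subjects 11 and 12 are the "bad" ones: signal in searchlights 1 and 2 only.
  acc.tbl <- matrix(NA, nrow=n.subs, ncol=n.sl)
  for (s in 1:n.subs) {
    for (sl in 1:n.sl) {
      acc.tbl[s, sl] <- classify.sl(make.data(has.signal=((s <= 10) | (sl <= 2))))
    }
  }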

The classification accuracies of the 12 people in each of the 15 searchlights are summarized in these boxplots (the figure should get larger if you click on it; run the code to get the underlying numbers). The accuracies vary a bit over the people and searchlights, as expected given how the data were simulated. All of the "boxes" are well above chance, and all of the t-values (see figure labels) are above 5, so this seems reasonable.
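Continuing the sketch above (the actual figure comes from the linked code), boxplots with the per-searchlight t-values in the labels can be produced along these lines:

  # one-sample t-value for each searchlight: the 12 accuracies vs. chance (0.5)
  t.vals <- apply(acc.tbl, 2, function(x) round(t.test(x, mu=0.5)$statistic, 1))

  # boxplot of the subjects' accuracies in each searchlight, labeled with its t-value
  boxplot(acc.tbl, names=t.vals, xlab="searchlight (labels are group t-values)",
          ylab="accuracy", ylim=c(0.4, 1))
  abline(h=0.5, lty="dashed")   # chance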

The searchlights with the highest t-values are 1 and 2: the two searchlights that I assigned to have signal in the two "bad" subjects. You can see why in the boxplots: the bottom whiskers in searchlights 1 and 2 only reach down to 0.6, while all of the other searchlights have a whisker or outlier closer to chance. Some searchlights (like 8) have two outliers: the two "bad" subjects.

So the t-test didn't make an error: the first two searchlights should have the highest t-values, since they are the ones in which the individual accuracies are most consistently above chance. But this could have led to an improper conclusion: if we had applied a high threshold (t > 8 in this simulation) we might conclude that the first two searchlights are where the information is concentrated, when in actuality it is distributed across all of the searchlights.


what to do?

Follow-up testing or different group-level statistics can help to catch this type of situation. As is often the case, precise hypotheses are better (for example, if you want to find searchlights with significant classification in most subjects individually, test for that directly - don't assume that the searchlights with the best group t-values will also have that property).
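For example, here is a rough sketch of such a direct test on the simulated accuracies: a binomial test of each person's accuracy in each searchlight, then counting how many of the 12 people reach significance. The 40 test trials per accuracy is an assumed count for illustration; with real data you'd use the actual number of test trials.

  n.test <- 40   # assumed number of test trials behind each accuracy

  # binomial test of each subject's accuracy in each searchlight against chance
  p.tbl <- apply(acc.tbl, c(1, 2),
                 function(a) binom.test(round(a * n.test), n.test, p=0.5, alternative="greater")$p.value)

  # number of the 12 subjects in which each searchlight is individually significant
  colSums(p.tbl < 0.05)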

Here are a few suggestions for follow-up testing:
  • Look at the individual subjects' searchlight maps: Are a few subjects quite different from the rest (such as here, where two people had much lower accuracies than the others)? 
  • Sensitivity testing can help: how much does the group-level map change when individual subjects are left out? (A minimal sketch of this check follows the list.)
  • Do the group-level results align more closely with some subjects than others?
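
Here is the leave-one-subject-out sketch mentioned in the second bullet, again using the simulated acc.tbl (with real data, the full group map would be recalculated each time):

  # recompute each searchlight's group t-value, leaving out one subject at a time
  loso.t <- sapply(1:nrow(acc.tbl), function(s)
    apply(acc.tbl[-s, ], 2, function(x) t.test(x, mu=0.5)$statistic))

  # loso.t: searchlights in rows, left-out subject in columns. Columns that look very
  # different from the rest flag subjects with a large influence on the group map.
  round(loso.t, 1)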