Friday, May 6, 2016

"Classification Based Hypothesis Testing in Neuroscience", permuting

My previous post described the below-chance classification part of a recent paper by Jamalabadi et al.; this post gets into the parts on statistical inference and permutation testing.

First, I fully agree with Jamalabadi et al. that MVPA classification accuracy (or CCR, "correct classification rate", as they call it) alone is not sufficient for estimating effect size or establishing significance. As they point out, higher accuracy is better, but it can only be directly compared within a dataset: the number of examples, classifier, cross-validation scheme, etc. all influence whether or not a particular accuracy is "good". Concretely, it is very shaky to interpret a study as "better" if it classified at 84% while a different dataset in a different study classified at 76%; if, however, you find that changing the scaling of a dataset improves classification (of that single dataset) from 76 to 84%, you'd be more justified in calling it an improvement.

The classification accuracy is not totally meaningless, but you need something to compare it to for statistical inference. As Jamalabadi et al. put it (and I've also long advocated), "We propose that MVPA results should be reported in terms of P values, which are estimated using randomization tests." {Aside: I think it's ok to use parametric tests for group-level inference in particular cases and after checking their assumptions, but I prefer permutation tests and think they can provide stronger evidence.}
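To make the randomization-test logic concrete, here's a minimal R sketch of how such a p-value is computed; the true-labeled accuracy and the null accuracies below are made-up numbers, stand-ins for what a real classification analysis would produce:

```r
# minimal sketch of a randomization-test p-value: the p-value is the rank of
# the true-labeled accuracy within the null distribution of accuracies
set.seed(1);
true.acc <- 0.70;    # accuracy of the true-labeled dataset (made-up)
null.accs <- rnorm(1000, mean=0.5, sd=0.04);   # stand-in for 1000 relabeling accuracies
p.value <- (sum(null.accs >= true.acc) + 1) / (length(null.accs) + 1);
```

The +1 in the numerator and denominator counts the true labeling as one of the possible relabelings, so the p-value can never be exactly zero.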

But there's one part of the paper I do not agree with, and that's their discussion of the prevalence of highly non-normal null distributions. The figure at left is Figure 5 from the paper, showing the very skewed and non-normal null distributions that resulted from classifying simulated datasets in different ways (chance should be 0.5). They show quite a few skewed null distributions from different datasets in the paper, and in the Discussion state that, "For classification with cross-validation in typical life-science data, i.e., small sample size data holding small effects, the distribution of classification rates is neither normal nor binomial."

However, I am accustomed to seeing approximately normal null distributions with MVPA, even in situations with very small effects and sample sizes. For example, below are null distributions (light blue) from eight simulated datasets. Each dataset was created to have 20 people, each with 4 runs of imaging data, each of which has 10 examples of each of 2 classes, and a single 50-voxel ROI. I generated the "voxel" values from a standard normal, with varying amounts of bias added to the examples of one class to allow classification. Classification was with leave-one-run-out cross-validation within each person, then averaging across the runs for the group-level accuracy; 1000 label rearrangements for the permutation test, following a dataset-wise scheme, averaging across subjects like in this demo.
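Here's a sketch of how one of these simulated datasets can be generated (a minimal version of my own; the variable names and the bias value of 0.2 are just illustrative):

```r
# sketch of generating one simulated dataset: 20 people, each with 4 runs,
# 10 examples of each of 2 classes per run, and 50 "voxels" per example
n.subs <- 20; n.runs <- 4; n.per.class <- 10; n.voxels <- 50;
bias <- 0.2;   # amount of signal added to class "a"; varied across the eight datasets

make.sub <- function() {
  n.examples <- n.runs * n.per.class * 2;   # 80 examples per person
  dat <- matrix(rnorm(n.examples*n.voxels), nrow=n.examples);  # standard normal "voxels"
  labels <- rep(rep(c("a","b"), each=n.per.class), times=n.runs);
  runs <- rep(1:n.runs, each=n.per.class*2);
  dat[which(labels == "a"),] <- dat[which(labels == "a"),] + bias;  # add signal to class a
  list(dat=dat, labels=labels, runs=runs);
}

set.seed(1);
all.subs <- lapply(1:n.subs, function(i) make.sub());
```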

The reddish line in each plot is the accuracy of the true-labeled dataset, which you can see increases from left to right across the simulated datasets, from 0.51 (barely above chance) to 0.83 (well above chance). The permutation test (perm. p) becomes more significant as the accuracy increases, since the true-labeled accuracy shifts to the right of the null distribution.

Note, however, that the null distributions are nearly the same and approximately normal for all eight datasets. This is sensible: while the amount of signal in the simulated datasets increases, they all have the same number of examples, participants, classification algorithm (linear SVM, c=1), and cross-validation scheme. The different amounts of signal don't affect the permutation datasets: since the labels were randomized (within each subject and run), all permutation datasets are non-informative, and so produce similar null distributions. The null distributions above are for the group level (with the same dataset-wise permutation relabelings used within each person); I typically see more variability in individual null distributions, but still approximate normality.
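To make the dataset-wise scheme concrete, here's a sketch of generating a single relabeling: the labels are shuffled within each run, and that one set of shuffled labels is then applied to every subject for the permutation, rather than drawing a fresh shuffle for each person:

```r
# sketch of one dataset-wise relabeling: shuffle the labels within each run,
# then use the same shuffled labels for every subject in this permutation
n.runs <- 4; n.per.class <- 10;
true.labels <- rep(c("a","b"), each=n.per.class);   # the true labels within one run
set.seed(2);
relabeling <- lapply(1:n.runs, function(r) sample(true.labels));  # one shuffle per run
# this single relabeling is applied to all 20 subjects; repeating for 1000
# relabelings (and re-running the classification each time) gives the null distribution
```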

I suspect that the skewed null distributions obtained by Jamalabadi et al. are due either to the way in which the labels were permuted (particularly, that they might not have followed a dataset-wise scheme), or to the way the datasets were generated (which can have a big impact). Regardless, I have never seen null distributions in real data as highly skewed as those reported by Jamalabadi et al.

Jamalabadi H, Alizadeh S, Schönauer M, Leibold C, & Gais S (2016). Classification based hypothesis testing in neuroscience: Below-chance level classification rates and overlooked statistical properties of linear parametric classifiers. Human Brain Mapping, 37(5), 1842-55. PMID: 27015748

Monday, April 25, 2016

"Classification Based Hypothesis Testing in Neuroscience"

There's a lot of interesting MVPA methodology in a recent paper by Jamalabadi et al., with the long (but descriptive) title "Classification Based Hypothesis Testing in Neuroscience: Below-Chance Level Classification Rates and Overlooked Statistical Properties of Linear Parametric Classifiers". I'll focus on the below-chance classification part here, and hopefully get to the permutation testing parts in detail in another post. The very short version: I have no problem at all with their advice to report p-values and null distributions from permutation tests to evaluate significance, and agree that accuracy alone is not sufficient, but they have some very oddly-shaped null distributions, which make me wonder about their permutation scheme.

Anyway, the below-chance discussion is mostly in the section "Classification Rates Below the Level Expected for Chance" and Figure 3, with proofs in the appendices. Jamalabadi et al. set up a series of artificial datasets, designed to have differing amounts of signal and numbers of examples. They get many below-chance accuracies when "sample size and estimated effect size is low", which they attribute to "dependence on the subsample means":
 "Thus, if the test mean is a little above the sample mean, the training mean must be a little below and vice versa. If the means of both classes are very similar, the difference of the training means must necessarily have a different sign than the difference of the test means. This effect does not average out across folds, ..."
They use Figure 3 to illustrate this dependence in a toy dataset. That figure is really too small to see online, so here's a version I made (R code after the jump if you want to experiment).
This is a toy dataset with two classes (red and blue), 12 examples of each class. The red class is from a normal distribution with mean 0.1; the blue, a normal distribution with mean -0.1. The full dataset (at left) shows a very small difference between the classes: the mean of the blue class is a bit to the left of the mean of the red class (top row triangles); the line separates the two means.

Following Jamalabadi et al.'s Figure 3, I then did a three-fold cross-validation, leaving out four examples each time. One of the folds is shown in the right image above; the four left-out examples in each class are crossed out with a black x. The diamonds are the means of the training set (the eight not-crossed-out examples in each class). The crossed diamonds are the means of the test set (the four crossed-out examples in each class), and they are flipped: the blue mean is on the red side, and the red mean on the blue side. Looking at the positions of the examples, all of the examples in the blue test set will be classified wrong, as will all but one of the red: an accuracy of 1/8, which is well below chance.

This is the "dependence on subsample means": pulling out the test set shifts the means of the remaining examples (training set) in the other direction, making performance worse (in the example above, the training set means are further from zero than the full dataset). This won't matter much if the two classes are very distinct, but can have a strong impact when they're similar (small effect size), like in the example (and many neuroimaging datasets).
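The flip is easy to check numerically (toy numbers of my own, matching the setup above): the full-sample mean of a class is a weighted average of its training and test fold means, so the two fold means must always fall on opposite sides of it:

```r
# numeric sketch of the dependence on subsample means, for one class
set.seed(4);
red <- rnorm(12, mean=0.1);   # one class: 12 examples, small effect
test.id <- 1:4;               # the four left-out examples in one fold
sample.mean <- mean(red);
test.mean <- mean(red[test.id]);
train.mean <- mean(red[-test.id]);
# sample.mean = (4*test.mean + 8*train.mean)/12, so if the test-fold mean is
# above the sample mean, the training mean must be below it, and vice versa:
(test.mean - sample.mean) * (train.mean - sample.mean) <= 0   # always TRUE
```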

Is this an explanation for below-chance classification? Yes, I think it could be. It certainly fits well with my observations that below-chance results tend to occur when power is low, and should not be interpreted as anti-learning, but rather as a sign of poor performance. My advice for now remains the same: if you see below-chance classification, troubleshoot and try to boost power; but I think we now have more understanding of how below-chance performance can happen.

Jamalabadi H, Alizadeh S, Schönauer M, Leibold C, & Gais S (2016). Classification based hypothesis testing in neuroscience: Below-chance level classification rates and overlooked statistical properties of linear parametric classifiers. Human Brain Mapping, 37(5), 1842-55. PMID: 27015748

follow the jump for the R code to create the image above

Tuesday, March 15, 2016

pointer: PRNI 2016 paper submission deadline next week

The paper submission deadline for Pattern Recognition in Neuroimaging (PRNI) 2016 has been extended to 24 March, so be sure to get your manuscript in! For those of you with a psychology-type background, note that papers accepted to PRNI are cite-able, peer-reviewed publications, and will be published by IEEE.

This workshop is a great way to meet people interested in MVPA methods; not just SVMs and fMRI (though that's present), but also MEG, EEG, structural MRI, Bayesian methods, model interpretation, etc, etc. PRNI is in Trento, Italy this year, with a shuttle bus to Geneva, Switzerland for those attending OHBM as well (PRNI is held immediately prior to OHBM).

Tuesday, February 23, 2016

distance metrics: what do we mean by "similar"?

There are many ways of quantifying the distance (aka similarity) between timecourses (or any numerical vector), and distance metrics sometimes vary quite a bit in which properties they use to quantify similarity. As usual, it's not that one metric is "better" than another, it's that you need to think about what constitutes "similar" and "different" for a particular project, and pick a metric that captures those characteristics.

I find it easiest to understand the properties of different metrics by seeing how they sort simple timecourses. This example is adapted from Chapter 8 (Figure 8.1 and Table 8.1) of H. Charles Romesburg's Cluster Analysis for Researchers. This is a highly readable book, and I highly recommend it as a reference for distance metrics (he covers many more than I do here), clustering, tree methods, etc. The computer program sections are dated, but are such a minor part of the text that it's not a problem.

Here are the examples we'll be measuring the distance between (similarity of). To make it concrete, you could imagine these are the timecourses of five voxels, each measured at four timepoints. The first four timeseries are the examples from Romesburg's Figure 8.1. Timeseries 1 (black line) is the "baseline"; 2 (blue line) is the same shape as 1, but shifted up 15 units; 3 (red line) is baseline * 2; and 4 (green line) is the mirror image of the baseline (line 1, reflected across y = 20). I added line 5 (grey) to have a mean y similar to the baseline's, but a shape closer to line 4's.

Here are the values for the same five lines:
 data1 <- c(20,40,25,30);   
 data2 <- c(35,55,40,45);   
 data3 <- c(40,80,50,60);   
 data4 <- c(20, 0,15,10); # first four from Romesburg Table 8.1  
 data5 <- c(30,20,26,20); # and another line  

Which lines are most similar? Three metrics.

If we measure with Euclidean distance, lines 1 and 5 are closest. The little tree at right is built from the distance matrix printed below, using the R code that follows. I used R's built-in function to calculate the Euclidean distance between each pair of lines, putting the results into tbl in the format needed by hclust.

Euclidean distance pretty much sorts the timecourses by their mean y: 1 is most similar (smallest distance) to 5, next-closest to 2, then 4, then 3 (read these distances from the first column in the table at right).

Thinking of these as fMRI timecourses, Euclidean distance pretty much ignores the "shape" of the lines (voxels): 1 and 5 are closest, even though voxel 1 has "more BOLD" at timepoint 2 and voxel 5 has "less BOLD" at timepoint 2. Likewise, voxel 1 is closer to voxel 4 (its mirror image) than to voxel 3, though I think most people would consider timecourses 1 and 3 more "similar" than 1 and 4.

 all.data <- rbind(data1, data2, data3, data4, data5); # one timecourse per row  
 tbl <- array(NA, c(5,5));   
 for (i in 1:5) {  # fill the lower triangle (and diagonal) with the pairwise distances  
   for (j in 1:i) { tbl[i,j] <- dist(rbind(all.data[i,], all.data[j,]), method='euclidean'); }   
 }   
 plot(hclust(as.dist(tbl)), xlab="", ylab="", sub="", main="euclidean distance"); # simple tree  

Pearson correlation sorts the timecourses very differently than Euclidean distance.

Here's the tree and table for the same five timecourses, using 1-Pearson correlation as the distance metric. Now, lines 2 and 3 are considered exactly the same (correlation 1, distance 0) as line 1; in Romesburg's phrasing, Pearson correlation is "wholly insensitive" to additive and proportional translations. Consistently, lines 4 and 5 (the "downward pointing" shapes) are grouped together, while line 4 (the mirror image) is maximally dissimilar to line 1.

So, Pearson correlation may be suitable if you're more interested in shape than magnitude. In the fMRI context, we could say that correlation considers timecourses that go up and down together as similar, ignoring overall BOLD. If you care that one voxel's timecourse has higher BOLD than another (here, like 2 or 3 being higher than 1), you don't want to use Pearson correlation.

 all.data <- rbind(data1, data2, data3, data4, data5); # one timecourse per row  
 tbl <- array(NA, c(5,5));  
 for (i in 1:5) {  # correlate each pair of timecourses  
   for (j in 1:i) { tbl[i,j] <- cor(all.data[i,], all.data[j,]); }  # method="pearson" is default  
 }  
 plot(hclust(as.dist(1-tbl)), xlab="", ylab="", sub="", main="1 - Pearson correlation");   

The coefficient of shape difference metric (page 99 in Romesburg) mixes a bit of the properties of Euclidean distance and Pearson correlation: it ignores additive translations, but is sensitive to proportional translations.

As seen here, like correlation, the coefficient of shape difference considers lines 1 and 2 identical (maximally similar), but unlike correlation, line 3 is not considered identical to line 1. Like Euclidean distance, the coefficient of shape difference considers lines 3 and 4 farther apart than any other pair of lines (correlation listed 1, 2, and 3 as all equally far from line 4).

I didn't find the coefficient in a built-in R function, but its equation (8.3 in Romesburg) is very simple to implement, as in the code below.

I've tried using the coefficient of shape difference in a few analyses, as its property of being sensitive to proportional translations more closely matches my intuitive understanding of "similar" timecourses. I haven't used it in any published analyses yet, as Pearson correlation has done better. But it certainly seems worth considering.

 # coefficient of shape difference, page 99 of Romesburg  
 get.dist <- function(dataA, dataB) {  
   if (length(dataA) != length(dataB)) { stop("length(dataA) != length(dataB)"); }  
   n <- length(dataA); # number of dimensions  
   sqrt((n/(n-1))*(sum((dataA-dataB)^2)/n - ((1/(n^2))*(sum(dataA) - sum(dataB))^2)));  
 }  
 all.data <- rbind(data1, data2, data3, data4, data5); # one timecourse per row  
 tbl <- array(NA, c(5,5));  
 for (i in 1:5) { for (j in 1:i) { tbl[i,j] <- get.dist(all.data[i,], all.data[j,]); } }  
 plot(hclust(as.dist(tbl)), xlab="", ylab="", sub="", main="coeff of shape diff");  

These three ways of measuring the similarity (distance) between timecourses are certainly not the only metrics, but I hope it's clear from just these three that the metric matters; they're not interchangeable.

Tuesday, February 9, 2016

R demo: specifying side-by-side boxplots in base R

This post has the base R code (pasted after the jump below, and available here) to produce the boxplot graphs shown at right.

Why base R graphics? I certainly have nothing against ggplot2, lattice, or other graphics packages, and used them more in the past. I've found it easier, though (as in the demo code), to specify the values and location for each boxplot rather than generating complex conditioning statements. This may be a bit of the opposite of the logic of the Grammar of Graphics, but oh well.
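As a tiny illustration of what I mean (toy data and positions of my own, not the demo's), each boxplot can be drawn at an explicit x position with add=TRUE, so the grouping is specified directly rather than through a formula:

```r
# toy sketch: side-by-side boxplots at explicit x positions in base R
set.seed(3);
g1a <- rnorm(20, 0);  g1b <- rnorm(20, 0.5);   # group 1, conditions a and b
g2a <- rnorm(20, 1);  g2b <- rnorm(20, 1.5);   # group 2, conditions a and b
boxplot(g1a, at=1, xlim=c(0.5,5.5), ylim=c(-3,4), col='cornflowerblue');
boxplot(g1b, at=2, col='salmon', add=TRUE);
boxplot(g2a, at=4, col='cornflowerblue', add=TRUE);  # skipping x=3 leaves a gap between groups
boxplot(g2b, at=5, col='salmon', add=TRUE);
axis(1, at=c(1.5,4.5), labels=c("group 1", "group 2"));  # one label under each pair
```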

Sunday, January 3, 2016

yep, it's worth the time to publish code

I have a paper that's getting close to being published in PLOSone. After review, but prior to publication, they require not only that you agree in principle to share the dataset, but that it actually be shared. This can be tricky with neuroimaging datasets (size, privacy, etc.), but is of course critically important. It's easy to procrastinate on putting in the time necessary to make the datasets and associated code public; and easy to be annoyed at PLOS for requiring sharing to be set up prior to publication, despite appreciating that such a policy is highly beneficial.

As you can guess from the post title, in the process of cleaning up the code and files for uploading to the OSF, I found a coding bug. (There can't be many feelings more horrible as a scientist than finding a bug in the code for a paper that's already been accepted!) The bug was that when calculating the accuracy across the cross-validation folds, one of the folds' accuracies was omitted. Thankfully, this was a 15-fold cross-validation, so fixing the code so that the mean is calculated over all 15 folds instead of just 14 made only a minuscule difference in the final results, nothing that changed any interpretations.

Thankfully, since the paper is not yet published, it was simple to correct the relevant table. But had I waited to prepare the code for publishing until after the paper had been published (or not reviewed the code at all), I would not have caught the error. Releasing the code necessary to create particular figures and tables is a less complicated undertaking than designing fully reproducible analyses, such as what Russ Poldrack's group is working on, but still nontrivial, in terms of both effort and benefits.

How to avoid this happening again? A practical step might be to do the "code clean up" step as soon as a manuscript goes under review: organize the scripts, batch files, whatever, that generated each figure and table in the manuscript, and, after reviewing them yourself, have a colleague spend a few hours looking them over and confirming that they run and are sensible. In my experience, it's easy for results to become disassociated from how they were generated (e.g., figures pasted into a powerpoint file), and so for bugs to persist undetected (or errors, such as a 2-voxel radius searchlight map being labeled as from a 3-voxel radius searchlight analysis). Keeping the code and commentary together in working files (e.g., via knitr) helps, but there's often no way around rerunning the entire final analysis pipeline (i.e., starting with preprocessed NIfTIs) to be sure that all the steps actually were performed as described.

UPDATE 1 February 2016: The paper and code are now live.

Friday, December 18, 2015

balanced accuracy: what and why?

I was a bit thrown at OHBM by mentions of using "balanced" accuracy to describe classification performance ... what's wrong with "regular" accuracy? Well, nothing is wrong with "regular" accuracy, if your dataset is already balanced (meaning equal numbers of trials in each class and cross-validation fold, which I very highly recommend). But if your test set is not balanced, "balanced" accuracy is probably better than "regular".

At left is an example of a confusion matrix produced by a classifier, where the test set was balanced, with 10 examples of class face, and 10 of class place. The classifier did quite well: 9 of the 10 face examples were (correctly) labeled face, and 8 of the 10 place examples were (correctly) labeled place. For lack of a better term, what I'll call "regular" or "overall" accuracy is calculated as shown at left: the proportion of examples correctly classified, counting all four cells in the confusion matrix. Balanced accuracy is calculated as the average of the proportion correct within each class. In this example, both the overall and balanced calculations produce the same accuracy (0.85), as will always happen when the test set has the same number of examples in each class.

This second example is the same as the first, but I took out one of the place test examples: now the testing dataset is not balanced, as it has 10 face examples but only 9 place examples. As you can see, the overall accuracy is now a bit higher than the balanced.

It's easy to generate more confusion matrices to get a feel for how the two ways of calculating accuracy differ, such as with this R code:

 c.matrix <- rbind(c(9,1),   
                   c(2,8));  # the first example: rows are true face, true place  
 sum(diag(c.matrix))/sum(c.matrix);  # "overall" proportion correct  
 first.row <- c.matrix[1,1] / (c.matrix[1,1] + c.matrix[1,2]);  
 second.row <- c.matrix[2,2] / (c.matrix[2,1] + c.matrix[2,2]);  
 (first.row + second.row)/2; # "balanced" proportion correct  

So, should we use balanced accuracy instead of overall? Yes, it's probably better to use balanced accuracy when there's just one test set, and it isn't balanced. I tend to be extremely skeptical about interpreting classification results when the training set is not balanced, and would want to investigate a lot more before deciding that balanced accuracy reliably compensates for unbalanced training sets. However, it's probably fine to use balanced accuracy with unbalanced test sets in situations like cross-classification, where a classifier is trained once on a balanced training set (e.g., one person's dataset), and then tested once (e.g., on another person's dataset). Datasets requiring cross-validation need to be fully balanced, because each testing set contributes to the training set in other folds.

For more, see Brodersen, Kay H., Cheng Soon Ong, Klaas E. Stephan, and Joachim M. Buhmann. 2010. "The balanced accuracy and its posterior distribution." DOI: 10.1109/ICPR.2010.764