Tuesday, February 9, 2016

R demo: specifying side-by-side boxplots in base R

This post has the base R code (pasted after the jump below, and available here) to produce the boxplot graphs shown at right.

Why base R graphics? I certainly have nothing against ggplot2, lattice, or other graphics packages, and used them more in the past. I've found it easier, though (as in the demo code), to specify the values and location for each boxplot instead of generating complex conditioning statements. This is rather the opposite of the logic of the Grammar of Graphics, but oh well.
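The full demo code isn't reproduced in this post (see the link above), but here is a minimal sketch of the approach with made-up data: each boxplot is drawn at an explicitly specified x-axis location via at and add=TRUE, rather than through a formula or conditioning statement.

 # toy data: two groups, each measured under three conditions (values made up)
 set.seed(42);
 grp1 <- lapply(1:3, function(i) { rnorm(20, mean=i) });
 grp2 <- lapply(1:3, function(i) { rnorm(20, mean=i+0.5) });
 
 # set up a blank plot, then place each boxplot at an explicit x location
 plot(x=0, y=0, type='n', xlim=c(0.5,6.5), ylim=range(c(unlist(grp1), unlist(grp2))),
      xaxt='n', xlab="condition", ylab="value", main="side-by-side boxplots");
 for (i in 1:3) {   # one pair of boxplots per condition
   boxplot(grp1[[i]], at=(2*i)-1, add=TRUE, col='cornflowerblue', xaxt='n');
   boxplot(grp2[[i]], at=2*i, add=TRUE, col='salmon', xaxt='n');
 }
 axis(side=1, at=c(1.5, 3.5, 5.5), labels=paste("condition", 1:3));
 legend('topleft', fill=c('cornflowerblue','salmon'), legend=c("group 1", "group 2"));

The point isn't the particular aesthetics, just that the position of every boxplot (and of the axis labels) is under direct control.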

Sunday, January 3, 2016

yep, it's worth the time to publish code

I have a paper that's getting close to being published in PLOS ONE. After review, but prior to publication, they require not only that you agree in principle to share the dataset, but that it actually be shared. This can be tricky with neuroimaging datasets (size, privacy, etc.), but is of course critically important. It's easy to procrastinate on putting in the time necessary to make the datasets and associated code public, and easy to be annoyed at PLOS for requiring sharing to be set up prior to publication, despite appreciating that such a policy is highly beneficial.

As you can guess from the post title, in the process of cleaning up the code and files for uploading to the OSF, I found a coding bug. (There can't be many more horrible feelings as a scientist than finding a bug in the code for a paper that's already been accepted!) The bug was that when calculating the accuracy across the cross-validation folds, one fold's accuracy was omitted. Thankfully, this was a 15-fold cross-validation, so fixing the code so that the mean is calculated over all 15 folds instead of just 14 made only a minuscule difference in the final results, nothing that changed any interpretations.
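Not the actual code, but an omission of this general flavor is an easy one to write (and to miss); a toy illustration:

 fold.accs <- runif(15, min=0.5, max=0.7);   # stand-in accuracies for a 15-fold cross-validation
 mean(fold.accs[1:14]);    # buggy: the hard-coded index silently drops the 15th fold
 mean(fold.accs);          # fixed: average over all 15 folds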

Thankfully, since the paper is not yet published, it was simple to correct the relevant table. But had I waited to prepare the code for publishing until after the paper had been published (or not reviewed the code at all), I would not have caught the error. Releasing the code necessary to create particular figures and tables is a less complicated undertaking than designing fully reproducible analyses, such as what Russ Poldrack's group is working on, but still nontrivial, in terms of both effort and benefits.

How to avoid this happening again? A practical step might be to do the "code clean up" step as soon as a manuscript goes under review: organize the scripts, batch files, whatever, that generated each figure and table in the manuscript, and, after reviewing them yourself, have a colleague spend a few hours looking them over and confirming that they run and are sensible. In my experience, it's easy for results to become dissociated from how they were generated (e.g., figures pasted into a PowerPoint file), and so for bugs to persist undetected (or errors, such as a 2-voxel radius searchlight map being labeled as from a 3-voxel radius searchlight analysis). Keeping the code and commentary together in working files (e.g., via knitr) helps, but there's often no way around rerunning the entire final analysis pipeline (i.e., starting with the preprocessed NIfTIs) to be sure that all the steps actually were performed as described.

UPDATE 1 February 2016: The paper and code are now live.

Friday, December 18, 2015

balanced accuracy: what and why?

I was a bit thrown at OHBM by mentions of using "balanced" accuracy to describe classification performance ... what's wrong with "regular" accuracy? Well, nothing is wrong with "regular" accuracy, if your dataset is already balanced (meaning equal numbers of trials in each class and cross-validation fold, which I very highly recommend). But if your test set is not balanced, "balanced" accuracy is probably better than "regular".

At left is an example of a confusion matrix produced by a classifier, where the test set was balanced, with 10 examples of class face and 10 of class place. The classifier did quite well: 9 of the 10 face examples were (correctly) labeled face, and 8 of the 10 place examples were (correctly) labeled place. For lack of a better term, what I'll call "regular" or "overall" accuracy is calculated as shown at left: the proportion of examples correctly classified, counting all four cells in the confusion matrix. Balanced accuracy is calculated as the average of the proportion correct within each class. In this example, both the overall and balanced calculations produce the same accuracy (0.85), as will always happen when the test set has the same number of examples in each class.

This second example is the same as the first, but I took out one of the place test examples: now the test set is not balanced, as it has 10 face examples but only 9 place examples. As you can see, now the overall accuracy is a bit higher than the balanced.


It's easy to generate more confusion matrices to get a feel for how the two ways of calculating accuracy differ, such as with this R code:


 c.matrix <- rbind(c(1,0),     # rows: true class; columns: assigned class
                   c(1,1));    # so the diagonal cells hold the correct classifications
 sum(diag(c.matrix))/sum(c.matrix);  # "overall" proportion correct
 first.row <- c.matrix[1,1] / (c.matrix[1,1] + c.matrix[1,2]);   # proportion correct, first class
 second.row <- c.matrix[2,2] / (c.matrix[2,1] + c.matrix[2,2]);  # proportion correct, second class
 (first.row + second.row)/2;  # "balanced" proportion correct
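For instance, plugging in the confusion matrix from the first example above (rows are the true classes, columns the assigned labels), and then one possible version of the unbalanced second example (the exact matrix isn't shown in this post):

 c.matrix <- rbind(c(9,1),     # first example: 10 face examples, 9 labeled face
                   c(2,8));    # 10 place examples, 8 labeled place
 sum(diag(c.matrix))/sum(c.matrix);         # overall: 17/20 = 0.85
 mean(diag(c.matrix)/rowSums(c.matrix));    # balanced: (0.9 + 0.8)/2 = 0.85
 
 c.matrix <- rbind(c(9,1),     # removing a correctly-labeled place example leaves
                   c(2,7));    # 10 face but only 9 place examples
 sum(diag(c.matrix))/sum(c.matrix);         # overall: 16/19 = 0.842
 mean(diag(c.matrix)/rowSums(c.matrix));    # balanced: (0.9 + 7/9)/2 = 0.839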

So, should we use balanced accuracy instead of overall? Yes, it's probably better to use balanced accuracy when there's just one test set, and it isn't balanced. I tend to be extremely skeptical about interpreting classification results when the training set is not balanced, and would want to investigate a lot more before deciding that balanced accuracy reliably compensates for unbalanced training sets. However, it's probably fine to use balanced accuracy with unbalanced test sets in situations like cross-classification, where a classifier is trained once on a balanced training set (e.g., one person's dataset), and then tested once (e.g., on another person's dataset). Datasets requiring cross-validation need to be fully balanced, because each testing set contributes to the training set in other folds.

For more, see Brodersen, Kay H., Cheng Soon Ong, Klaas E. Stephan, and Joachim M. Buhmann. 2010. "The balanced accuracy and its posterior distribution." DOI: 10.1109/ICPR.2010.764

Thursday, November 19, 2015

"Assessment of tonotopically organised subdivisions in human auditory cortex using volumetric and surface-based cortical alignments"

Given my occasional forays into surface analysis (and lack of conviction that it's necessarily a good idea for fMRI), I was intrigued by some of the analyses in a recent paper from Dave Langers ("Assessment of tonotopically organised subdivisions in human auditory cortex using volumetric and surface-based cortical alignments").

For task fMRI, the most transparent way to compare different analysis methods seems to be to use a task that's fairly well defined and anatomically predictable. Motion and primary senses are probably most tractable; for example, a task that has blocks of finger-tapping and toe-wiggling should produce very strong signal, and be distinguishable in motor cortex. This gives a basis of comparison: with which method do we see the finger and toe signals most distinctly?

Langers (2014) investigated tonotopic maps in auditory cortex, which map the frequency of heard sounds, using volumetric and surface analysis. While tonotopic maps are not fully understood (see the paper for details!), this is a well-defined question for comparing surface and volumetric analysis of fMRI data: we know where primary auditory cortex is located, and we know when people were listening to which sounds. He used a Hotelling's T2-related statistic to describe the "tonotopic gradient vectors", which reminds me a bit of cvMANOVA and the LD-t, but I'll just concentrate on the surface vs. volume aspects here.

This is Figure 2 from the paper, which gives a flowchart of the procedures for the surface and volume versions of the analysis. He mapped the final volumetric results onto a (group) surface to make it easier to compare the surface and volume results, but the preprocessing and statistics were carried out separately (and, it seems to me, reasonably): SPM8 for volumetric, FreeSurfer for surface. The fMRI voxels were small - just 1.5 x 1.5 x 1.5 mm - plausibly small enough to support surface analysis.

So, which did better, surface or volume? Well, to quote from the Discussion, "... the activation detected by the surface-based method was stronger than that according to the volumetric method. At the same time, the related statistical confidence levels were more or less the same." The image below is part of Figure 4 from the paper, showing a few of the group-level results (take a look at the full paper). As he observes, "The surface-based results had a noisier appearance than the volumetric ones, in particular for the frequency-related outcomes shown in Figure 4c–e."


So, in this paper, while the surface analysis seemed to result in better anatomic alignments between people, the activation maps were not clearer or more significant. I'd very much like to see more comparisons of this type (particularly with HCP processing, given its unique aspects), to see if this is a common pattern, or something unique to tonotopic maps in auditory cortex and this particular protocol.


Langers, D. (2014). Assessment of tonotopically organised subdivisions in human auditory cortex using volumetric and surface-based cortical alignments. Human Brain Mapping, 35(4), 1544-1561. DOI: 10.1002/hbm.22272

Wednesday, November 11, 2015

positive control analyses and checking data quality

I've long advocated that much pain can be avoided if analysis of a new dataset is begun with positive control analyses: classifying (or whatever) something that should produce a very strong signal in easy-to-identify regions. Button-presses often are good controls (particularly if the button boxes were in different hands): when classifying whether the left or right hand was moved (totally ignoring the experimental conditions), do you find motor areas? Classifying the presence or absence of visual stimuli is also a good, strong control. Once a positive control is found, some of the analysis choices and data quality checks can be run on the control analysis and then carried over to the target analysis, reducing the chance of exploiting too many experimenter degrees of freedom.

Another reason to start with positive control analyses is simply to identify problems in the dataset. If the control analysis fails in a particular person, why? Were the event timings mislabeled? Movement too high? Preprocessing failed? I'd be very worried about interpreting the results of a subtle cognitive task in a person whose data is of too poor a quality to support classifying something as strong as hand movements.

The rest of this post is an example of what positive control analyses can look like, and how their results compare to measures of general dataset quality. Several practiCal fMRI posts were very useful for thinking about how to visualize the image quality, particularly this one describing temporal SNR and this one showing examples of high-quality images.

average the fMRI timeseries for each voxel

First, run-wise mean images. These are simply the average of the functional timeseries for each voxel, calculated for each run and person separately. I calculated these on the images after preprocessing, but before voxelwise normalization. This evaluates the quality of the images as they "arrive" for MVPA; in this case, after motion correction, slice-time correction, and spatial normalization to an anatomic template. We thus expect the slices to look fairly similar in all people (because of the normalization), basically like fuzzy anatomical images.
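As a sketch of the calculation (assuming the RNifti package and hypothetical file names; any NIfTI-reading package would do):

 library(RNifti);   # for readNifti() and writeNifti()
 
 run.img <- readNifti("sub01_run1_preprocessed.nii.gz");   # 4d array: x, y, z, time
 mean.img <- apply(run.img, 1:3, mean);    # average each voxel's timeseries over the run
 writeNifti(mean.img, "sub01_run1_mean.nii.gz", template="sub01_run1_preprocessed.nii.gz");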

The images below show two slices (one coronal, one axial) of the mean fMRI image for four runs in six people from two datasets (people in rows, runs in columns). The first image shows a dataset with fairly decent spatial normalization, the second, not-so-good spatial normalization (the images should enlarge if clicked).

A dataset with fairly decent spatial normalization. While the images vary a bit (e.g., subject 17 is rather "pointy" compared to the others), they are all oriented correctly and capture the same brain structures.
A dataset with not-so-good spatial normalization. Note that each person is pretty consistently normalized within themselves (i.e., the images in the four runs within each person are similar), but the people vary quite a bit from each other. For example, sub7's brain looks "short", and when viewed in 3d, is clearly tilted.

In my estimation, analysis should not proceed on this dataset: spatial normalization needs to be improved, or analysis should be performed in subject (native) space.

standard deviation of the fMRI timeseries for each voxel

As described by practiCal fMRI, images of the standard deviation of the fMRI timeseries are useful for spotting motion or other artifacts; see his post for more details. Basically, dimmer is better for these images, though we still want to be able to see some brain structure. As with the mean images, these are simply the standard deviation of each voxel's timeseries, calculated separately within each run, using the post-preprocessing functional images. All image voxels were included, not just those in the brain mask, to allow spotting of blurry edges and ghosts.
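The calculation is the same as for the mean images, just with the standard deviation; a sketch, again assuming RNifti and hypothetical file names (the slice index and zlim values are arbitrary, but fixing zlim is what makes the brightness comparable across runs and people):

 library(RNifti);
 
 run.img <- readNifti("sub01_run1_preprocessed.nii.gz");   # 4d array: x, y, z, time
 sd.img <- apply(run.img, 1:3, sd);    # standard deviation of each voxel's timeseries
 writeNifti(sd.img, "sub01_run1_sd.nii.gz", template="sub01_run1_preprocessed.nii.gz");
 
 # plot one axial slice, with the color scaling (zlim) fixed so the same values
 # map to the same grey in every run and person
 plot.img <- sd.img;
 plot.img[plot.img > 150] <- 150;   # truncate for display so the fixed zlim covers all values
 image(plot.img[,,30], col=gray(0:64/64), zlim=c(0,150), useRaster=TRUE,
       xaxt='n', yaxt='n', main="sub01, run1, sd");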

The top part of this image shows the standard deviations. This follows the convention of the mean images: subjects in the rows, four runs in the columns, with coronal slices first, then axial slices, both of the same 3d image. All images have the same color scaling, so brightness can be compared between runs and people.

Subject 34 is the best of these three people: the images for the four runs are pretty equally dark, but the brain outline and structure are visible. In subject 37, the first and second runs are much brighter and blurrier than the third and fourth; the first run in subject 36 is also brighter and blurrier than the others. These runs had more movement artifacts, reflected here as higher standard deviation.

The bottom part of this image shows the accuracy from a positive control searchlight analysis in these same three people. In this case, the control analysis was classifying whether a particular image was from a cue or target/response trial segment, and we expect visual and motor areas to classify. (If you're curious, it was leave-one-run-out cross-validation within each person, linear SVM, c=1, 3-voxel radius searchlights, two balanced classes.) The overlay is color-scaled to show voxels with an accuracy of 0.6 as red and 1 (perfect) as brightest yellow; voxels with accuracy less than 0.6 are not shown (chance = 0.5). (I used knitr to make all the images in this post; see this demo for similar code.)

The accuracies and standard deviation are consistent in these images: sub34 has the lowest standard deviation (and highest temporal SNR, though this isn't shown here) and highest classification accuracy; sub36 and sub37 have fewer high-classifying searchlights. The relationship between image quality in these diagnostic tests and control classification accuracy is not always this clear, but I have seen it pretty often, and it should exist; by definition, the control classification should succeed in people with decent image quality. If it does not, the dataset should be checked for errors, such as mislabeled event timing files.

There's no magic threshold for image quality, nor perfect strategy for recovering signal from high-movement runs. But I would be very hesitant to continue analyzing a person without clear signal in the control analysis, particularly if they stand out in the mean and standard deviation images.

Friday, September 25, 2015

How to do cross-validation? Have to partition on the runs?

People sometimes ask why many MVPA studies use leave-one-run-out cross-validation (partitioning on the runs), and if it's valid to set up the cross-validation in other ways, such as leaving out individual trials. Short answer: partitioning on the runs is a sensible default for single-subject analyses, but (as usual with fMRI) other methods are also valid in certain situations.

As I've written before, cross-validation is not a statistical test itself, but rather the technique by which we get the statistic we want to test (see this for a bit more on doing statistical testing). My default example statistic is accuracy from a linear SVM, but these issues apply to other statistics and algorithms as well.

Why is partitioning on the runs a sensible default? A big reason is that scanner runs (also called "sessions") are natural breaks in fMRI datasets: a run is the period in which the scanner starts, takes a bunch of images at intervals of one TR, then stops. Images from within a single scanner run are intrinsically more related to each other than to images from different runs, partly because they were collected closer together in time (so the subject's mental state and physical position should be more similar; single hemodynamic responses can last more than one TR), and because many software packages perform some corrections on each run individually (e.g., SPM motion correction usually aligns each volume to the first image within its run). The fundamental nature of temporal dependency is reflected in the terminology of many analysis programs, such as "chunks" in pyMVPA and "bricks" in afni.

So, since the fMRI data within each person is structured into runs, it makes sense to partition on the runs. It's not necessary to leave just one run out, but if you're not leaving some multiple number of runs out for the cross-validation, you should verify that your (not run-based) scheme is ok (more on this below). By "some multiple number of runs" I mean leaving out two, three, or more runs on each cross-validation fold. For example, if your dataset has 20 short runs, you will likely have more stable (and significant) performance if you leave out two or four runs each fold rather than just one (e.g., Permutation Tests for Classification. AI Memo 2003-019).

Another common reason to partition with multiple runs is if you're doing some sort of cross-classification, like training on all examples from one scanning day and testing on all examples from another scanning day. With cross-classification there might not be full cross-validation, such as if training on "novel" runs and testing on "practiced" runs is of interest for an experiment, but training on "practiced" runs and testing on "novel" runs is not. There's no problem with these cross-classification schemes, so long as one set of runs is used for training, and another (non-intersecting) set of runs for testing, in an experiment-sensible (and ideally a priori) manner.
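As a concrete sketch of leaving "some multiple number of runs" out, here is one way to assign a hypothetical 20-run dataset to five cross-validation folds of four runs apiece:

 run.ids <- 1:20;                 # 20 short runs
 fold.ids <- rep(1:5, each=4);    # five folds, four runs per fold (runs 1-4, 5-8, ...)
 
 for (fold in 1:5) {
   test.runs <- run.ids[fold.ids == fold];    # e.g., runs 1 to 4 on the first fold
   train.runs <- run.ids[fold.ids != fold];   # the other 16 runs
   # ... fit the classifier on the examples from train.runs, test on those from test.runs
 }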

But it can sometimes be fine to do cross-validation with smaller bits of the dataset than the run, such as leaving out individual trials or blocks. The underlying criterion is the same, though: are there sensible reasons to think that the units you want to use for cross-validation are fairly independent? If your events are separated by less than 10 seconds or so (i.e., the duration of the HRF), it's hard to argue that it'd be ok to put one event in the training set and the event after it into the testing set, since there will be so much "smearing" of information from the first event into the second (and doubly so if the event ordering is not perfectly balanced). But if your events are fairly well-spaced (e.g., a block design with 10 second action blocks separated by 20 seconds of rest), then yes, cross-validating on the blocks might be sensible, particularly if preprocessing is aimed at isolating the signal from individual blocks.

If you think that partitioning on trials or blocks, rather than full runs, is appropriate for your particular dataset and analysis, I suggest you include an explanation of why you think it's ok (e.g., because of the event timing), as well as a demonstration that your partitioning scheme did not positively bias the results. What demonstration is convincing necessarily varies with the dataset and hypotheses, but could include "random runs" and control analyses.

I described "random runs" in The impact of certain methodological choices ... (2011); the basic idea is to cross-validate on the runs, but then randomly assign the trials to new "runs", and do the analysis again (i.e., so that trials from the same run are mixed into the training and testing sets). If partitioning on the random runs and the real runs produces similar results, then mixing the trials didn't artificially increase performance, and so partitioning other than on the runs (like leaving out individual blocks) is probably ok.

Control analyses are classifications that shouldn't work, such as using visual cortex to classify an auditory stimulus. If such negative control analyses aren't negative (can classify) in a particular cross-validation scheme, then that cross-validation scheme shouldn't be trusted (the training and testing sets are not independent, or some other error is occurring). But if the target analysis (e.g., classifying the auditory stimuli with auditory cortex) is successful while the control analysis (e.g., classifying the auditory stimuli with visual cortex) is not, and both use the same cross-validation scheme, then the partitioning was probably ok.
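Here's a sketch of the "random runs" shuffling at the level of the example labels (the classification itself is omitted, and the trial counts are made up): each trial keeps its class, but is assigned to a new pseudo-run, keeping the same number of trials of each class in each pseudo-run as in the real runs.

 # label table for one person: 8 runs, 4 trials of each class per run (counts made up)
 trial.info <- data.frame(run=rep(1:8, each=8),
                          class=rep(c("face","place"), times=32));
 
 # assign each trial to a "random run", shuffling within class so balance is kept
 set.seed(1);
 trial.info$random.run <- NA;
 for (cls in unique(trial.info$class)) {
   rows <- which(trial.info$class == cls);
   trial.info$random.run[rows] <- sample(trial.info$run[rows]);
 }
 
 # then repeat the leave-one-run-out analysis, partitioning on random.run instead of run;
 # similar accuracy with the real and random runs suggests the mixing didn't inflate performance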

This post has focused on cross-validation when each person is analyzed separately. However, for completeness, I want to mention that another sensible way to partition fMRI data is on the people, such as with leave-one-subject-out cross-validation. Subjects are generally independent (and so suitable "chunks" for cross-validation), unless the study includes an unusual population (e.g., identical twins) or design (e.g., social interactions).

Friday, September 11, 2015

a nice "Primer on Pattern-Based Approaches to fMRI"

John-Dylan Haynes has a new MVPA review article out in Neuron, "A Primer on Pattern-Based Approaches to fMRI" (full citation below). I say "new" because he (with Rees) wrote one of the first MVPA overview articles ("Decoding mental states from brain activity in humans") back in 2006. As with the earlier article, this is a good introduction to some of the main methods and issues; I'll highlight a few things here, in no particular order.

At left is part of Figure 4, which nicely illustrates four different "temporal selection" options for event-related designs. The red and green indicate the two different classes of stimuli (cats and dogs), with the black line the fMRI signal in a single voxel across time.

I often refer to these as "temporal compression" options, but like "temporal selection" even better: we are "selecting" how to represent the temporal aspect of the data, and "compression" isn't necessarily involved.

John-Dylan Haynes (quite properly, in my opinion) recommends permutation tests (over binomial and t-tests) for estimating significance, noting that one of their (many) benefits is in detecting biases or errors in the analysis procedure.
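For concreteness, a label-permutation test for a single person's accuracy might look something like this sketch (e1071 for the linear SVM; the data are random noise and the fold structure is simplified relative to a real fMRI dataset, where the labels should usually be permuted within runs):

 library(e1071);   # for svm()
 
 # toy data: 40 examples (20 per class) of 50 "voxels" of pure noise, so true accuracy is ~chance
 set.seed(4);
 dat <- matrix(rnorm(40*50), nrow=40);
 labels <- rep(c("a","b"), each=20);
 folds <- rep(1:4, times=10);    # stand-in for run assignments, balanced over the classes
 
 # leave-one-"run"-out classification accuracy for a given labeling
 get.acc <- function(lbls) {
   accs <- rep(NA, 4);
   for (f in 1:4) {
     fit <- svm(x=dat[folds != f,], y=as.factor(lbls[folds != f]), kernel="linear", cost=1, scale=FALSE);
     accs[f] <- mean(predict(fit, dat[folds == f,]) == lbls[folds == f]);
   }
   mean(accs);
 }
 
 real.acc <- get.acc(labels);                            # accuracy with the true labels
 perm.accs <- replicate(1000, get.acc(sample(labels)));  # null distribution: relabel and repeat
 (sum(perm.accs >= real.acc) + 1) / (length(perm.accs) + 1);   # permutation p-value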

Overall, I agree with his discussion of spatial selection methods (Figure 3 is a nice summary), and was intrigued by the mention of wavelet pyramids as a form of spatial filtering. I like the idea of spatial filtering to quantify the spatial scale at which information is present, but haven't run across a straightforward implementation (other than searchlight analysis); wavelet pyramids might need more investigation.

I also appreciate his caution against trying too many different classifiers and analysis procedures on the same dataset, pointing out that this can cause "circularity and overfitting". This problem is also sometimes referred to as having too many experimenter degrees of freedom: if you try enough versions of an analysis, you can probably find one that's "significant." His advice to use nested cross-validation strategies (e.g., tuning the analysis parameters on a subset of the subjects) is solid, if sometimes difficult in practice because of the limited number of available subjects. The advice to use an analysis that "substantially decreases the number of free parameters" is also solid, if somewhat unsatisfying: I often advise people to use a linear SVM with c=1 as the default analysis. While tuning parameters and testing other classifiers could conceivably lead to a higher accuracy, the risk of false positives from exploiting experimenter degrees of freedom is very real.

I also like his inclusion of RSA and encoding methods in this "primer". Too often we think of classifier-based MVPA, RSA, and encoding methods as unrelated techniques, but it's probably more accurate to think of them as "cousins," or falling along a continuum.

It's clear by now that I generally agree with his framing and discussion of the issues, though I do have a few minor quibbles, such as his description of MVPA methods as aimed at directly studying "the link between mental representations and corresponding multivoxel fMRI activity patterns". I'd assert that MVPA is also useful as a proxy for regional information content, even without explicit investigation of cognitive encoding patterns. But these are minor differences; I encourage you to take a look at this solid review article.


Haynes, J. D. (2015). A Primer on Pattern-Based Approaches to fMRI: Principles, Pitfalls, and Perspectives. Neuron, 87(2), 257-270. PMID: 26182413