Friday, August 22, 2014

quick recommendation: pre-calculate permutation test schemes

A quick recommendation: I strongly suggest pre-calculating the relabeling schemes before running a permutation test. In other words, before running the code that does all the calculations needed to generate the null distribution, determine which relabeling will be used for each iteration of the permutation test, and store these new labels so that they can be read back in later. To be clear, I think the only alternative to pre-calculating the relabelings is to generate them at run time, such as by randomly resampling a set of labels during each iteration of the permutation test; that's not what I'm recommending here.
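As a rough illustration of what "pre-calculate and store" could look like, here is a minimal Python sketch. The function names, the CSV storage format, and the decision to leave the true labeling out of the stored set are all my assumptions for the example, not a prescribed method; adapt them to your own design.

```python
import csv
import random
from collections import Counter
from math import factorial


def make_relabelings(labels, n_perms, seed=0):
    """Return n_perms distinct shuffles of labels (hypothetical helper).

    Assumption: the true labeling itself is excluded from the stored set.
    """
    # Count the distinct arrangements so the loop below cannot run forever.
    n_distinct = factorial(len(labels))
    for count in Counter(labels).values():
        n_distinct //= factorial(count)
    if n_perms > n_distinct - 1:
        raise ValueError("more relabelings requested than exist")

    rng = random.Random(seed)
    seen = {tuple(labels)}  # treat the true labeling as already used
    schemes = []
    while len(schemes) < n_perms:
        candidate = list(labels)
        rng.shuffle(candidate)
        key = tuple(candidate)
        if key not in seen:  # skip duplicates
            seen.add(key)
            schemes.append(candidate)
    return schemes


def save_relabelings(schemes, path):
    """Write one relabeling per row so the file is easy to inspect by eye."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(schemes)


def load_relabelings(path):
    """Read the stored relabelings back; labels come back as strings."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]
```

The permutation test itself then only reads rows from the stored file, so it never needs to call a random number generator.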

There are several reasons I think this is a good principle to follow for any "serious" permutation test (e.g. one that might end up in a publication):

Safety and reproducibility. It's a lot easier to confirm that the relabeling scheme is operating as expected when it can be checked outside of debug mode/run time. At minimum, I check that there are no duplicate entries, and that the randomization looks reasonable (e.g. are the labels chosen at approximately equal frequencies?). Having the relabelings stored also means that the same permutation test can be run at a later time, even if the software or machines have changed (built-in randomization functions are not always guaranteed to produce the same output on different machines or versions of the software).
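The checks mentioned above can be sketched in a few lines of Python. This is only an illustration: `check_relabelings` is a hypothetical helper, and it assumes each relabeling is a list of labels of the same length as the true labeling.

```python
from collections import Counter


def check_relabelings(schemes, labels):
    """Sanity-check stored relabelings; return per-position label counts."""
    # 1. No relabeling should appear twice.
    as_tuples = [tuple(s) for s in schemes]
    if len(set(as_tuples)) != len(as_tuples):
        raise ValueError("duplicate relabelings found")

    # 2. Each relabeling should reuse exactly the original set of labels.
    expected = Counter(labels)
    for i, scheme in enumerate(schemes):
        if Counter(scheme) != expected:
            raise ValueError(f"relabeling {i} does not match the label counts")

    # 3. For each position, count how often each label was assigned there.
    #    With a reasonable randomization these counts should be roughly
    #    proportional to the overall label frequencies.
    return [Counter(column) for column in zip(*schemes)]
```

Whether the per-position counts look "reasonable" is still a judgment call, so the function returns them for inspection rather than applying a threshold.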

Ease of separating the jobs. I am fortunate to have access to an excellent supercomputing cluster. Since my permutations are pre-calculated, I can run a permutation test quickly by sending many separate, non-interacting jobs to the different cluster computers. For example, I might start one job that runs permutations 1 to 20, another job running permutations 21 to 30, etc. In the past I've tried running jobs like this by setting random seeds, but that was much more bug-prone than explicitly pre-calculating the labelings. Relatedly, if a job crashes for some reason, pre-calculated permutations make it easy to restart right after the last-completed permutation.
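One way to divide stored relabelings among cluster jobs is for each job to work out which rows of the stored file belong to it. The sketch below is an assumption about how that might be organized, not a description of any particular cluster setup: `permutations_for_job` is a hypothetical helper, it uses equal-sized blocks and zero-based row indices, and it takes the already-finished permutations as an argument so a crashed job can pick up where it stopped.

```python
def permutations_for_job(job_index, perms_per_job, n_perms, completed=()):
    """Return the zero-based relabeling indices this job still has to run."""
    start = job_index * perms_per_job
    stop = min(start + perms_per_job, n_perms)  # the last job may be shorter
    done = set(completed)
    return [i for i in range(start, stop) if i not in done]


# Example: 50 stored relabelings, 20 per job.
first_job = permutations_for_job(0, 20, 50)       # rows 0..19
last_job = permutations_for_job(2, 20, 50)        # rows 40..49
# Restarting job 1 after it crashed having finished rows 20 and 21:
resumed = permutations_for_job(1, 20, 50, completed=[20, 21])  # rows 22..39
```

Because every job reads its labels from the same stored file, the jobs never need to coordinate random seeds with one another.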
