# Permutation testing for differential expression analysis

Permutation testing is a common non-parametric approach for assessing how unlikely a particular outcome is. You calculate a statistic of interest from your data, then repeatedly shuffle the sample labels at random and recalculate the statistic each time. The proportion of shuffles that produce a result at least as extreme as the original one estimates how likely that result is to arise by random chance.
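To make the general idea concrete, here is a minimal sketch in Python, assuming two groups of measurements and the absolute difference in group means as the statistic. The function name `permutation_test` and its parameters are my own choices for illustration, not part of any particular package.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the estimated proportion of random relabelings whose
    statistic is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))

    # Pool the values; shuffling the pool is equivalent to shuffling labels.
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never
    # exactly zero (the observed labeling counts as one arrangement).
    return (extreme + 1) / (n_permutations + 1)
```

A small return value means few random relabelings matched the observed difference; a value near 1 means the observed difference is unremarkable.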

I’ve recently been involved in an RNA-Seq analysis, particularly looking at differential expression in paper wasps. With 6 biological replicates each from 2 different castes (queens and workers), I measured expression values and used these to identify genes expressed differentially between the castes. There were a few concerns with the preliminary analyses, so in addition to exploring the data in a bit more depth we also wanted to test the reliability of the analysis I had just done. Enter the permutation test.

The idea here is simple. The first step is to run the differential expression analysis with the correct sample labels (which I had already done). Next, randomly shuffle the sample labels so that some “queen” samples are labeled “worker” and vice versa. Now, run the differential expression analysis again and note the number of genes designated as differentially expressed. Then repeat the label shuffling and re-analysis enough times to build up a distribution of the gene counts you get from random labelings.
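The steps above can be sketched roughly as follows. This is not the code from my script: `count_de_genes` is a hypothetical toy stand-in (a simple t-like statistic with an arbitrary threshold) for whatever differential expression package you actually use, and the data layout (a dict mapping gene names to per-sample values) is assumed for the sake of the example.

```python
import random
from statistics import mean, stdev


def count_de_genes(expression, labels, threshold=4.0):
    """Toy stand-in for a real differential expression analysis.

    expression: dict mapping gene name -> list of values, one per sample,
                in the same order as `labels`.
    Counts genes whose Welch-style t statistic exceeds `threshold`
    in absolute value. A real analysis would call a dedicated package.
    """
    count = 0
    for values in expression.values():
        queens = [v for v, lab in zip(values, labels) if lab == "queen"]
        workers = [v for v, lab in zip(values, labels) if lab == "worker"]
        se = (stdev(queens) ** 2 / len(queens)
              + stdev(workers) ** 2 / len(workers)) ** 0.5
        if se > 0 and abs(mean(queens) - mean(workers)) / se > threshold:
            count += 1
    return count


def permuted_de_counts(expression, labels, n_permutations=100, seed=42):
    """Shuffle the sample labels repeatedly and record the DE gene count."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_permutations):
        shuffled = list(labels)  # copy so the true labels stay intact
        rng.shuffle(shuffled)    # group sizes are preserved by shuffling
        counts.append(count_de_genes(expression, shuffled))
    return counts
```

Note that with 6 samples per caste there are only 924 distinct ways to assign the labels, so a random shuffle will occasionally reproduce the true labeling (or its mirror image). For a small design like this you may prefer to enumerate the labelings rather than sample them.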

The expectation is that if I label some of my “queens” as “workers” and some of my “workers” as “queens”, and then look for consistent queen-vs-worker differences in gene expression, the mislabeled samples will cancel each other out to a large extent and confound the analysis. Due to technical and biological variation, we expect to still identify *some* differentially expressed genes even with shuffled labels, but nowhere near the number we identify with the correct labeling. If random permutations of the sample labels frequently produce a similar number of differentially expressed genes as the correct labeling, there may be issues with the data that need additional attention. This was the case with my analysis, and I wonder how many scientists out there conduct RNA-Seq experiments without doing even this basic kind of quality assessment.
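One way to put a number on "frequently" is to ask what fraction of the shuffled labelings yield at least as many differentially expressed genes as the correct labeling. Here is a small sketch; the function name `empirical_p_value` is my own, and any counts you pass in would come from your own analysis.

```python
def empirical_p_value(observed_count, permuted_counts):
    """Fraction of permutations with at least as many DE genes as observed.

    Uses the common add-one correction so the estimate is never exactly 0.
    """
    as_extreme = sum(1 for c in permuted_counts if c >= observed_count)
    return (as_extreme + 1) / (len(permuted_counts) + 1)
```

A value close to 0 suggests the correct labeling really does stand out from random labelings. A large value is the warning sign described above: shuffled labels are doing about as well as the real ones.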

The script I implemented to facilitate permutation testing for my RNA-Seq analysis is available at https://github.com/standage/dept. The good news is that many of the most popular differential expression analysis packages are freely available from R/Bioconductor and use a similar input format, so if your data are already formatted for differential expression analysis there’s a good chance this script will work out-of-the-box for you.