I’ve been working on an RNA-Seq analysis recently and discussing the results with my advisor. At his suggestion, I’ve decided to estimate a baseline proportion of transcripts reported as differentially expressed by shuffling the condition labels and re-running the analysis. This type of approach, a permutation test, is quite common in statistics.
The input to the differential expression software I am using (and indeed most others) is a tabular plain text file where the first column contains a molecule label/ID and each subsequent column contains that molecule’s expression level for a given sample. Since the software I’m using requires all samples corresponding to a condition to be adjacent, shuffling the condition label means shuffling the columns of this table.
I’ve talked about the cut and paste commands before (here and here), but I’ve perhaps missed their canonical usage until now. Here are the steps that I used to create shuffled files for the permutation test.
- Create a new file for each column of the table using the cut command:
  - cut -f 1 for the molecule IDs,
  - cut -f 2 for the expression levels for the first sample,
  - cut -f 3 for the expression levels for the second sample, etc.
- Use the paste command to put the columns back together in a new order.
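The two steps above can be sketched roughly as follows. This is a minimal illustration, not the exact commands from my analysis: the file names (expr.tsv, col1.txt, shuffled.tsv) and the toy values are made up, the table is assumed to be tab-delimited (the default for both cut and paste), and the new column order is simply picked by hand.

```shell
# Hypothetical toy table: one ID column followed by four sample columns.
printf 'id\ts1\ts2\ts3\ts4\n'     >  expr.tsv
printf 'gene1\t10\t12\t30\t33\n' >> expr.tsv
printf 'gene2\t5\t7\t2\t1\n'     >> expr.tsv

# Step 1: write each column to its own file (col1.txt holds the IDs).
for i in 1 2 3 4 5; do
    cut -f "$i" expr.tsv > "col$i.txt"
done

# Step 2: paste the columns back together in a new order.
# The ID column stays first; the sample columns are rearranged.
paste col1.txt col4.txt col2.txt col5.txt col3.txt > shuffled.tsv

cat shuffled.tsv
# id      s3   s1   s4   s2
# gene1   30   10   33   12
# gene2   2    5    1    7
```

Keeping the ID column in first position matters here, since only the sample columns should move between conditions. For a real permutation test, each shuffled file would use a different ordering of the sample columns.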
I created an asciicast demonstrating this on a small dummy data set: see http://asciinema.org/a/5714.