Shuffling columns of a tabular file with cut and paste

I’ve been working on an RNA-Seq analysis recently, and discussing the results with my advisor. At his suggestion, I’ve decided to get a baseline proportion of transcripts reported as differentially expressed by shuffling the condition labels and re-running the analysis. This type of approach, a permutation test, is quite common in statistics.

The input to the differential expression software I am using (and indeed most others) is a tabular plain text file where the first column contains a molecule label/ID and each subsequent column contains that molecule’s expression level for a given sample. Since the software I’m using requires all samples corresponding to a condition to be adjacent, shuffling the condition label means shuffling the columns of this table.

I’ve talked about the cut and paste commands before (here and here), but I’ve perhaps missed their canonical usage until now. Here are the steps that I used to create shuffled files for the permutation test.

  1. Create a new file for each column of the table using the cut command: cut -f 1 for the molecule IDs, cut -f 2 for the expression levels for the first sample, cut -f 3 for the expression levels for the second sample, etc.
  2. Use the paste command to put the columns back together in a new order.
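
As a minimal sketch of the two steps above, assuming a tab-separated table with one ID column and four sample columns: the file names (expr.tsv, col1.txt, shuffled.tsv) and the values are made up for illustration and are not from my actual data set.

```shell
# Hypothetical dummy table: molecule IDs plus four samples (two per condition).
printf 'id\tA1\tA2\tB1\tB2\n' >  expr.tsv
printf 'g1\t10\t12\t30\t33\n' >> expr.tsv
printf 'g2\t5\t6\t1\t2\n'     >> expr.tsv

# Step 1: write each column to its own file.
# cut splits on tabs by default, so no -d option is needed here.
for i in 1 2 3 4 5; do
    cut -f "$i" expr.tsv > "col$i.txt"
done

# Step 2: paste the columns back together in a new order.
# The ID column stays first; the sample columns are rearranged by hand.
paste col1.txt col4.txt col2.txt col5.txt col3.txt > shuffled.tsv

cat shuffled.tsv
```

Since paste joins its inputs with tabs by default, the result has the same format as the input, and each permutation only requires a different order of arguments to paste.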

I created an asciicast demonstrating this on a small dummy data set: see http://asciinema.org/a/5714.
