UNIX sort: sorting with both numeric and non-numeric keys

Tabular data is the bread and butter of biology and bioinformatics. Comma- or tab-separated values are easy to write, easy to read, and most importantly they play nice with UNIX shell tools. The UNIX sort command is particularly useful for (you guessed it) sorting tabular data.

Consider the following example, a dummy data set with alphanumeric data in the first and second columns, and numeric data in the third column.

isoform10	False	10.89
isoform1	True	21.98
isoform10	True	3.55
isoform7	True	0.67
isoform9	False	0.99
isoform7	False	0.66
isoform3	True	19.01
isoform7	True	2.48
isoform12	False	11.53
isoform4	True	1.73
isoform12	True	4.30
isoform3	False	0.25

We can sort this data by the first column like so. Note that this does a lexicographical sort, which places isoform10 before isoform3, and so on.

[standage@lappy ~] sort -k1,1 mydata.tsv 
isoform1	True	21.98
isoform10	False	10.89
isoform10	True	3.55
isoform12	False	11.53
isoform12	True	4.30
isoform3	False	0.25
isoform3	True	19.01
isoform4	True	1.73
isoform7	False	0.66
isoform7	True	0.67
isoform7	True	2.48
isoform9	False	0.99
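A note on the -k1,1 notation: the two numbers are the first and last field of the key. A bare -k1 would instead run from the first field to the end of the line. The sketch below makes the difference visible with the -u flag, which keeps one line per distinct key. The file name sample.tsv and its contents are made up for illustration, and -t is used to state the tab separator explicitly.

```shell
# Small hypothetical sample: two isoform7 lines and one isoform3 line.
printf 'isoform7\tTrue\t0.67\nisoform7\tFalse\t0.66\nisoform3\tTrue\t19.01\n' > sample.tsv
tab=$(printf '\t')

# Key is column 1 only, so -u keeps one line per isoform name.
first_field=$(sort -t "$tab" -u -k1,1 sample.tsv | cut -f1)

# Key runs from column 1 to the end of the line, so all three
# (distinct) lines survive -u.
to_line_end=$(sort -t "$tab" -u -k1 sample.tsv | cut -f1)

printf 'with -k1,1:\n%s\n' "$first_field"
printf 'with -k1:\n%s\n' "$to_line_end"
```

With -k1,1 the output lists isoform3 and isoform7 once each; with -k1 isoform7 appears twice, because the key now includes the other columns.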

Sorting by the third column works the same way, although we get wonky results (i.e. a lexicographical sort instead of a numeric sort) if we don’t specify that it’s numeric data.

[standage@lappy ~] sort -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform7	True	2.48
isoform1	True	21.98
isoform10	True	3.55
isoform12	True	4.30

We need to use the -n flag to indicate that we want a numeric sort.

[standage@lappy ~] sort -n -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform7	True	2.48
isoform10	True	3.55
isoform12	True	4.30
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform1	True	21.98

What if we now want to sort by two columns? The sort command accepts multiple keys, but a global -n flag applies to all of them, leading to more wonky results.

[standage@lappy ~] sort -n -k1,1 -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform7	True	2.48
isoform10	True	3.55
isoform12	True	4.30
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform1	True	21.98
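Why does this output look exactly like the plain -n -k3,3 result? Under a numeric comparison, a field with no leading number is treated as zero, so every isoform name ties on the first key and only the third column decides the order. A quick check, sketched on a small made-up file (sample.tsv is a hypothetical name):

```shell
# Small hypothetical sample with distinct values in column 3.
printf 'isoform10\tFalse\t10.89\nisoform1\tTrue\t21.98\nisoform7\tTrue\t0.67\nisoform3\tFalse\t0.25\n' > sample.tsv

# Global -n makes both keys numeric; column 1 has no leading digits,
# so every line ties on the first key.
global_n=$(sort -n -k1,1 -k3,3 sample.tsv)

# Numeric sort on column 3 alone.
third_only=$(sort -n -k3,3 sample.tsv)

if [ "$global_n" = "$third_only" ]; then
    echo "identical: the first key had no effect"
fi
```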

Man pages to the rescue! The sort program’s ordering flags (such as -n for numeric sort or -r for reverse sort) can be added to the end of a key declaration so that they apply only to that key. Therefore, if we so desired, we could apply a reverse lexicographical sort to the first column and a numerical sort to the third column in this fashion.

[standage@lappy ~] sort -k1,1r -k3,3n mydata.tsv 
isoform9	False	0.99
isoform7	False	0.66
isoform7	True	0.67
isoform7	True	2.48
isoform4	True	1.73
isoform3	False	0.25
isoform3	True	19.01
isoform12	True	4.30
isoform12	False	11.53
isoform10	True	3.55
isoform10	False	10.89
isoform1	True	21.98
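Per-key flags also offer one way to get the isoforms into true numeric order. A key position can include a character offset: -k1.8,1n starts the key at the eighth character of the first field (just past the seven letters of "isoform") and compares the rest as a number. The sketch below assumes every name carries that exact seven-letter prefix; the file name sample.tsv and its contents are made up for illustration, and -t states the tab separator explicitly so the offset is counted from the start of the field.

```shell
# Small hypothetical sample.
printf 'isoform10\tFalse\t10.89\nisoform3\tFalse\t0.25\nisoform10\tTrue\t3.55\nisoform3\tTrue\t19.01\nisoform7\tTrue\t2.48\n' > sample.tsv
tab=$(printf '\t')

# Key 1: characters 8..end of field 1, compared numerically (3 < 7 < 10).
# Key 2: field 3, numeric and reversed (largest value first per isoform).
natural=$(sort -t "$tab" -k1.8,1n -k3,3nr sample.tsv)
printf '%s\n' "$natural"
```

This lists isoform3 (19.01, then 0.25), then isoform7, then isoform10 (10.89, then 3.55).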

That’s it!
