RTFM: paste

I've been using this blog to document my quest for a better understanding of basic UNIX commands and their utility in data processing (here, here, and here). I recently came across another command that has already proven extremely useful.

paste

If you read the manual for the paste command or try the default usage via the command line, you will surely be underwhelmed. This is from the man page…

Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.

…and this is the basic usage.

[standage@lappy ~] cat bogus.data 
dude
sweet
awesome
cool
[standage@lappy ~] paste - < bogus.data 
dude
sweet
awesome
cool
[standage@lappy ~]

Looks like just another cat command, huh? Well, note the dash following the paste command. This is commonly used to indicate that a program/command should read from standard input rather than from a file. The paste magic begins when you start adding more dashes: each dash is treated as a separate input, but they all refer to the same standard input, so paste takes lines from it in turn.

[standage@lappy ~] paste - - < bogus.data 
dude	sweet
awesome	cool
[standage@lappy ~]

This time paste read two lines at a time and printed each pair on a single line, separated by a tab. Each additional dash adds one more input line to every line of output. This is a simple command, but it can be extremely useful for processing data files where a complete record is stored on a fixed number of lines (such as the Fastq format, where each sequence record spans 4 lines).
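To make the fixed-number-of-lines idea concrete, here is a minimal sketch that flattens each 4-line Fastq record onto a single row. The function names (fastq_to_rows, fastq_sequences, toy_fastq) and the toy reads are made up for illustration, and it assumes every record spans exactly 4 lines and no field contains a literal tab.

```shell
# Hypothetical helper: join every 4 consecutive stdin lines into one tab-delimited row.
fastq_to_rows() {
  paste - - - -
}

# Hypothetical helper: print only the sequence (2nd of the 4 lines) of each record.
fastq_sequences() {
  fastq_to_rows | cut -f 2
}

# Toy input: two made-up records, not real data.
toy_fastq() {
  printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' 'HHHH'
}

toy_fastq | fastq_to_rows     # one row of four tab-separated fields per record
toy_fastq | fastq_sequences   # just the sequence line of each record
```

Once a record lives on one line, line-oriented tools such as cut, sort, and grep can operate on whole records instead of individual lines.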

Recently, my advisor had some interleaved Fastq files but wanted to run a software package that expected mate pairs to be placed in separate files. Before searching for a program to split the files or writing one to do it himself, he sent me a quick note asking whether we had already installed any programs on our server that would do this. I responded and suggested he try the following command.

paste - - - - - - - - < interleaved.fastq | \
perl -ne '@v = split(/\t/); printf("%s\n%s\n%s\n%s\n", @v[0..3]); printf(STDERR "%s\n%s\n%s\n%s", @v[4..7]);' \
> 1.fq 2> 2.fq

In this one-liner (well, I spread it over 3 lines for readability), the paste command reads in 8 lines (2 Fastq records corresponding to a mate pair) and combines those 8 lines into a single line with 8 tab-delimited values. The Perl command then splits that line on tabs, writes the first 4 values to standard output and the last 4 values to standard error. The eighth value still carries the newline from the end of the input line, which is why the second printf format has no trailing \n. Redirect stdout and stderr to different files, and you've got your paired Fastq files!
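The same split can also be sketched without Perl, using cut to pick the columns and tr to turn the tabs back into newlines. This is an alternative under stated assumptions, not the command above: split_mates is a hypothetical name, the input is read twice (so it has to be a regular file rather than a stream), records are assumed to be exactly 4 lines each, and no header may contain a literal tab.

```shell
# Hypothetical helper: split an interleaved Fastq file into two mate files.
#   $1: interleaved input file
#   $2: output file for the first mate of each pair
#   $3: output file for the second mate of each pair
split_mates() {
  # Columns 1-4 of each 8-column row are the first mate's record.
  paste - - - - - - - - < "$1" | cut -f 1-4 | tr '\t' '\n' > "$2"
  # Columns 5-8 are the second mate's record.
  paste - - - - - - - - < "$1" | cut -f 5-8 | tr '\t' '\n' > "$3"
}

# Demo on a made-up mate pair (not real data).
demo_dir=$(mktemp -d)
printf '%s\n' '@pair1/1' 'ACGT' '+' 'IIII' '@pair1/2' 'TTGA' '+' 'HHHH' > "$demo_dir/interleaved.fastq"
split_mates "$demo_dir/interleaved.fastq" "$demo_dir/1.fq" "$demo_dir/2.fq"
cat "$demo_dir/1.fq"
cat "$demo_dir/2.fq"
rm -r "$demo_dir"
```

Reading the file twice costs an extra pass compared with the Perl version, but it avoids borrowing stderr as a second output channel, so genuine error messages stay out of the data.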

Credit for introducing me to this command goes to this post, which has a couple of additional examples.
