There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…
The Fastq format is one of the most pervasive formats in biology and bioinformatics as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair in the same position in two paired files—that is, the first record in file 1 and the first record in file 2 is a pair, the second record in file 1 and the second record in file 2 is a pair, and so on.
I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastqc files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.
The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.
I’ve blogged before about how amazingly useful the
paste command can be for working with Fastq data. In fact, combining
paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!
paste <(paste - - - - < reads-1.fastq) \ <(paste - - - - < reads-2.fastq) \ | tr '\t' '\n' \ > reads-int.fastq
However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the
split-paired-reads.py script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses
paste and process substitution, but it also takes advantage of the oft-overlooked
tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).
paste - - - - - - - - < reads-int.fastq \ | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \ | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
This procedure did the conversion in 2.5 minutes, while I terminated the incomplete
split-paired-reads.py procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.
*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.
‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its largest selling point for me. And while shell tools clearly outperformed khmer for a simple use cases, khmer also parses other data formats in both compressed and uncompressed, and probably supports multiline Fastq and broken paired Fastq. And then of course, khmer will integrate very nicely with environments like IPython notebook, whereas integration of shell commands could potentially be a bit more kludgy.