The fastest darn Fastq decoupling procedure I ever done seen

There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…

The Fastq format is one of the most pervasive formats in biology and bioinformatics as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair in the same position in two paired files—that is, the first record in file 1 and the first record in file 2 is a pair, the second record in file 1 and the second record in file 2 is a pair, and so on.

I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastqc files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.

The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.

I’ve blogged before about how amazingly useful the paste command can be for working with Fastq data. In fact, combining paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) \
    | tr '\t' '\n' \
    > reads-int.fastq

However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses paste and process substitution, but it also takes advantage of the oft-overlooked tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).

paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq

This procedure did the conversion in 2.5 minutes, while I terminated the incomplete procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.

*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.

‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its largest selling point for me. And while shell tools clearly outperformed khmer for a simple use cases, khmer also parses other data formats in both compressed and uncompressed, and probably supports multiline Fastq and broken paired Fastq. And then of course, khmer will integrate very nicely with environments like IPython notebook, whereas integration of shell commands could potentially be a bit more kludgy.



    • Daniel Standage

      As pleased as I am with this discovery, I’m not sure how realistic it is to integrate with khmer. It’s so fast primarily because it’s so stripped down and doesn’t support reading gzip/bzip2 files, Fasta files, and so on: features that inevitably will be useful to many users.

  1. Alex Stark

    Hi Daniel,
    cool trick, thanks. I’m not sure what you mean by “doesn’t support reading gzip files” though. While to my knowledge you cannot nest the process substitutions (), you can still read and write gzip files by piping from/to g(un)zip within the process substitutions. The two conversions above would be:

    # 1. interleave
    paste <(gunzip -c reads-1.fastq.gz | paste - - - -) <(gunzip -c reads-2.fastq.gz | paste - - - -) | tr '\t' '\n' | gzip -c > reads-int.fastq.gz
    # 2. decouple
    gunzip -c reads-int.fastq.gz | paste - - - - - - - - | tee >(cut -f 1-4 | tr '\t' '\n' | gzip > reads-1.fastq.gz) | cut -f 5-8 | tr '\t' '\n' | gzip -c > reads-2.fastq.gz
    # (you can also use "zcat" instead of "gunzip -c")

    I think this should work – I try to avoid reading/writing unzipped fastq files at all costs (and am a big fan of one line bash/awk tricks).

    • Daniel Standage

      You are correct. 🙂 What I meant was that the exact command did not work for compressed files, not that the general approach did not work. In fact, the gist I link to includes both cases. I included only the uncompressed case for sake of simplicity.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s