In next-generation sequence data, duplicates of non-biological origin (such as PCR duplicates) are fairly common and can complicate tasks such as assembly and expression quantification, and jeopardize the accuracy of the results. If reads are sampled at random from a genome, it’s very unlikely that two reads will be sampled from exactly the same position. For paired-end data especially, there is essentially no other explanation for two read pairs being identical other than some kind of amplification or other technical artifact. Obviously, it’s important to scan our NGS data for duplicates and remove them as part of our preliminary quality control. This is a pretty simple task, so software solutions should abound, right? Right!?!?
Well, it’s very possible that I’m missing something, but I’ve only been able to find a single tool for removal of duplicate reads (FastUniq, described in this PLoS ONE paper). I did see tools for removal of duplicate read alignments, but that’s not going to help much if you’re working on a de novo assembly. I went ahead and downloaded FastUniq, ran it on my data, and verified the results. The user interface was a bit kludgy, but boy have I seen worse!! At least it was well documented and worked as advertised.
Later, I was working with two different annotated genome assemblies for the same organism, which I had parsed into individual loci based on gene content. I was interested in determining how many loci were strictly identical between the two assemblies. As I started writing code to do this task, I realized how similar it was to the simple task of removing duplicate reads. In duplicate read removal, we store read sequences in memory, and then whenever we encounter a sequence we’ve already seen before we discard it. Now, for identifying identical loci, the task was similar, but instead I wanted to report “duplicates” instead of discarding them1.
After realizing this, I took a few minutes to extend and generalize my code so that it could handle both use cases. I then tested it on a data set I had just processed with FastUniq. My code ran in 29 minutes, whereas FastUniq had required 69 minutes. This does not include the time required to convert between paired and interleaved data (FastUniq requires paired files, my code requires interleaved), nor does it include the time required to shuffle the FastUniq output2. So some simple of-the-cuff Python code was able to outperform the only tool I could find for removing duplicates de novo.
I wrapped up my code as a Python package and posted it on Github. It still needs a bit of polishing, which I may or may not ever get around to doing.
1Because the latter approach involves storing longer genomic sequences, I wrote the code to store SHA1 hashes of each sequence, rather than the sequence itself. This drastically reduced memory consumption without an appreciable increase in runtime.
2For some reason that I did not understand from my very cursory skimming of the FastUniq paper, the FastUniq tool includes a sorting step. Therefore, the first read of the output is the read with the most consecutive adenines (As) at the beginning of the read. Looking at the output set off all kinds of red flags and alarms in my brain. Anyone who has taken a course on data structures or algorithms knows that weird things can happen performance-wise when unsorted data is expected but sorted data is provided (and all NGS analysis tools expect unsorted data). I would not be comfortable using FastUniq’s output directly without first shuffling the data, which only increases the margin of improvement of my Python code over FastUniq!