Category: Tools

Simulating BS-seq experiments with Sherman

The bioinformatics arm of the Babraham Institute produces quite a few software packages, including some like FastQC that have achieved nearly ubiquitous adoption in the academic community. Other offerings from Babraham include Bismark for methylation profiling and Sherman for simulating NGS reads from bisulfite-treated DNA.

In my computational genome science class, there was a student who wanted to measure the accuracy of methylation profiling workflows, and he identified Bismark and Sherman as his tools of choice. He would select a sequence of interest, use Sherman to simulate reads from that sequence, run the methylation calling procedure, and then assess the accuracy of the methylation calls.

There was a slight problem, though: Sherman is random in its conversion of Cs to Ts, and the number of conversions can only be controlled by things like error rate and conversion rate. By default, Sherman provides no way to “protect” a C from conversion the way a methyl group does on actual genomic DNA. So there was no way to assess the accuracy of the methylation profiling workflow since we had no way of indicating/knowing which Cs should be called methylated!

After a bit of chatting, however, we came up with a workaround. In our genomic DNA fasta file, we simply convert any Cs we want to protect from conversion (i.e. methylate in silico) to X. We then run Sherman, which converts Cs to Ts at the specified conversion rate but leaves Xs alone. Finally, after simulating the reads but before the methylation analysis, we change the Xs back to Cs. This seemed to pass our sanity tests, and we contacted the author of Sherman, Felix Krueger (@FelixNmnKrueger), who confirmed that he saw no potential pitfalls with the approach.

I like this little hack, and assuming I ever use it myself in the future, I will probably create a script that does the C -> X conversion from a GFF3 file or a list of “methylation sites” in some other format (the conversion from X back to C is trivial).
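In the meantime, here is a rough sketch of what that could look like with awk, assuming a single-sequence fasta file and a plain text file listing one 1-based position to protect per line (all of the file names below are just placeholders).

# Mask the protected Cs as X before simulation; meth_sites.txt holds one
# 1-based coordinate per line, and the fasta file contains a single sequence.
awk 'NR == FNR { protect[$1]; next }   # load the positions to protect
     /^>/     { print; next }          # pass the fasta header through
     {
       out = ""
       for (i = 1; i <= length($0); i++) {
         pos++
         c = substr($0, i, 1)
         if (c == "C" && pos in protect) c = "X"
         out = out c
       }
       print out
     }' meth_sites.txt genome.fa > genome_masked.fa

# After running Sherman on the masked sequence, restore the Xs to Cs in the
# simulated reads (sequence lines are every 4th line of the fastq file).
awk 'NR % 4 == 2 { gsub(/X/, "C") } { print }' simulated.fastq > restored.fastq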

Frustration with Word and EndNote on Mac

Recently, I’ve been using Microsoft Word and EndNote to write a significant paper for the first time in several years (my advisor and I used LaTeX + git for my first first-author paper). After using this combination on my MacBook for several weeks with no more than the usual amount of frustration one can expect from EndNote and Word, EndNote suddenly stopped working. Every time I tried to insert a reference, it would freeze at the “Formatting Bibliography” step and hang indefinitely. Force-quitting and restarting the programs didn’t help.

After a bit of searching, I came across this thread, which provides a simple solution. The culprit for the unstable behavior seems to be an OS X system process called appleeventsd, and force-quitting the process with this name in Activity Monitor restored normal Word/EndNote behavior. I have done this several times in the last couple of weeks and haven’t seen any adverse side effects, so I will continue to do so until something goes wrong, some OS X 10.8 update provides better stability, or my collaborators magically decide that LaTeX + git + BitBucket is in fact a superior solution after all!
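In principle the same thing should be possible from a terminal, assuming appleeventsd runs as root (hence the sudo) and that launchd restarts it on demand; I have only used the Activity Monitor route myself.

# Untested terminal equivalent of force-quitting appleeventsd in Activity
# Monitor; assumes the daemon runs as root and gets relaunched automatically.
sudo killall appleeventsd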

Compression with pigz

Yes, another post about compression programs. No, data compression is not an area of particular research interest for me, but I’ve been dealing with so much data recently that I’m really looking for better and quicker ways to compress, decompress, and transfer data.

The zlib website hosts the home page for pigz, a parallel implementation of the UNIX program gzip. It compiled very quickly and cleanly out-of-the-box on several platforms (Fedora, Red Hat, OS X) and works just like gzip, bzip2, or any other compression program would on the command line.

# Here is how I would compress a tarball (-p 16 runs pigz with 16 threads)...
tar cf - $DATA/*.fastq | pigz -p 16 > $WD/RawReads.tar.gz
# ...and here is how I would decompress it.
pigz -p 16 -d -c $WD/RawReads.tar.gz | tar xf -

The performance improvement is significant, so initially I was very excited about this finding. However, after a few uses I did encounter a case in which I had issues decompressing a particularly large tarball that had been created with pigz. It appears that the tarball was corrupted somehow during the compression process.
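Going forward, I will probably test each archive right after creating it; pigz has a test mode (-t, like gzip) that decompresses the stream internally and checks its integrity without writing any output. That only verifies the compressed stream, not the tar contents, but it should catch at least some problems before the data are actually needed.

# Verify the compressed stream immediately after creating the archive.
pigz -p 16 -t $WD/RawReads.tar.gz && echo "archive looks intact"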

Definitely a program worth checking out. I’m cautiously optimistic that my troubles have just been a fluke or the result of some mistake on my part, but I’m not betting the farm on it yet.

DELIMINATE: comparison with gzip, bzip2, and zip

Yesterday, I saw an abstract in my RSS feed for a new compression algorithm called DELIMINATE. Two things from the abstract caught my attention. First, it claimed to be lossless: I’ve seen abstracts for lossy compression algorithms before, but honestly I have no interest in introducing any additional error into Illumina data, whose quality is already somewhat crappy. Second, it claimed improvements in both compression speed and compression ratio.

I figured I would give the software a try. The download process was nearly painless (I did have to provide my name and email), and the software “just worked” as advertised out-of-the-box. I was able to confirm that the compression is indeed lossless, and that the compression speed and ratio are indeed superior to gzip, bzip2, and zip (I recorded a terminal session of the tests in this asciicast). The decompression time, however, was quite a bit slower than all the alternatives. That, along with the practical consideration of installing yet another compression program on every computer I use frequently (and forcing my colleagues to do the same), is my only hesitation regarding DELIMINATE.

Trimmomatic for cleaning NGS reads

Trimming and cleaning are important first steps in any analysis involving Illumina short read data. I’m somewhat familiar with the experimental protocols and why trimming, clipping, and filtering need to be done, but honestly I haven’t invested too much effort in familiarizing myself with the details of various implementations. I’ve essentially just used an R script from a colleague thus far for all my NGS preprocessing needs.

This week I came across a link to Trimmomatic, which seems to be a well-implemented and reliable tool. I’m interested in comparing the performance of Trimmomatic against my R script. I’m assuming Trimmomatic will be much faster and produce results of at least the same quality (if not better), in which case I would be happy to scrap the R script for good!
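For reference, a typical paired-end run looks something like the following (adapted from the Trimmomatic documentation; the jar name, read files, and threshold values are placeholders rather than recommendations).

# Paired-end mode: clip TruSeq3 adapters, trim low-quality leading/trailing
# bases, apply a sliding-window quality filter, and drop reads that end up
# shorter than 36 bp.
java -jar trimmomatic-0.32.jar PE -threads 16 -phred33 \
    reads_1.fastq.gz reads_2.fastq.gz \
    reads_1.paired.fastq.gz reads_1.unpaired.fastq.gz \
    reads_2.paired.fastq.gz reads_2.unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36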

Error correction with AllPaths-LG

My experience with the AllPaths-LG genome assembler has been great, and both anecdotal evidence and some rigorous studies seem to corroborate that impression. Earlier this week, I came across a link to this post on the AllPaths-LG blog.

The ALLPATHS-LG fragment read error correction modules are now available for stand alone use outside the full assembly pipeline. You can now easily error correct your own data using our method without running an assembly.

I’m glad I saw this when I did. I recently read over Steven Salzberg et al.’s GAGE evaluation, and remembered the paper mentioning that for many (most?) assemblers, using AllPaths-LG’s error correction module rather than Quake gave better results. I’m definitely interested in trying this out.