I had the opportunity to attend Cold Spring Harbor’s Genome Informatics conference this year. Here are a couple of my favorite highlights.
Michael Schatz’s presentation briefly mentioned metassembly, but a student at Notre Dame (a collaborator and former intern of Schatz’s) presented a poster dedicated to the subject. He implemented a program called Metassembler, which takes as input two different assemblies of the same data (perhaps from different assemblers, or from the same assembler run with two different parameter settings) and derives a consensus assembly of higher quality than either input.
When I spoke to the student presenting the poster, he said it would be a couple of weeks before the code was ready for distribution. Given that their wiki has not been updated since before the conference, I’m not holding my breath, although I will be very interested to try this software out when it is available.
Another poster I enjoyed was presented by a student (undergrad?) of Dr. John Karro of Miami University (Ohio). The student implemented a Hidden Markov Model to identify alternative sites of polyadenylation in transcripts. The HMM was pretty simple, but I enjoyed discussing the relevant biology, which was new to me.
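For readers who have not run into HMMs before, here is a minimal sketch of how one can label positions in a sequence. It is a toy of my own, not the student's model: the two states (`background` and `a_rich`), all of the probabilities, and the example sequence are made up for illustration. It uses the standard Viterbi algorithm to find the most likely sequence of hidden states.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observations.

    Works in log space to avoid underflow on long sequences.
    Assumes every probability used is strictly greater than zero.
    """
    # Scores for the first observation.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]  # back[t][s] = best predecessor of state s at position t

    for obs in observations[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            col[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][obs]))
            ptr[s] = best_prev
        scores.append(col)
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical toy parameters, chosen only to make the example clear-cut.
STATES = ("background", "a_rich")
START = {"background": 0.5, "a_rich": 0.5}
TRANS = {
    "background": {"background": 0.9, "a_rich": 0.1},
    "a_rich": {"background": 0.1, "a_rich": 0.9},
}
EMIT = {
    "background": {"A": 0.1, "C": 0.3, "G": 0.3, "T": 0.3},
    "a_rich": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
}

sequence = "CCCCAAAAAACCCC"
labels = viterbi(sequence, STATES, START, TRANS, EMIT)
# One character per position: "b" for background, "a" for a_rich.
print("".join(state[0] for state in labels))
```

A real model for polyadenylation sites would need states and emission probabilities informed by the biology (and trained on data); the point here is only the decoding step.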
One of the main sessions included a presentation about the Assemblathon genome assembly contest (since published in Genome Research). I don’t really remember much about which submissions or methods performed better than the others. What I enjoyed most about this presentation was the discussion of the different comparison metrics they developed to measure the relative quality of the submitted genome assemblies. One I remember off the top of my head was the cc50 measure, the “correct contiguity” analog of the n50 measure. Essentially, cc50 measures the distance at which 50% of the contigs (or scaffolds?) in the assembly are situated correctly with reference to the other contigs. They defined several other metrics to assess a variety of important characteristics of assembly quality. This is something I will definitely be going back to look at in more depth.
Steven Salzberg gave a presentation about the GAGE competition his research group conducted. Unlike the Assemblathon, which accepted community submissions, the GAGE project was conducted entirely by Salzberg’s lab. Essentially, they tested a wide variety of already available genome assemblers on real data and tried to assess the relative performance of each assembler. Rather than trying to drive innovation, this project is trying to address practical questions commonly faced by biologists in this new information age of biology.
The takeaway message I got from the presentation is that, by traditional assembly quality metrics (n50, n90, longest scaffold length, etc.), SOAPdenovo consistently generated the best results, followed by AllPathsLG. However, for each data set there was a high-quality reference assembly available for comparison, so they also assessed the quality of the assemblies after splitting contigs and scaffolds in regions containing large amounts of error. For these corrected assemblies, AllPathsLG consistently provided the best performance, followed by SOAPdenovo. At the end of the day, SOAPdenovo provides the largest, (perhaps) most complete assemblies, while AllPathsLG provides assemblies that may be a bit smaller but have far fewer errors.
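To make the difference between raw and corrected metrics concrete, here is a small sketch of how n50 and n90 can be computed, and how splitting contigs at error positions can flip a ranking. The function names, the breakpoint representation, and the two toy assemblies are my own inventions for illustration; this is not the GAGE code, and the numbers are not from the presentation.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (fraction=0.5 gives n50, 0.9 gives n90).

    This is the length L such that sequences of length >= L together
    contain at least `fraction` of all bases in the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    target = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length

def split_at_errors(length, breakpoints):
    """Split one contig at the given error positions; return piece lengths."""
    cuts = [0] + sorted(b for b in breakpoints if 0 < b < length) + [length]
    return [end - start for start, end in zip(cuts, cuts[1:])]

def corrected_lengths(contigs):
    """Apply split_at_errors to every (length, breakpoints) pair."""
    pieces = []
    for length, breakpoints in contigs:
        pieces.extend(split_at_errors(length, breakpoints))
    return pieces

# Hypothetical toy assemblies: (contig length, positions of detected errors).
assembly_a = [(100, [50]), (80, [20, 60]), (30, [])]   # longer contigs, more errors
assembly_b = [(60, []), (60, []), (50, []), (40, [])]  # shorter contigs, no errors

for name, contigs in (("A", assembly_a), ("B", assembly_b)):
    raw = [length for length, _ in contigs]
    fixed = corrected_lengths(contigs)
    print(name, "raw n50:", nx(raw), "corrected n50:", nx(fixed))
```

In this toy case assembly A wins on raw n50 (80 vs. 60) but loses once its contigs are broken at the error positions (40 vs. 60), which is the same kind of reversal described above.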
I enjoyed many other presentations and posters, but I only have so much time to sit and reflect on them now!