VAnG: schema-based validation of genome annotations

I am very excited to attend Cold Spring Harbor Lab’s 2013 Genome Informatics conference. I attended in 2011, and it is by far the best meeting I’ve attended as a graduate student. I’m hoping that with 2 additional years under my belt, it will be that much more enriching. And with an organizer/session chair lineup that includes 3 of my top picks for potential postdoc advisers (as well as other great scientists I know by reputation), there is a lot of potential for networking!

For a while, my plan has been to use this conference as an opportunity to present my work on annotation validation. I’ve written about this topic previously, and I felt like this would give me the chance to actually implement my ideas.

As it turns out, though, I ended up submitting an abstract for mRNAmarkup instead, a tool our group is developing for quality control and annotation of de novo assembled transcriptomes. However, I spent a lot of time brainstorming the validation work and clarifying my language on the topic, and I would hate for that to go to waste. So here is a rough draft of the abstract I had originally planned to submit.

Analyses of genomic sequences rely extensively on annotation of genomic features such as genes and transposable elements within those sequences. The Sequence Ontology provides a structured controlled vocabulary for describing genomic features, and the GFF3 Specification provides a standardized syntax for encoding those features and their subcomponents as an annotation graph. A common issue with the dissemination and use of genome annotations arises from the fact that alternative representations exist for the same genomic features, implicitly encoding the same information but each utilizing a different subset of ontological terms. The persistence of alternative formatting conventions highlights two related needs: a mechanism for formally describing an explicit annotation structure, and a mechanism for validating an annotation file against a particular structure description. To address these needs we have drawn parallels to XML-related technologies and developed a schema-based approach for validating genome annotations. Plain-text schema files describe representations of annotation graphs in terms of node connectivity, facilitating the validation of annotation data. We present VAnG, a tool for validating genome annotations, and discuss the implications of this tool for the dissemination and consumption of genome annotations.

There are two spots where I feel this draft still needs some work. First, the sentence beginning with “The persistence…” includes some vague language that should be improved. Second, the phrase “facilitating the validation of annotation data” adds very little value to the abstract and should be replaced with something more informative.

I don’t think it will take me too long to implement VAnG once I have some time to dedicate to it. I have already implemented the schema format and corresponding data structures for parsing schemas as part of the AEGeAn Toolkit. When it comes time to publish this work, I hope this abstract will provide a good starting point for communicating the need for this tool and the benefit it provides.
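
To make the idea of validating node connectivity a bit more concrete, here is a minimal Python sketch. It is not the VAnG schema format or anything from the AEGeAn Toolkit: the `SCHEMA` dictionary, `parse_gff3`, and `validate` are all hypothetical names invented for illustration, and the sketch assumes a schema can be reduced to "which parent types may each feature type attach to". It only looks at the type column and the `ID`/`Parent` attributes of each GFF3 line.

```python
# Hypothetical schema (an assumption, not the VAnG format):
# feature type -> set of parent types it may attach to.
# None means the feature may appear at the top level (no Parent attribute).
SCHEMA = {
    "gene": {None},
    "mRNA": {"gene"},
    "exon": {"mRNA"},
    "CDS": {"mRNA"},
}


def parse_gff3(text):
    """Yield (feature_id, feature_type, parent_ids) for each feature line."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated fields: {line!r}")
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        yield attrs.get("ID"), fields[2], parents


def validate(text, schema):
    """Return a list of messages for edges the schema does not allow."""
    features = list(parse_gff3(text))
    type_of = {fid: ftype for fid, ftype, _ in features if fid is not None}
    errors = []
    for fid, ftype, parents in features:
        label = fid or ftype
        if ftype not in schema:
            errors.append(f"{label}: type '{ftype}' not in schema")
            continue
        allowed = schema[ftype]
        if not parents:
            if None not in allowed:
                errors.append(f"{label}: '{ftype}' may not be a top-level feature")
            continue
        for pid in parents:
            ptype = type_of.get(pid)
            if ptype is None:
                errors.append(f"{label}: unknown parent '{pid}'")
            elif ptype not in allowed:
                errors.append(f"{label}: '{ftype}' may not attach to '{ptype}'")
    return errors
```

With this toy schema, an annotation that attaches exons directly to a gene (one of the alternative representations the abstract mentions) would be reported as a violation, while the same exons attached to an mRNA would pass. A real schema would presumably need to express more than parent types, so treat this only as an illustration of the general approach.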


