Category: Ontologies

VAnG: schema-based validation of genome annotations

I am very excited to attend Cold Spring Harbor Lab’s 2013 Genome Informatics conference. I attended in 2011, and it is by far the best meeting I’ve attended as a graduate student. I’m hoping with 2 additional years under my belt, it will be that much more enriching. And with an organizer/session chair lineup that includes 3 of my top picks for potential postdoc advisers (as well as other great scientists I know by reputation), there is a lot of potential for networking!

For a while, my plan has been to use this conference as an opportunity to present my work on annotation validation. I’ve written about this topic previously, and I felt like this would give me the chance to actually implement my ideas.

As it turns out, though, I ended up submitting an abstract instead for mRNAmarkup, a tool our group is developing for quality control and annotation of de novo assembled transcriptomes. However, I spent a lot of time brainstorming the validation work and clarifying my language on the topic, and I would hate for that to go to waste. So here is a rough draft of the abstract I had originally planned to submit.

Analyses of genomic sequences rely extensively on annotation of genomic features such as genes and transposable elements within those sequences. The Sequence Ontology provides a structured controlled vocabulary for describing genomic features, and the GFF3 Specification provides a standardized syntax for encoding those features and their subcomponents as an annotation graph. A common issue with the dissemination and use of genome annotations arises from the fact that alternative representations exist for the same genomic features, each implicitly encoding the same information but utilizing a different subset of ontological terms. The persistence of alternative formatting conventions highlights two related needs: a mechanism for formally describing an explicit annotation structure, and a mechanism for validating an annotation file against a particular structure description. To address these needs we have drawn parallels to XML-related technologies and developed a schema-based approach for validating genome annotations. Plain-text schema files describe representations of annotation graphs in terms of node connectivity, facilitating the validation of annotation data. We present VAnG, a tool for validating genome annotations, and discuss the implications of this tool for the dissemination and consumption of genome annotations.

There are two spots where I feel this draft still needs some work. First, the sentence beginning with “The persistence…” includes some vague language that should be improved. Second, the phrase “facilitating the validation of annotation data” adds very little value to the abstract and should be replaced with something more informative.

I don’t think it will take me too long to implement VAnG once I have some time to dedicate to it. I have already implemented the schema format and corresponding data structures for parsing schemas as part of the AEGeAn Toolkit. When it comes time to publish this work, I hope this abstract will provide a good starting point for communicating the need for this tool and the benefit it provides.

Validating sequence annotations

Anyone possessing more than a passing familiarity with gene annotation will understand the frustration I frequently experience when trying to analyze or compare GFF3 files obtained from multiple sources. While the syntactical rules established by the GFF3 specification are clear and consistently observed, there is an enormous amount of flexibility possible when deciding which exact terms and relationships to utilize when formatting data with GFF3. For example, consider the following gene structure annotation.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   exon    22053   22550   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22167   22550   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22651   23022   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   exon    22651   23448   .       +       .       Parent=chr8.g3.t1

This annotation makes it clear that we have a gene with a single transcription product. There are 2 exons, and although UTRs are not explicitly defined, they can be inferred from the exonic coordinates that do not overlap with the specified coding sequence. Now consider an alternative representation.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   five_prime_UTR  22053   22166   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22167   22550   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22651   23022   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   three_prime_UTR 23023   23448   .       +       .       Parent=chr8.g3.t1

In this representation, the UTRs have been explicitly defined, and although the exons have not, their coordinates can be inferred from the UTRs and coding sequence. Of course, there are additional alternative representations.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   exon    22053   22550   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   start_codon     22167   22169   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   exon    22651   23448   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   stop_codon      23020   23022   .       +       .       Parent=chr8.g3.t1

So which of these representations is correct? As far as the GFF3 spec is concerned, they all are! Depending on the exact data you are interested in extracting, one of the representations may be more convenient than the others, but they all encode the same information.
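To make the "same information" claim concrete, here is a minimal Python sketch that derives UTR intervals from the exon and CDS coordinates of the first representation. The function name infer_utrs is my own invention for illustration, and the sketch assumes a forward-strand transcript with 1-based inclusive coordinates, as in the samples above.

```python
# Minimal sketch: infer UTR intervals from exon and CDS coordinates.
# Assumes the forward strand and 1-based, inclusive coordinates (as in GFF3).

def infer_utrs(exons, cds):
    """Return (five_prime, three_prime) lists of (start, end) intervals."""
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    five_prime, three_prime = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Exonic sequence upstream of the first coding base
            five_prime.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Exonic sequence downstream of the last coding base
            three_prime.append((max(start, cds_end + 1), end))
    return five_prime, three_prime

# Exon and CDS coordinates copied from the first representation
exons = [(22053, 22550), (22651, 23448)]
cds = [(22167, 22550), (22651, 23022)]

five_prime, three_prime = infer_utrs(exons, cds)
print(five_prime)   # [(22053, 22166)]
print(three_prime)  # [(23023, 23448)]
```

The inferred intervals match the five_prime_UTR and three_prime_UTR features that the second representation spells out explicitly.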

One benefit of the GFF3 format is that it leverages the Sequence Ontology, which provides a very detailed and comprehensive description of entities within the realm of biological sequences and the semantics of the relationships between these entities. For example, the relationship between an mRNA and its associated coding sequence is captured in the SO by three terms (nodes) and two relationships (edges): CDS is_a mRNA_region and mRNA_region part_of mRNA. In the GFF3 samples above, the association of each mRNA feature with its corresponding CDS feature is valid, as the graph representing the ontology includes a path from the term CDS to the term mRNA.
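The path check described above can be sketched in a few lines of Python. The SO_FRAGMENT dictionary below is a hand-copied fragment containing only the two relationships mentioned in the text, not the real ontology file, and find_path is a hypothetical helper name.

```python
from collections import deque

# Hand-copied fragment of the ontology: term -> [(relation, related term)].
# Only the two edges discussed in the text are included.
SO_FRAGMENT = {
    "CDS": [("is_a", "mRNA_region")],
    "mRNA_region": [("part_of", "mRNA")],
}

def find_path(edges, source, target):
    """Breadth-first search; return the list of terms from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for _relation, parent in edges.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None

print(find_path(SO_FRAGMENT, "CDS", "mRNA"))  # ['CDS', 'mRNA_region', 'mRNA']
print(find_path(SO_FRAGMENT, "mRNA", "CDS"))  # None
```

A path exists from CDS to mRNA but not in the other direction, which is why attaching a CDS feature to an mRNA parent is acceptable while the reverse would not be.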

What seems to be lacking here is a mechanism for enforcing additional constraints on terms and relationships for specific contexts or use cases. To continue with the previous example, the SO can validate the relationship between CDS and mRNA, but it is completely silent as to whether an mRNA feature is valid without also explicitly defining its corresponding coding sequence, or whether a CDS feature is valid on its own without associating it with an mRNA feature. While I think the ability to encode so many types of data using GFF3+SO is one of its strengths, the lack of a mechanism for this additional layer of validation is a weakness, and the cause of a lot of the frustrations I’m documenting here.
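As a rough illustration of what such an additional layer might look like, the Python sketch below checks the two constraints just mentioned. Everything here is hypothetical: the REQUIRED_CHILDREN and REQUIRED_PARENT tables, the feat and check helpers, and the simplified feature records are assumptions made for the example, not part of GFF3, the SO, or any existing tool.

```python
# Hypothetical constraints for one particular use case.
REQUIRED_CHILDREN = {"mRNA": {"CDS"}}  # every mRNA must have at least one CDS child
REQUIRED_PARENT = {"CDS": "mRNA"}      # every CDS must be attached to an mRNA

def feat(ftype, fid=None, parent=None):
    """Simplified stand-in for a GFF3 record: type, ID, and Parent only."""
    return {"type": ftype, "id": fid, "parent": parent}

def check(features):
    """Return a list of human-readable constraint violations."""
    by_id = {f["id"]: f for f in features if f["id"] is not None}
    errors = []
    for f in features:
        wanted = REQUIRED_PARENT.get(f["type"])
        parent = by_id.get(f["parent"])
        if wanted and (parent is None or parent["type"] != wanted):
            errors.append(f["type"] + " has no " + wanted + " parent")
        child_types = {c["type"] for c in features
                       if c["parent"] is not None and c["parent"] == f["id"]}
        for needed in sorted(REQUIRED_CHILDREN.get(f["type"], set()) - child_types):
            errors.append(f["type"] + " " + str(f["id"]) + " has no " + needed + " child")
    return errors

# Simplified versions of the first and third representations above
with_cds = [
    feat("gene", "chr8.g3"),
    feat("mRNA", "chr8.g3.t1", "chr8.g3"),
    feat("exon", parent="chr8.g3.t1"),
    feat("CDS", parent="chr8.g3.t1"),
]
codons_only = [
    feat("gene", "chr8.g3"),
    feat("mRNA", "chr8.g3.t1", "chr8.g3"),
    feat("exon", parent="chr8.g3.t1"),
    feat("start_codon", parent="chr8.g3.t1"),
    feat("stop_codon", parent="chr8.g3.t1"),
]

print(check(with_cds))     # []
print(check(codons_only))  # ['mRNA chr8.g3.t1 has no CDS child']
```

Both inputs are acceptable GFF3, but only the first satisfies this particular set of constraints. A tool that expects explicit CDS features could reject the second up front instead of failing somewhere downstream.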

In a previous life I worked quite extensively with XML data, and I can draw some pretty useful parallels from that experience to my experience with gene annotation data. XML can be used to store data for anything from financial records to cooking recipes to multimedia metadata (and genome annotations of course!). As one might expect, however, an XML document containing financial records will by necessity look quite different from one used to store recipes. The XML specification dictates the syntax that XML files must use, but there is essentially unlimited flexibility with regard to the data types used and the structure and organization of the data.

Declaring an XML document “correct” requires two things: verifying that the data are well-formed, and verifying that the data are valid. Verifying that an XML document is well-formed is as simple as checking it against the official XML specification and ensuring that it obeys all of the syntax rules defined therein. Indeed, most software used to process XML data has a well-formedness check built into the XML parser and will fail gracefully if the document is not well-formed. However, verifying that an XML file is valid requires checking it against a schema that defines which data types are valid and how the data should be structured. XML schemas are very context-specific and cannot be found in the main specification. Rather, they are typically produced by individual communities of practice for use within that community. Using this mechanism, bank IT personnel do not need to worry about what makes an XML-encoded recipe valid, and recipe bloggers do not need to worry about what makes XML-encoded financial records valid.
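The distinction between the two checks can be sketched with Python's standard library. The well-formedness check uses the real xml.etree.ElementTree parser. The validity check is only a hand-rolled stand-in for a real schema language such as DTD or XML Schema, and the recipe rules it enforces (one title, at least one ingredient) are made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text obeys XML syntax, regardless of what it describes."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def is_valid_recipe(text):
    """Hand-rolled stand-in for a schema: a <recipe> root with exactly
    one <title> child and at least one <ingredient> child."""
    root = ET.fromstring(text)
    return (root.tag == "recipe"
            and len(root.findall("title")) == 1
            and len(root.findall("ingredient")) >= 1)

broken = "<recipe><title>Toast</recipe>"              # mismatched tags
ledger = "<ledger><entry amount='12.50'/></ledger>"   # fine syntax, wrong domain
recipe = "<recipe><title>Toast</title><ingredient>bread</ingredient></recipe>"

print(is_well_formed(broken))   # False
print(is_well_formed(ledger))   # True
print(is_valid_recipe(ledger))  # False
print(is_valid_recipe(recipe))  # True
```

The ledger document passes the syntax check but fails the recipe rules, which is exactly the situation of a GFF3 file that is syntactically fine but not structured the way a given tool expects.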

A mechanism analogous to validating XML files via a schema, applied to sequence annotations, could mitigate a lot of the difficulty and frustration associated with sharing annotation data. The flexibility provided by GFF3 is definitely a good thing; surely it’s better than coming up with new half-baked formats for each different type of data as we continue to find new useful ways to annotate sequences. But there needs to be a way to formally place additional constraints on a GFF3 file for use in a particular context. Rather than trying to anticipate every input contingency under the sun (which has recently become my approach), developers of analysis tools could instead provide a schema with which a user could validate their input data. Scientists interested only in TE annotations could safely ignore the formatting conventions that others are using to annotate protein-coding genes or epigenetic marks or a variety of other data types. Of course, solving the technical issue of developing such a validation scheme (or leveraging an existing one) is quite different from achieving wide adoption within the community, but until we address this issue we can continue to expect the same types of inconsistency and frustration with annotation data.

A plug for ontologies

An ontology is a formal (mathematical) representation of knowledge for a particular information domain: the types of objects that exist in that domain, and the relationships between those objects. Formally organizing terms using a controlled vocabulary and taxonomy makes it possible for us to exchange and analyze information in that domain in a way that computers can understand.

So what does this have to do with biology? My writing of this note was prompted by two things: first, I found this abstract in my RSS feed earlier this week, and then I saw this question on a bioinformatics Q&A site of which I’m a regular participant. Ontologies seem to be gaining substantial traction in biology and bioinformatics.

Anyone who has not been living under a rock for the last several decades knows that the life sciences are increasingly becoming information- and data-driven. This is certainly the case in such fields as genomics, where working with gigabytes of raw high-throughput sequence data is routine. But even in other branches of biology (traditionally characterized by tedious and complicated assays designed to tease out small, precious nuggets of truth), advances in technology and the accumulation of decades of research are putting huge amounts of information at the fingertips of practically any scientist who cares to use it. There is no way to take advantage of these data without computers, but computers will not be much help unless we can formulate our biological problems in formal (mathematical) terms.

Perhaps the most popular and successful ontologies in the life sciences are the Gene Ontology and the Sequence Ontology. I’ve worked quite a bit with the Sequence Ontology, which originated from a need to intelligently exchange annotations describing biological sequences. You can imagine how difficult it could be to share information between different genome sequencing and annotation projects if each used different conventions for describing the data. For example, what project A calls a “transcript” is called an “mRNA” by project B, which is called just “RNA” by project C. Of course, any scientist that takes a moment to look at this can figure it out quickly, but we cannot expect a computer to understand this relationship unless we tell it. The Sequence Ontology provides a controlled vocabulary for describing biological sequences, as well as relationships between terms (e.g., a “coding sequence” is associated with an “mRNA”, which is associated with a “gene”).

Biological ontologies are not limited to describing genes and sequences, however. The OBO Foundry supports a handful of ontologies for various subdomains of biological inquiry, encompassing everything from the finest molecular details to coarse-grained systematics and anatomy. OBO also has a list of dozens of other biological and biomedical ontologies that are either candidates for OBO support or that are externally developed and simply recognized by OBO.

As biology continues to become an information science, ontologies are going to play an increasingly critical role in our representation, exchange, and analysis of biological data.