Validating sequence annotations

Anyone possessing more than a passing familiarity with gene annotation will understand the frustration I frequently experience when trying to analyze or compare GFF3 files obtained from multiple sources. While the syntactical rules established by the GFF3 specification are clear and consistently observed, there is an enormous amount of flexibility possible when deciding which exact terms and relationships to utilize when formatting data with GFF3. For example, consider the following gene structure annotation.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   exon    22053   22550   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22167   22550   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22651   23022   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   exon    22651   23448   .       +       .       Parent=chr8.g3.t1

This annotation makes it clear that we have a gene with a single transcription product. There are 2 exons, and although UTRs are not explicitly defined, they can be inferred from the exonic coordinates that do not overlap with the specified coding sequence. Now consider an alternative representation.
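
To make the inference concrete, here is a minimal Python sketch of how the UTRs could be recovered from the exon and CDS coordinates above. The function names `subtract_intervals` and `label_utrs` are my own inventions for illustration, not part of any GFF3 library, and the sketch assumes 1-based inclusive coordinates and a single transcript.

```python
def subtract_intervals(exons, cds):
    """Return the parts of each exon not covered by any CDS segment.

    Coordinates are 1-based and inclusive, as in GFF3.
    """
    remainder = []
    for ex_start, ex_end in sorted(exons):
        pos = ex_start
        for c_start, c_end in sorted(cds):
            # Skip CDS segments that do not touch the unprocessed part of this exon.
            if c_end < pos or c_start > ex_end:
                continue
            if c_start > pos:
                remainder.append((pos, c_start - 1))
            pos = max(pos, c_end + 1)
        if pos <= ex_end:
            remainder.append((pos, ex_end))
    return remainder


def label_utrs(utrs, cds, strand):
    """Label each leftover interval as a 5' or 3' UTR, depending on strand."""
    cds_start = min(start for start, _ in cds)
    labeled = []
    for start, end in utrs:
        upstream = end < cds_start
        if strand == "-":
            upstream = not upstream
        kind = "five_prime_UTR" if upstream else "three_prime_UTR"
        labeled.append((kind, start, end))
    return labeled


exons = [(22053, 22550), (22651, 23448)]
cds = [(22167, 22550), (22651, 23022)]
utrs = subtract_intervals(exons, cds)
print(label_utrs(utrs, cds, "+"))
# [('five_prime_UTR', 22053, 22166), ('three_prime_UTR', 23023, 23448)]
```

The result matches the coordinates that the next representation spells out explicitly.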

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   five_prime_UTR  22053   22166   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22167   22550   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22651   23022   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   three_prime_UTR 23023   23448   .       +       .       Parent=chr8.g3.t1

In this representation, the UTRs have been explicitly defined, and although the exons have not, their coordinates can be inferred from the UTRs and coding sequence. Of course, there are additional alternative representations.
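
The reverse inference can be sketched just as briefly: merging UTR and CDS segments that touch or overlap yields the exons. The helper name `merge_adjacent` is hypothetical, and the sketch again assumes 1-based inclusive coordinates for a single transcript.

```python
def merge_adjacent(segments):
    """Merge segments that overlap or sit directly next to each other."""
    merged = []
    for start, end in sorted(segments):
        # "start <= previous end + 1" treats abutting segments as one exon.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


segments = [
    (22053, 22166),  # five_prime_UTR
    (22167, 22550),  # CDS
    (22651, 23022),  # CDS
    (23023, 23448),  # three_prime_UTR
]
inferred_exons = merge_adjacent(segments)
print(inferred_exons)
# [(22053, 22550), (22651, 23448)]
```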

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   exon    22053   22550   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   start_codon     22167   22169   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   exon    22651   23448   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   stop_codon      23020   23022   .       +       .       Parent=chr8.g3.t1

So which of these representations is correct? As far as the GFF3 spec is concerned, they all are! Depending on the exact data you are interested in extracting, one of the representations may be more convenient than the others, but they all encode the same information.
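
As a rough check on that claim, the CDS segments can also be recovered from the third representation by clipping each exon to the span between the start and stop codons. The function name `cds_from_codons` is hypothetical, and the sketch assumes that the stop codon is counted as part of the coding sequence, as it is in the examples above (other conventions exclude it).

```python
def cds_from_codons(exons, start_codon, stop_codon):
    """Clip each exon to the span bounded by the start and stop codons.

    Taking min/max of the codon coordinates keeps this strand-independent.
    Assumes the stop codon belongs to the CDS, as in the examples here.
    """
    lo = min(start_codon[0], stop_codon[0])
    hi = max(start_codon[1], stop_codon[1])
    cds = []
    for ex_start, ex_end in sorted(exons):
        start, end = max(ex_start, lo), min(ex_end, hi)
        if start <= end:
            cds.append((start, end))
    return cds


codon_exons = [(22053, 22550), (22651, 23448)]
derived_cds = cds_from_codons(codon_exons, (22167, 22169), (23020, 23022))
print(derived_cds)
# [(22167, 22550), (22651, 23022)]
```

These are the same CDS coordinates given explicitly in the first two representations.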

One benefit of the GFF3 format is that it leverages the Sequence Ontology, which provides a very detailed and comprehensive description of entities within the realm of biological sequences and the semantics of the relationships between these entities. For example, the relationship between an mRNA and its associated coding sequence is captured in the SO by three terms (nodes) and two relationships (edges): CDS is_a mRNA_region and mRNA_region part_of mRNA. In the GFF3 samples above, the association of each mRNA feature with its corresponding CDS feature is valid, as the graph representing the ontology includes a path from the term CDS to the term mRNA.
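
The path check described above can be illustrated with a toy graph search. This sketch encodes only the two edges mentioned in this paragraph; the real Sequence Ontology is far larger, and the names `SO_EDGES` and `find_path` are hypothetical, not part of any SO tooling.

```python
from collections import deque

# Only the two relationships named in the text; not the full ontology.
SO_EDGES = {
    "CDS": [("is_a", "mRNA_region")],
    "mRNA_region": [("part_of", "mRNA")],
}


def find_path(child, ancestor, edges=SO_EDGES):
    """Breadth-first search for a chain of relationships from child to ancestor."""
    queue = deque([(child, [])])
    seen = {child}
    while queue:
        term, path = queue.popleft()
        if term == ancestor:
            return path
        for relation, target in edges.get(term, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(term, relation, target)]))
    return None  # no path: the association is not supported by the graph


print(find_path("CDS", "mRNA"))
# [('CDS', 'is_a', 'mRNA_region'), ('mRNA_region', 'part_of', 'mRNA')]
print(find_path("mRNA", "CDS"))
# None
```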

What seems to be lacking here is a mechanism for enforcing additional constraints on terms and relationships for specific contexts or use cases. To continue with the previous example, the SO can validate the relationship between CDS and mRNA, but it is completely silent as to whether an mRNA feature is valid without also explicitly defining its corresponding coding sequence, or whether a CDS feature is valid on its own without associating it with an mRNA feature. While I think the ability to encode so many types of data using GFF3+SO is one of its strengths, the lack of a mechanism for this additional layer of validation is a weakness, and the cause of a lot of the frustrations I’m documenting here.

In a previous life I worked quite extensively with XML data, and I can draw some pretty useful parallels from that experience to my experience with gene annotation data. XML can be used to store data for anything from financial records to cooking recipes to multimedia metadata (and genome annotations of course!). As one might expect, however, an XML document containing financial records will by necessity look quite different from one used to store recipes. The XML specification dictates the syntax that XML files must use, but there is an infinite amount of flexibility with regard to the data types used and the structure and organization of the data.

Declaring an XML document “correct” requires two things: verifying that the data are well-formed, and verifying that the data are valid. Verifying that an XML document is well-formed is as simple as checking it against the official XML specification and ensuring that it obeys all of the syntax rules defined therein. Indeed, most software used to process XML data has a well-formedness check built into the XML parser and will fail gracefully if the document is not well-formed. However, verifying that an XML file is valid requires checking it against a schema that defines which data types are valid and how the data should be structured. XML schemas are very context-specific and cannot be found in the main specification. Rather, they are typically produced by individual communities of practice for use within that community. Using this mechanism, bank IT personnel do not need to worry about what makes an XML-encoded recipe valid, and recipe bloggers do not need to worry about what makes XML-encoded financial records valid.
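
The two checks can be separated in a few lines of Python. The well-formedness test uses the standard library parser; the validity test is only a toy stand-in for a real schema language (DTD, XML Schema, RELAX NG), and the `REQUIRED_CHILDREN` rule set and the recipe elements are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-"schema": a recipe needs a title and at least one ingredient.
REQUIRED_CHILDREN = {"recipe": ["title", "ingredient"]}


def is_well_formed(text):
    """True if the text obeys XML syntax, regardless of what it contains."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid(text, required=REQUIRED_CHILDREN):
    """True if a well-formed document also satisfies the context-specific rules."""
    root = ET.fromstring(text)
    needed = required.get(root.tag)
    if needed is None:
        return False  # root element unknown to this schema
    return all(root.find(child) is not None for child in needed)


good = "<recipe><title>Soup</title><ingredient>water</ingredient></recipe>"
incomplete = "<recipe><title>Soup</title></recipe>"
broken = "<recipe><title>Soup</recipe>"

print(is_well_formed(broken))      # False: mismatched tags
print(is_well_formed(incomplete))  # True: syntax is fine
print(is_valid(incomplete))        # False: no ingredient element
print(is_valid(good))              # True
```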

A mechanism analogous to validating XML files via a schema, applied to sequence annotations, could mitigate a lot of the difficulty and frustration associated with sharing annotation data. The flexibility provided by GFF3 is definitely a good thing: surely it’s better than coming up with new half-baked formats for each different type of data as we continue to find new useful ways to annotate sequences. But there needs to be a way to formally place additional constraints on a GFF3 file for use in a particular context. Rather than trying to anticipate every input contingency under the sun (which has recently become my approach), developers of analysis tools could instead provide a schema with which a user could validate their input data. Scientists interested only in TE annotations could safely ignore the formatting conventions that others are using to annotate protein-coding genes or epigenetic marks or a variety of other data types. Of course, solving the technical issue of developing such a validation scheme (or leveraging an existing one) is quite different from achieving wide adoption within the community, but until we address this issue we can continue to expect the same types of inconsistency and frustration with annotation data.
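
To give a flavor of what such a schema might look like, here is a small Python sketch. Everything in it is an assumption made for illustration: the `SCHEMA` dictionary format, the rule that an mRNA needs a gene parent plus explicit exon and CDS children, and the function names. It is not an existing tool or a proposed standard.

```python
# Hypothetical schema: what one particular tool expects of each feature type.
SCHEMA = {
    "mRNA": {"parent_type": "gene", "required_children": {"exon", "CDS"}},
}


def parse_gff3(text):
    """Extract type, ID and Parent from each feature line (whitespace-delimited here)."""
    features = []
    for line in text.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)  # real GFF3 is tab-delimited
        attrs = dict(pair.split("=", 1) for pair in fields[8].strip().split(";") if pair)
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        features.append({"type": fields[2], "id": attrs.get("ID"), "parents": parents})
    return features


def validate(features, schema=SCHEMA):
    """Return a list of human-readable schema violations (empty if none)."""
    by_id = {f["id"]: f for f in features if f["id"]}
    children = {}
    for f in features:
        for pid in f["parents"]:
            children.setdefault(pid, set()).add(f["type"])
    errors = []
    for f in features:
        rule = schema.get(f["type"])
        if rule is None:
            continue  # the schema says nothing about this feature type
        parent_types = {by_id[p]["type"] for p in f["parents"] if p in by_id}
        if rule["parent_type"] not in parent_types:
            errors.append(f"{f['id']}: expected a parent of type {rule['parent_type']}")
        missing = rule["required_children"] - children.get(f["id"], set())
        for child_type in sorted(missing):
            errors.append(f"{f['id']}: missing required child {child_type}")
    return errors


explicit_utrs = """
chr8 CpGAT gene 22053 23448 . + . ID=chr8.g3;Name=chr8.g3
chr8 CpGAT mRNA 22053 23448 . + . ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8 CpGAT five_prime_UTR 22053 22166 . + . Parent=chr8.g3.t1
chr8 CpGAT CDS 22167 22550 . + 0 Parent=chr8.g3.t1
chr8 CpGAT CDS 22651 23022 . + 0 Parent=chr8.g3.t1
chr8 CpGAT three_prime_UTR 23023 23448 . + . Parent=chr8.g3.t1
"""
errors = validate(parse_gff3(explicit_utrs))
print(errors)
# ['chr8.g3.t1: missing required child exon']
```

Under this made-up schema, the first representation from the top of this post passes, while the second is flagged for leaving its exons implicit. A different tool could publish a different schema and accept it.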
