Anyone possessing more than a passing familiarity with gene annotation will understand the frustration I frequently experience when trying to analyze or compare GFF3 files obtained from multiple sources. While the syntactical rules established by the GFF3 specification are clear and consistently observed, there is an enormous amount of flexibility possible when deciding which exact terms and relationships to utilize when formatting data with GFF3. For example, consider the following gene structure annotation.
chr8 CpGAT gene 22053 23448 . + . ID=chr8.g3;Name=chr8.g3
chr8 CpGAT mRNA 22053 23448 . + . ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8 CpGAT exon 22053 22550 . + . Parent=chr8.g3.t1
chr8 CpGAT CDS 22167 22550 . + 0 Parent=chr8.g3.t1
chr8 CpGAT CDS 22651 23022 . + 0 Parent=chr8.g3.t1
chr8 CpGAT exon 22651 23448 . + . Parent=chr8.g3.t1
This annotation makes it clear that we have a gene with a single transcription product. There are 2 exons, and although UTRs are not explicitly defined, they can be inferred from the exonic coordinates that do not overlap with the specified coding sequence. Now consider an alternative representation.
chr8 CpGAT gene 22053 23448 . + . ID=chr8.g3;Name=chr8.g3
chr8 CpGAT mRNA 22053 23448 . + . ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8 CpGAT five_prime_UTR 22053 22166 . + . Parent=chr8.g3.t1
chr8 CpGAT CDS 22167 22550 . + 0 Parent=chr8.g3.t1
chr8 CpGAT CDS 22651 23022 . + 0 Parent=chr8.g3.t1
chr8 CpGAT three_prime_UTR 23023 23448 . + . Parent=chr8.g3.t1
In this representation, the UTRs have been explicitly defined, and although the exons have not, their coordinates can be inferred from the UTRs and coding sequence. Of course, there are additional alternative representations.
chr8 CpGAT gene 22053 23448 . + . ID=chr8.g3;Name=chr8.g3
chr8 CpGAT mRNA 22053 23448 . + . ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8 CpGAT exon 22053 22550 . + . Parent=chr8.g3.t1
chr8 CpGAT start_codon 22167 22169 . + . Parent=chr8.g3.t1
chr8 CpGAT exon 22651 23448 . + . Parent=chr8.g3.t1
chr8 CpGAT stop_codon 23020 23022 . + . Parent=chr8.g3.t1
So which of these representations is correct? As far as the GFF3 spec is concerned, they all are! Depending on the exact data you are interested in extracting, one of the representations may be more convenient than the others, but they all encode the same information.
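To make that equivalence concrete, here is a minimal Python sketch that converts between the three representations using plain interval arithmetic. The helper names (`subtract`, `merge`, `clip`) are my own, not part of any GFF3 library, and the `clip` shortcut assumes a plus-strand transcript like the one above.

```python
# Coordinates are 1-based and inclusive, as in GFF3.
EXONS = [(22053, 22550), (22651, 23448)]
CDS = [(22167, 22550), (22651, 23022)]
UTRS = [(22053, 22166), (23023, 23448)]


def subtract(intervals, blockers):
    """Return the parts of `intervals` that are not covered by `blockers`."""
    result = []
    for start, end in sorted(intervals):
        pos = start
        for b_start, b_end in sorted(blockers):
            if b_end < pos or b_start > end:
                continue  # this blocker does not touch the remaining span
            if b_start > pos:
                result.append((pos, b_start - 1))
            pos = max(pos, b_end + 1)
        if pos <= end:
            result.append((pos, end))
    return result


def merge(intervals):
    """Merge overlapping or directly adjacent intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def clip(intervals, lo, hi):
    """Keep only the parts of `intervals` that fall within [lo, hi]."""
    return [(max(s, lo), min(e, hi)) for s, e in sorted(intervals)
            if s <= hi and e >= lo]


# Representation 1 -> UTRs: exonic sequence that is not coding.
print(subtract(EXONS, CDS))        # [(22053, 22166), (23023, 23448)]
# Representation 2 -> exons: adjacent UTR and CDS segments joined.
print(merge(UTRS + CDS))           # [(22053, 22550), (22651, 23448)]
# Representation 3 -> CDS: exons clipped from start codon to stop codon.
print(clip(EXONS, 22167, 23022))   # [(22167, 22550), (22651, 23022)]
```

Each conversion is only a few lines, but every tool that reads GFF3 has to decide which of them it is willing to perform on the user's behalf.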
One benefit of the GFF3 format is that it leverages the Sequence Ontology, which provides a very detailed and comprehensive description of entities within the realm of biological sequences and the semantics of the relationships between these entities. For example, the relationship between an mRNA and its associated coding sequence is captured in the SO by three terms (nodes) and two relationships (edges): CDS is_a mRNA_region and mRNA_region part_of mRNA. In the GFF3 samples above, the association of each mRNA feature with its corresponding CDS feature is valid, as the graph representing the ontology includes a path from the term CDS to the term mRNA.
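The path check described here can be sketched as a breadth-first search over a tiny graph. The `SO_EDGES` dictionary below holds only the two edges quoted above; it is a hand-written stand-in, not a loader for the actual Sequence Ontology, and `find_path` is a name I made up for illustration.

```python
from collections import deque

# Only the two relationships mentioned in the text; the real ontology is far larger.
SO_EDGES = {
    "CDS": [("is_a", "mRNA_region")],
    "mRNA_region": [("part_of", "mRNA")],
}


def find_path(edges, source, target):
    """Breadth-first search from `source` to `target`.

    Returns a list of (term, relation, term) steps, or None if no path exists.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        term, path = queue.popleft()
        if term == target:
            return path
        for relation, parent in edges.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, path + [(term, relation, parent)]))
    return None


print(find_path(SO_EDGES, "CDS", "mRNA"))
# [('CDS', 'is_a', 'mRNA_region'), ('mRNA_region', 'part_of', 'mRNA')]
print(find_path(SO_EDGES, "mRNA", "CDS"))
# None
```

The search confirms that a CDS may be attached to an mRNA, but it has no way to express that an mRNA must have one.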
What seems to be lacking here is a mechanism for enforcing additional constraints on terms and relationships for specific contexts or use cases. To continue with the previous example, the SO can validate the relationship between CDS and mRNA, but it is completely silent as to whether an mRNA feature is valid without also explicitly defining its corresponding coding sequence, or whether a CDS feature is valid on its own without associating it with an mRNA feature. While I think the ability to encode so many types of data using GFF3+SO is one of its strengths, the lack of a mechanism for this additional layer of validation is a weakness, and the cause of a lot of the frustrations I’m documenting here.
In a previous life I worked quite extensively with XML data, and I can draw some pretty useful parallels from that experience to my experience with gene annotation data. XML can be used to store data for anything from financial records to cooking recipes to multimedia metadata (and genome annotations of course!). As one might expect, however, an XML document containing financial records will by necessity look quite different from one used to store recipes. The XML specification dictates the syntax that XML files must use, but there is nearly unlimited flexibility with regard to the data types used and the structure and organization of the data.
Declaring an XML document “correct” requires two things: verifying that the data are well-formed, and verifying that the data are valid. Verifying that an XML document is well-formed is as simple as checking it against the official XML specification and ensuring that it obeys all of the syntax rules defined therein. Indeed, most software used to process XML data has a well-formedness check built into the XML parser and will fail gracefully if the document is not well-formed. However, verifying that an XML file is valid requires checking it against a schema that defines which data types are valid and how the data should be structured. XML schemas are very context-specific and cannot be found in the main specification. Rather, they are typically produced by individual communities of practice for use within that community. Using this mechanism, bank IT personnel do not need to worry about what makes an XML-encoded recipe valid, and recipe bloggers do not need to worry about what makes XML-encoded financial records valid.
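The two checks can be told apart with a small Python sketch. The well-formedness test relies on the standard library parser, which raises `ParseError` on malformed input. The validity test is a hand-written stand-in for a real schema (Python's standard library does not validate against DTD or XML Schema documents), and the recipe rule it enforces is invented purely for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if `text` obeys XML syntax, regardless of what it contains."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid_recipe(text):
    """Hypothetical community rule: a <recipe> needs a <title> and an <ingredient>."""
    root = ET.fromstring(text)
    return (root.tag == "recipe"
            and root.find("title") is not None
            and len(root.findall("ingredient")) > 0)


good = "<recipe><title>Toast</title><ingredient>bread</ingredient></recipe>"
odd = "<recipe><amount>3</amount></recipe>"    # fine syntax, wrong content
broken = "<recipe><title>Toast</recipe>"       # mismatched tags

print(is_well_formed(good), is_valid_recipe(good))   # True True
print(is_well_formed(odd), is_valid_recipe(odd))     # True False
print(is_well_formed(broken))                        # False
```

The `odd` document is the interesting case: the parser accepts it, and only the community-specific rule flags it.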
A mechanism analogous to validating XML files via a schema, applied to sequence annotations, could mitigate a lot of the difficulty and frustration associated with sharing annotation data. The flexibility provided by GFF3 is definitely a good thing; surely it’s better than coming up with new half-baked formats for each different type of data as we continue to find new useful ways to annotate sequences. But there needs to be a way to formally place additional constraints on a GFF3 file for use in a particular context. Rather than trying to anticipate every input contingency under the sun (which has recently become my approach), developers of analysis tools could instead provide a schema with which a user could validate their input data. Scientists interested only in TE annotations could safely ignore the formatting conventions that others are using to annotate protein-coding genes or epigenetic marks or a variety of other data types. Of course, solving the technical issue of developing such a validation scheme (or leveraging an existing one) is quite different from achieving wide adoption within the community, but until we address this issue we can continue to expect the same types of inconsistency and frustration with annotation data.
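As a rough illustration of what such a schema might look like, here is a Python sketch. Everything in it is hypothetical: the `SCHEMA` format, the rule names, and the `validate` function are my own invention, not an existing standard or tool. The parser is also simplified: it assumes tab-delimited input and a single `Parent` per feature.

```python
def parse_gff3(text):
    """Extract type, ID and Parent from each feature line (simplified)."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
        features.append({"type": cols[2], "id": attrs.get("ID"),
                         "parent": attrs.get("Parent")})
    return features


# Hypothetical schema for a tool that wants explicit exons and CDS.
SCHEMA = {
    "gene": {"parents": {None}, "requires": {"mRNA"}},
    "mRNA": {"parents": {"gene"}, "requires": {"exon", "CDS"}},
    "exon": {"parents": {"mRNA"}, "requires": set()},
    "CDS": {"parents": {"mRNA"}, "requires": set()},
}


def validate(features, schema):
    """Return a list of human-readable violations of `schema`."""
    errors = []
    type_of = {f["id"]: f["type"] for f in features if f["id"]}
    children = {}
    for f in features:
        if f["parent"]:
            children.setdefault(f["parent"], set()).add(f["type"])
    for f in features:
        rule = schema.get(f["type"])
        if rule is None:
            errors.append(f"{f['type']}: type not allowed by this schema")
            continue
        parent_type = type_of.get(f["parent"], "<unknown>") if f["parent"] else None
        if parent_type not in rule["parents"]:
            errors.append(f"{f['type']}: unexpected parent type {parent_type}")
        missing = rule["requires"] - children.get(f["id"], set())
        if missing:
            errors.append(f"{f['type']} {f['id']}: missing {sorted(missing)}")
    return errors


def to_tabs(text):
    """Turn the space-separated listings from this post into tab-delimited GFF3."""
    return "\n".join("\t".join(row.split()) for row in text.strip().splitlines())


WITH_EXONS = to_tabs("""
chr8 CpGAT gene 22053 23448 . + . ID=chr8.g3;Name=chr8.g3
chr8 CpGAT mRNA 22053 23448 . + . ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8 CpGAT exon 22053 22550 . + . Parent=chr8.g3.t1
chr8 CpGAT CDS 22167 22550 . + 0 Parent=chr8.g3.t1
chr8 CpGAT CDS 22651 23022 . + 0 Parent=chr8.g3.t1
chr8 CpGAT exon 22651 23448 . + . Parent=chr8.g3.t1
""")
WITH_UTRS = WITH_EXONS.replace("exon\t22053\t22550", "five_prime_UTR\t22053\t22166") \
                      .replace("exon\t22651\t23448", "three_prime_UTR\t23023\t23448")

print(validate(parse_gff3(WITH_EXONS), SCHEMA))  # []
for problem in validate(parse_gff3(WITH_UTRS), SCHEMA):
    print(problem)
```

Under this particular schema the first representation passes and the UTR-based one is rejected, even though both are perfectly legal GFF3. A different tool could ship a different schema and accept the opposite.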