Although GFF3 is no doubt the richest and most flexible of the popular tab-delimited text-based genome annotation formats, it unfortunately comes with some baggage. Some of this has to do with the fact that GFF3 looks a lot like a variety of other tab-delimited formats that have more permissive or less flexible formatting rules/conventions (indeed, it was the limitations of these formats that led to the GFF3 specification). Some of this has to do with the fact that most scientists learn what they know about GFF3, at least initially, from examples rather than from the specification, which can be problematic if someone’s first exposure to GFF3 is an incorrect file. Some of this has to do with the fact that while there are several available tools that will validate the syntax of your GFF3 file, there are no generalized tools for checking the content of your GFF3 file (see this previous post for a related discussion).
Despite this “baggage,” I still think GFF3 is superior to its tab-based alternatives. There are a couple of concepts, however, that seemingly are not widely understood and have not been uniformly adopted (based on my experience with different data sources and tools). I have a couple of selfish reasons for posting this thread: first, to clarify my own thoughts on the matter, and second, so that maybe some people will learn something by reading this and make my life easier in the future! But really, a better and broader understanding of these points would benefit the whole genome informatics community.
The first concept that needs work is that of multi-line features. Here is an illustrative example (you may have to scroll over to see the 9th column).
chr8	CpGAT	gene	72	5081	.	-	.	ID=chr8.g1;Name=chr8.g1
chr8	CpGAT	mRNA	72	5081	.	-	.	ID=chr8.g1.m1;Parent=chr8.g1;index=1;Name=chr8.g1.t1
chr8	CpGAT	exon	72	167	.	-	.	ID=chr8.g1.m1.exon1;Parent=chr8.g1.m1
chr8	CpGAT	exon	349	522	.	-	.	ID=chr8.g1.m1.exon2;Parent=chr8.g1.m1
chr8	CpGAT	exon	611	702	.	-	.	ID=chr8.g1.m1.exon3;Parent=chr8.g1.m1
chr8	CpGAT	exon	4916	5081	.	-	.	ID=chr8.g1.m1.exon4;Parent=chr8.g1.m1
chr8	CpGAT	CDS	72	167	.	-	0	ID=chr8.g1.m1.cds1;Parent=chr8.g1.m1
chr8	CpGAT	CDS	349	522	.	-	0	ID=chr8.g1.m1.cds2;Parent=chr8.g1.m1
chr8	CpGAT	CDS	611	702	.	-	2	ID=chr8.g1.m1.cds3;Parent=chr8.g1.m1
chr8	CpGAT	CDS	4916	5081	.	-	0	ID=chr8.g1.m1.cds4;Parent=chr8.g1.m1
So what’s the matter with this example? Not much really, just that it’s incorrect. It essentially annotates four distinct coding sequences (check the ID attributes) corresponding to a single transcript—not unheard of with prokaryotic polycistrons, but I see this all the time in files annotating eukaryotic genomes. No, this mRNA does not contain four coding sequences, but a single coding sequence consisting of four segments that are discontinuous along the genome. The correct way to encode discontinuous features is to give each corresponding line the same ID attribute, like so.
chr8	CpGAT	gene	72	5081	.	-	.	ID=chr8.g1;Name=chr8.g1
chr8	CpGAT	mRNA	72	5081	.	-	.	ID=chr8.g1.m1;Parent=chr8.g1;index=1;Name=chr8.g1.t1
chr8	CpGAT	exon	72	167	.	-	.	ID=chr8.g1.m1.exon1;Parent=chr8.g1.m1
chr8	CpGAT	exon	349	522	.	-	.	ID=chr8.g1.m1.exon2;Parent=chr8.g1.m1
chr8	CpGAT	exon	611	702	.	-	.	ID=chr8.g1.m1.exon3;Parent=chr8.g1.m1
chr8	CpGAT	exon	4916	5081	.	-	.	ID=chr8.g1.m1.exon4;Parent=chr8.g1.m1
chr8	CpGAT	CDS	72	167	.	-	0	ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8	CpGAT	CDS	349	522	.	-	0	ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8	CpGAT	CDS	611	702	.	-	2	ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8	CpGAT	CDS	4916	5081	.	-	0	ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
Now, because the four CDS lines share the same ID attribute, they collectively represent a single feature.
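To make the shared-ID rule concrete, here is a minimal Python sketch of how a parser might merge lines into multi-line features. The function names (`parse_attributes`, `group_entries_by_id`) are my own for illustration and do not come from any existing library. The sketch assumes tab-delimited input and skips details such as percent-decoding of attribute values.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> raw value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def group_entries_by_id(lines):
    """Map each ID to the list of (type, start, end) lines that carry it.

    Lines sharing an ID are treated as segments of one feature.
    Hypothetical helper: lines without an ID are simply ignored here.
    """
    features = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 record
        feature_id = parse_attributes(fields[8]).get("ID")
        if feature_id is not None:
            features[feature_id].append((fields[2], int(fields[3]), int(fields[4])))
    return dict(features)


# The four CDS lines from the corrected example above.
cds_lines = [
    "chr8\tCpGAT\tCDS\t72\t167\t.\t-\t0\tID=chr8.g1.m1.cds;Parent=chr8.g1.m1",
    "chr8\tCpGAT\tCDS\t349\t522\t.\t-\t0\tID=chr8.g1.m1.cds;Parent=chr8.g1.m1",
    "chr8\tCpGAT\tCDS\t611\t702\t.\t-\t2\tID=chr8.g1.m1.cds;Parent=chr8.g1.m1",
    "chr8\tCpGAT\tCDS\t4916\t5081\t.\t-\t0\tID=chr8.g1.m1.cds;Parent=chr8.g1.m1",
]

features = group_entries_by_id(cds_lines)
segments = features["chr8.g1.m1.cds"]
cds_length = sum(end - start + 1 for _, start, end in segments)
```

With the corrected file, `features` holds a single coding sequence made of four segments totalling 528 nucleotides. Run against the incorrect version, the same code would report four separate one-segment features, which is one simple way to spot the problem.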
Perhaps the confusion is due in part to our habit of using the term “feature” to refer to a single line of a GFF3 file (the GFF3 spec does this, and I am sometimes guilty of it as well). While there are certainly some features that can be, and frequently are, represented by a single line, many cannot. Coding sequences as just described are one example, but even for mRNAs and genes, a single line describing just the start and end coordinates would be considered incomplete for most analysis purposes. The complete description of a gene feature is, collectively, the line describing it and all of the lines describing its subfeatures. I suggest that a lot of confusion could be avoided in the community if we restricted use of the term “feature” to complete feature descriptions (however many lines may be involved) and used an alternative term (such as “entry”) for individual lines in a GFF3 file, which may or may not represent a complete feature.
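The entry/feature distinction can be sketched in code as well. The snippet below is only an illustration under my own assumptions: each entry is a plain dict with hypothetical keys `id`, `parents`, and `type`, and `collect_feature` is a made-up helper that gathers an entry together with every entry beneath it in the Parent hierarchy.

```python
from collections import defaultdict, deque


def collect_feature(entries, root_id):
    """Return every entry belonging to the feature rooted at root_id.

    That means the entries carrying root_id itself plus all entries
    reachable from it through Parent links.
    """
    # Index entries by the IDs of their parents.
    children = defaultdict(list)
    for index, entry in enumerate(entries):
        for parent in entry["parents"]:
            children[parent].append(index)

    collected = {i for i, e in enumerate(entries) if e["id"] == root_id}
    visited = {root_id}
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for index in children.get(current, []):
            collected.add(index)
            child_id = entries[index]["id"]
            if child_id is not None and child_id not in visited:
                visited.add(child_id)
                queue.append(child_id)
    # Preserve the original file order of the entries.
    return [entries[i] for i in sorted(collected)]


# Entries mirroring the corrected chr8.g1 example: 1 gene, 1 mRNA,
# 4 exons, and 4 CDS entries that share one ID.
entries = [
    {"id": "chr8.g1", "parents": [], "type": "gene"},
    {"id": "chr8.g1.m1", "parents": ["chr8.g1"], "type": "mRNA"},
]
for n in range(1, 5):
    entries.append({"id": f"chr8.g1.m1.exon{n}", "parents": ["chr8.g1.m1"], "type": "exon"})
for _ in range(4):
    entries.append({"id": "chr8.g1.m1.cds", "parents": ["chr8.g1.m1"], "type": "CDS"})

gene_feature = collect_feature(entries, "chr8.g1")
cds_feature = collect_feature(entries, "chr8.g1.m1.cds")
```

In this sketch the gene "feature" is all ten entries, while the coding sequence "feature" is the four entries that share its ID. Neither is a single line.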
The second concept relates to the assignment of multiple parents to a feature. The GFF3 spec explicitly allows this, although I don’t see it used frequently. By way of example, consider the case of alternative splicing, in which exons that are shared by different isoforms are commonly repeated in the GFF3 once per isoform. While I’m not sure this is necessarily wrong (and many tools support and/or expect this), including each exon once and assigning it to all associated mRNA features is definitely the canonical convention. It would be interesting to see what percentage of bioinformatics tools could handle the canonical case correctly.
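For tool authors, supporting multiple parents mostly comes down to splitting the Parent attribute on commas, which is how the spec separates multiple values within an attribute. Here is a rough Python sketch. The helper names and the three exon records are hypothetical, invented purely for illustration (they are not from a real annotation), and percent-decoding is again omitted.

```python
from collections import defaultdict


def parents_of(column9):
    """Return the list of parent IDs from a GFF3 attribute column."""
    for pair in column9.strip().split(";"):
        key, _, value = pair.partition("=")
        if key == "Parent":
            return value.split(",")  # multiple parents are comma-separated
    return []


def exons_per_mrna(lines):
    """Map each mRNA ID to the (start, end) spans of its exons."""
    exons = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        span = (int(fields[3]), int(fields[4]))
        # A shared exon is listed once but attached to every parent.
        for parent in parents_of(fields[8]):
            exons[parent].append(span)
    return dict(exons)


# Hypothetical gene with two isoforms that share their first exon.
example = [
    "chr1\tdemo\texon\t100\t200\t.\t+\t.\tID=g1.e1;Parent=g1.m1,g1.m2",
    "chr1\tdemo\texon\t300\t400\t.\t+\t.\tID=g1.e2;Parent=g1.m1",
    "chr1\tdemo\texon\t500\t600\t.\t+\t.\tID=g1.e3;Parent=g1.m2",
]

by_mrna = exons_per_mrna(example)
```

Here the shared exon appears on one line but ends up in the exon lists of both `g1.m1` and `g1.m2`. A tool that treats the Parent value as a single opaque string would instead file it under a nonexistent parent named `g1.m1,g1.m2`.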
What are your experiences working with GFF3? Is there something about GFF3 that you wish everyone else would know or do (or stop doing)?