Out of the many plain text tab-delimited formats for storing genome annotations, the GFF3 format probably has the clearest specification, the least ambiguity, and the most flexibility. One of GFF3’s biggest problems, however, is that it builds on older, similar formats that are ambiguous and weakly defined. The result is that many people create and use data files that they think are in GFF3 format, but that in reality are in some hybrid format.
One thing that always bothered me was how coding sequences are handled. In GFF3 files, a separate line is required for each CDS segment, and yet the feature type “CDS” is almost always used (as far as I’ve seen). The GFF3 spec states that feature types should adhere to the Sequence Ontology, and the SO definition for CDS relates to the entire coding sequence, not to the coding segments that are often interspersed with introns. Shouldn’t they be using “CDS_fragment” to refer to these disjoint regions of the coding sequence?
I reviewed parts of the GFF3 spec in depth yesterday and found something I had not noticed before. I had always considered each separate line in a GFF3 file as a distinct feature. Indeed that is typically the case, but the spec does allow for a single feature to be defined on multiple lines. Any lines that share the same ID attribute correspond to the same feature. This is GFF3’s mechanism for defining disjoint features. So the fact that the “CDS” SO term is used is not a problem, assuming that all of the lines corresponding to a single CDS share the same “ID” attribute.
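To make the idea concrete, here is a minimal Python sketch of how a parser might collect lines that share an ID into a single feature. It is only an illustration under some assumptions of mine: the function name group_features_by_id, the tuple layout, and the placeholder key used for lines without an ID are all hypothetical, and the sketch ignores directives such as ### and any embedded FASTA section.

```python
from collections import defaultdict
from urllib.parse import unquote


def group_features_by_id(gff3_text):
    """Group GFF3 lines into features keyed by their ID attribute.

    Hypothetical helper for illustration; not taken from any GFF3 library.
    """
    features = defaultdict(list)
    for line_number, line in enumerate(gff3_text.splitlines(), start=1):
        # Skip blank lines, comments, and directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            # Not a feature line (for example, embedded sequence data).
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields

        # Column 9 holds semicolon-separated key=value pairs.
        attributes = {}
        for pair in attrs.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key.strip()] = unquote(value.strip())

        # Assumption: a line with no ID is its own single-line feature,
        # so give it a placeholder key derived from its line number.
        feature_id = attributes.get("ID", f"_line{line_number}")
        features[feature_id].append((seqid, ftype, int(start), int(end), strand))
    return dict(features)


example = "\n".join([
    "##gff-version 3",
    "chr1\t.\tgene\t1000\t5000\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\t.\tCDS\t1200\t1499\t.\t+\t0\tID=cds1;Parent=mrna1",
    "chr1\t.\tCDS\t3000\t3902\t.\t+\t0\tID=cds1;Parent=mrna1",
])

features = group_features_by_id(example)
for feature_id, segments in features.items():
    print(feature_id, segments)
```

In this made-up example the two CDS lines share ID=cds1, so they end up as two segments of one feature rather than as two separate features, while the gene and the mRNA each have a single segment.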
I don’t know how or why I missed this before. The GFF3 spec was revised as recently as 7 months ago, so perhaps this is a recent addition to the official spec. Regardless, this makes much more sense to me now and seems much more consistent. I just need to make sure the software I am writing to process GFF3 handles multi-line features correctly.