Category: Formats

VAnG: schema-based validation of genome annotations

I am very excited to attend Cold Spring Harbor Lab’s 2013 Genome Informatics conference. I attended in 2011, and it is by far the best meeting I’ve attended as a graduate student. I’m hoping that with 2 additional years under my belt, it will be that much more enriching. And with an organizer/session chair lineup that includes 3 of my top picks for potential postdoc advisers (as well as other great scientists I know by reputation), there is a lot of potential for networking!

For a while, my plan has been to use this conference as an opportunity to present my work on annotation validation. I’ve written about this topic previously, and I felt like this would give me the chance to actually implement my ideas.

As it turns out, though, I ended up submitting an abstract instead for mRNAmarkup, a tool our group is developing for quality control and annotation of de novo assembled transcriptomes. However, I spent a lot of time brainstorming the validation work and clarifying my language on the topic, and I would hate for that to go to waste. So here is a rough draft of the abstract I had originally planned to submit.

Analyses of genomic sequences rely extensively on annotation of genomic features such as genes and transposable elements within those sequences. The Sequence Ontology provides a structured controlled vocabulary for describing genomic features, and the GFF3 Specification provides a standardized syntax for encoding those features and their subcomponents as an annotation graph. A common issue with the dissemination and use of genome annotations arises from the fact that alternative representations exist for the same genomic features, each implicitly encoding the same information but using a different subset of ontological terms. The persistence of alternative formatting conventions highlights two related needs: a mechanism for formally describing an explicit annotation structure, and a mechanism for validating an annotation file against a particular structure description. To address these needs we have drawn parallels to XML-related technologies and developed a schema-based approach for validating genome annotations. Plain-text schema files describe representations of annotation graphs in terms of node connectivity, facilitating the validation of annotation data. We present VAnG, a tool for validating genome annotations, and discuss the implications of this tool for the dissemination and consumption of genome annotations.

There are two spots where I feel this draft still needs some work. First, the sentence beginning with “The persistence…” includes some vague language that should be improved. Second, the phrase “facilitating the validation of annotation data” adds very little value to the abstract and should be replaced with something more informative.

I don’t think it will take me too long to implement VAnG once I have some time to dedicate to it. I have already implemented the schema format and corresponding data structures for parsing schemas as part of the AEGeAn Toolkit. When it comes time to publish this work, I hope this abstract will provide a good starting point for communicating the need for this tool and the benefit it provides.
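To make the idea of describing an annotation graph "in terms of node connectivity" a bit more concrete, here is a minimal Python sketch. It is not VAnG, and it does not use the AEGeAn schema format: the `SCHEMA` dictionary, the `find_violations` function, and the tuple layout of the input are all assumptions invented for this illustration.

```python
# Hypothetical schema (not the AEGeAn format): each feature type maps to the
# set of parent types it may attach to. None means the type may be a root.
SCHEMA = {
    "gene": {None},
    "mRNA": {"gene"},
    "exon": {"mRNA"},
    "CDS": {"mRNA"},
}


def find_violations(entries, schema=SCHEMA):
    """Return messages describing edges that the schema does not allow.

    `entries` is a list of (feature_id, feature_type, parent_ids) tuples,
    where parent_ids is a (possibly empty) list of IDs.
    """
    type_of = {fid: ftype for fid, ftype, _ in entries}
    problems = []
    for fid, ftype, parents in entries:
        if ftype not in schema:
            problems.append(f"{fid}: type '{ftype}' is not in the schema")
            continue
        allowed = schema[ftype]
        if not parents:
            # A feature without parents must be allowed as a root node.
            if None not in allowed:
                problems.append(f"{fid}: '{ftype}' requires a parent")
            continue
        for pid in parents:
            ptype = type_of.get(pid)
            if ptype is None:
                problems.append(f"{fid}: parent '{pid}' is not defined")
            elif ptype not in allowed:
                problems.append(
                    f"{fid}: '{ftype}' may not be a child of '{ptype}'"
                )
    return problems


if __name__ == "__main__":
    annotation = [
        ("g1", "gene", []),
        ("g1.t1", "mRNA", ["g1"]),
        ("g1.t1.cds", "CDS", ["g1"]),  # CDS attached directly to the gene
    ]
    for message in find_violations(annotation):
        print(message)
```

Under this toy schema, an annotation that attaches CDS entries directly to a gene is flagged, while a different schema file could declare that same representation valid. That is the sense in which a schema pins down one explicit annotation structure among several alternatives.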

GFF3 101: multi-line features and multiple parents

Although GFF3 is no doubt the richest and most flexible of the popular tab-delimited text-based genome annotation formats, it unfortunately comes with some baggage. Some of this has to do with the fact that GFF3 looks a lot like a variety of other tab-delimited formats that have more permissive or less flexible formatting rules/conventions (indeed, it was the limitations of these formats that led to the GFF3 specification). Some of this has to do with the fact that most scientists learn what they know about GFF3, at least initially, from examples rather than from the specification, which can be problematic if someone’s first exposure to GFF3 is an incorrect file. Some of this has to do with the fact that while there are several available tools that will validate the syntax of your GFF3 file, there are no generalized tools for checking the content of your GFF3 file (see this previous post for a related discussion).

Despite this “baggage,” I still think GFF3 is superior to its tab-based alternatives. There are a couple of concepts, however, that seemingly are not widely understood and have not been uniformly adopted (based on my experience with different data sources and tools). I have a couple of selfish reasons for posting this thread: first, to clarify my own thoughts on the matter, and second, so that maybe some people will learn something by reading this and make my life easier in the future! But really, a better and broader understanding of these points would benefit the whole genome informatics community.

The first concept that needs work is that of multi-line features. Here is an illustrative example (you may have to scroll over to see the 9th column).

chr8    CpGAT   gene    72      5081    .       -       .       ID=chr8.g1;Name=chr8.g1
chr8    CpGAT   mRNA    72      5081    .       -       .       ID=chr8.g1.m1;Parent=chr8.g1;index=1;Name=chr8.g1.t1
chr8    CpGAT   exon    72      167     .       -       .       ID=chr8.g1.m1.exon1;Parent=chr8.g1.m1
chr8    CpGAT   exon    349     522     .       -       .       ID=chr8.g1.m1.exon2;Parent=chr8.g1.m1
chr8    CpGAT   exon    611     702     .       -       .       ID=chr8.g1.m1.exon3;Parent=chr8.g1.m1
chr8    CpGAT   exon    4916    5081    .       -       .       ID=chr8.g1.m1.exon4;Parent=chr8.g1.m1
chr8    CpGAT   CDS     72      167     .       -       0       ID=chr8.g1.m1.cds1;Parent=chr8.g1.m1
chr8    CpGAT   CDS     349     522     .       -       0       ID=chr8.g1.m1.cds2;Parent=chr8.g1.m1
chr8    CpGAT   CDS     611     702     .       -       2       ID=chr8.g1.m1.cds3;Parent=chr8.g1.m1
chr8    CpGAT   CDS     4916    5081    .       -       0       ID=chr8.g1.m1.cds4;Parent=chr8.g1.m1

So what’s the matter with this example? Not much really, just that it’s incorrect. It essentially annotates four distinct coding sequences (check the ID attributes) corresponding to a single transcript—not unheard of with prokaryotic polycistrons, but I see this all the time in files annotating eukaryotic genomes. No, this mRNA does not contain four coding sequences, but a single coding sequence consisting of four segments that are discontinuous along the genome. The correct way to encode discontinuous features is to give each corresponding line the same ID attribute, like so.

chr8    CpGAT   gene    72      5081    .       -       .       ID=chr8.g1;Name=chr8.g1
chr8    CpGAT   mRNA    72      5081    .       -       .       ID=chr8.g1.m1;Parent=chr8.g1;index=1;Name=chr8.g1.t1
chr8    CpGAT   exon    72      167     .       -       .       ID=chr8.g1.m1.exon1;Parent=chr8.g1.m1
chr8    CpGAT   exon    349     522     .       -       .       ID=chr8.g1.m1.exon2;Parent=chr8.g1.m1
chr8    CpGAT   exon    611     702     .       -       .       ID=chr8.g1.m1.exon3;Parent=chr8.g1.m1
chr8    CpGAT   exon    4916    5081    .       -       .       ID=chr8.g1.m1.exon4;Parent=chr8.g1.m1
chr8    CpGAT   CDS     72      167     .       -       0       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8    CpGAT   CDS     349     522     .       -       0       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8    CpGAT   CDS     611     702     .       -       2       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8    CpGAT   CDS     4916    5081    .       -       0       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1

Now, because the four CDS lines share the same ID attribute, they collectively represent a single feature.
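One way to see the difference in practice is to group the lines of a GFF3 file by their ID attribute and count how many lines each feature spans. The following is a minimal Python sketch under a few assumptions: the input is strictly tab-delimited with 9 columns, and the helper names `parse_attributes` and `group_by_id` are made up for this example rather than taken from any existing library.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return attributes


def group_by_id(gff3_lines):
    """Map each ID to the list of (type, start, end) lines that share it."""
    features = defaultdict(list)
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ftype, start, end = fields[2], int(fields[3]), int(fields[4])
        feature_id = parse_attributes(fields[8]).get("ID")
        if feature_id is not None:
            features[feature_id].append((ftype, start, end))
    return dict(features)


if __name__ == "__main__":
    # The four CDS segments from the corrected example, sharing one ID.
    segments = [(72, 167), (349, 522), (611, 702), (4916, 5081)]
    lines = [
        "\t".join(["chr8", "CpGAT", "CDS", str(start), str(end), ".", "-",
                   ".", "ID=chr8.g1.m1.cds;Parent=chr8.g1.m1"])
        for start, end in segments
    ]
    for feature_id, entries in group_by_id(lines).items():
        print(feature_id, "spans", len(entries), "line(s)")
```

Run against the first (incorrect) example, the same grouping would report four separate CDS features of one line each, which is exactly the misreading the shared ID avoids.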

Perhaps the confusion is due in part to our use of the term “feature” to refer to a single line of a GFF3 file (the GFF3 spec does this, and I am sometimes guilty of it as well). While there are certainly some features that can be, and frequently are, represented by a single line, many are not. Coding sequences as just described are one example, but even for mRNAs and genes, a single line describing just the start and end coordinates would be considered incomplete for most analysis purposes. The complete description of a gene feature is, collectively, the line describing it and all of the lines describing its subfeatures. I suggest that a lot of confusion could be avoided in the community if we restricted use of the term “feature” to the complete feature description (however many lines may be involved), and used an alternative term (such as “entry”) to refer to individual lines in a GFF3 file, which may or may not represent a complete feature.

The second concept relates to the assignment of multiple parents to a feature. The GFF3 spec explicitly allows this, although I don’t see it used frequently. By way of example, consider the case of alternative splicing, in which exons that are shared by different isoforms are commonly repeated in the GFF3 once per isoform. While I’m not sure this is necessarily wrong (and many tools support and/or expect this), including each exon once and assigning it to all associated mRNA features is definitely the canonical convention. It would be interesting to see what percentage of bioinformatics tools could handle the canonical case correctly.
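In GFF3, multiple parents are written as a comma-separated list in the Parent attribute (for example, `Parent=m1,m2`). Here is a minimal Python sketch of how a tool might honor that when building parent/child relationships. The function names and the two-isoform example data are hypothetical, and the input is assumed to be strictly tab-delimited.

```python
from collections import defaultdict


def parents_of(column9):
    """Return the list of parent IDs named in a GFF3 attribute column."""
    for pair in column9.strip().split(";"):
        key, _, value = pair.partition("=")
        if key == "Parent" and value:
            return value.split(",")
    return []


def children_by_parent(gff3_lines):
    """Map each parent ID to labels for the lines that name it as a parent."""
    children = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        label = f"{fields[2]}:{fields[3]}-{fields[4]}"
        # A line with several parents is attached to every one of them.
        for parent in parents_of(fields[8]):
            children[parent].append(label)
    return dict(children)


if __name__ == "__main__":
    # Hypothetical gene with two isoforms (m1, m2) that share their first exon.
    lines = [
        "chr\tdemo\texon\t100\t200\t.\t+\t.\tID=e1;Parent=m1,m2",
        "chr\tdemo\texon\t300\t400\t.\t+\t.\tID=e2;Parent=m1",
        "chr\tdemo\texon\t500\t600\t.\t+\t.\tID=e3;Parent=m2",
    ]
    for parent, kids in children_by_parent(lines).items():
        print(parent, kids)
```

A parser that treats the Parent value as a single opaque string would instead look for an mRNA literally named `m1,m2`, which is one plausible way for a tool to mishandle the canonical case.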

What are your experiences working with GFF3? Is there something with regard to GFF3 that you wish everyone else would know or do (or stop doing)?

Gene model vectors

Genome sequences in Fasta format are strings of As, Cs, Gs, and Ts representing the sequence of nitrogenous bases along the chromosome. I’ve been working on a similar format recently for encoding gene structure along the chromosomes. A ‘G’ represents a nucleotide in an intergenic region, a ‘C’ represents a nucleotide in a coding region within a gene, an ‘I’ represents a nucleotide in an intron, and ‘F’ and ‘T’ represent nucleotides in 5′ and 3′ UTRs of a gene (respectively). Encoding a genomic sequence in this way doesn’t tell you anything about the nitrogenous bases at each position, but if you’re only interested in investigating gene structure, then this format can be quite handy. I’m calling these strings “model vectors” (vectors representing gene models) to differentiate them from standard Fasta sequences, but standard tools from any bioinformatics library (BioPerl, Biopython, etc.) shouldn’t have any problem processing data in this format.

For the sake of simplicity, I will offer a small, and consequently unrealistic, example. The model vector TTTTTCCIICCCCIIICFFFFFFFF would be annotated something like this in GFF3 format.

##gff-version 3
chr	vim	gene	1	25	.	-	.	ID=g1
chr	vim	mRNA	1	25	.	-	.	ID=g1.t1;Parent=g1
chr	vim	three_prime_UTR	1	5	.	-	.	ID=g1.t1.utr1;Parent=g1.t1
chr	vim	CDS	6	7	.	-	.	ID=g1.t1.cds;Parent=g1.t1
chr	vim	CDS	10	13	.	-	.	ID=g1.t1.cds;Parent=g1.t1
chr	vim	CDS	17	17	.	-	.	ID=g1.t1.cds;Parent=g1.t1
chr	vim	five_prime_UTR	18	25	.	-	.	ID=g1.t1.utr2;Parent=g1.t1
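The correspondence between a model vector and GFF3-style coordinates can be sketched in a few lines of Python. This is only an illustration, not the code I use: the function name `vector_to_segments` and the `SYMBOL_LABELS` table are made up here, and coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
from itertools import groupby

# Labels for the model vector alphabet described above.
SYMBOL_LABELS = {
    "G": "intergenic",
    "C": "CDS",
    "I": "intron",
    "F": "five_prime_UTR",
    "T": "three_prime_UTR",
}


def vector_to_segments(vector):
    """Collapse a model vector into (symbol, start, end) runs.

    Coordinates are 1-based and inclusive.
    """
    segments = []
    position = 1
    for symbol, run in groupby(vector):
        length = len(list(run))
        segments.append((symbol, position, position + length - 1))
        position += length
    return segments


if __name__ == "__main__":
    for symbol, start, end in vector_to_segments("TTTTTCCIICCCCIIICFFFFFFFF"):
        print(SYMBOL_LABELS[symbol], start, end)
```

For the example vector, the runs come out as 3′ UTR 1-5, CDS 6-7, intron 8-9, CDS 10-13, intron 14-16, CDS 17-17, and 5′ UTR 18-25, matching the coordinates in the GFF3 above. Introns are implicit in the GFF3, and the 3′ UTR coming first along the sequence is what indicates that the gene is on the reverse strand.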