While a string of As, Cs, Gs, and Ts is a gross simplification of the DNA molecule’s chemical complexity, knowing which nucleotides occur in which sequence is sufficient for many (if not most) applications in genomics and bioinformatics. Additionally, representing DNA sequences as character strings simplifies the process of implementing algorithms for sequence analysis. Thus the Fasta format has become nearly ubiquitous in the life sciences.
There is far less consensus, however, when it comes to formats for genome annotations. This area is dominated mostly by tab-delimited plain text formats such as GFF3, BED, and GTF. These formats are very flexible and relatively simple to parse using command-line tools. Unfortunately, this flexibility comes with several limitations, one of which is the fact that it is not always straightforward to compare two sets of annotations stored in these formats. If genome annotations could be represented by a single character string, however, then comparison and analysis of annotations becomes much more simple.
This is the basic idea behind gene model vectors, a concept I discussed previously. Where the Fasta format uses characters to classify each nucleotide in terms of nitrogenous base content, a gene model vector uses characters to classify each nucleotide in terms of gene structure (i.e., whether the nucleotide is part of an intron, the coding sequence, untranslated region, etc). This representation makes analysis and comparison of gene structure annotations much simpler.
However, there is a limitation to this approach as well. For a locus that contains an alternatively spliced gene or multiple overlapping gene models, any given nucleotide may be associated with multiple transcripts and will thus have more than one annotation. In this case, a single model vector is not sufficient to represent all annotated information at that locus (see my previous post on locus annotation). Creating a single model vector for every single transcript may not make sense though, since non-overlapping transcripts should be represented simultaneously in the same model vector when possible. Basically, we want to represent transcript annotations for the locus using the fewest possible model vectors, packing as much information as we can into each vector without overlapping transcripts.
After a bit of research, I found that this essentially reduces to a graph theory problem. First we treat each transcript annotation as a node in a graph, and then we add an edge between two nodes if they do not overlap. We want to group the transcripts into the smallest number of groups such that no transcripts in a group overlap and all transcripts are part of at least one group. This is equivalent to enumerating all maximal cliques in the transcript graph just described (in graph theory, this is called the maximal clique enumeration problem). If we can find all of the maximal cliques in the graph, then we can represent each clique with a single model vector and the number of cliques has been minimized.
Unfortunately, the maximal clique enumeration problem is NP-complete, meaning that there is no efficient algorithm for solving it. This shouldn’t be too much of a concern, however, since overlapping gene models are pretty rare and the number of splice variants for alternatively spliced genes is typically low. The standard Bron-Kerbosch algorithm, with an exponential runtime in the worst case, is more than sufficient.
When comparing two sets of gene annotations, I use this algorithm in the following workflow.
- First, I determine the position of each locus (see my previous post on locus annotation).
- Next, for each locus, I consider the two sets of gene anotations separately and use the Bron-Kerbosch algorithm to enumerate all maximal transcript cliques for each set.
- Then, I generate a model vector for each maximal clique using the transcript(s) belonging to that clique.
- Finally, I compare each each model vector from the first set with each model vector from the second set to calculate comparison statistics for the locus.