Category: Annotation

VAnG: schema-based validation of genome annotations

I am very excited to attend Cold Spring Harbor Lab’s 2013 Genome Informatics conference. I attended in 2011, and it is by far the best meeting I’ve attended as a graduate student. I’m hoping with 2 additional years under my belt, it will be that much more enriching. And with an organizer/session chair lineup that includes 3 of my top pics for potential postdoc advisers (as well as other great scientists I know by reputation), there is a lot of potential for networking!

For a while, my plan has been to use this conference as an opportunity to present my work on annotation validation. I’ve written about this topic previously, and I felt like this would give me the chance to actually implement my ideas.

As it turns out, though, I ended up submitting an abstract instead for mRNAmarkup, a tool our group is developing for quality control and annotation of de novo assembled transcriptomes. However, I spent a lot of time brainstorming the validation work and clarifying my language on the topic, and I would hate for that to go to waste. So here is a rough draft of the abstract I had originally planned to submit.

Analyses of genomic sequences rely extensively on annotation of genomic features such as genes and transposable elements within those sequences. The Sequence Ontology provides a structured controlled vocabulary for describing genomic features, and the GFF3 Specification provides a standardized syntax for encoding those features and their subcomponents as an annotation graph. A common issue with the dissemination and use of genome annotations arises from the fact that different alternative representations exist for the same genomic features, implicitly encoding the same information but utilizing a different subset of ontological terms in its representation. The persistence of alternative formatting conventions highlights two related needs: a mechanism for formally describing an explicit annotation structure, and a mechanism for validating an annotation file against a particular structure description. To address this need we have drawn parallels to XML-related technologies and developed a schema-based approach for validating genome annotations. Plain-text schema files describe representations of annotation graphs in terms of node connectivity, facilitating the validation of annotation data. We present VAnG, a tool for validating genome annotations, and discuss the implications of this tool for the dissemination and consumption of genome annotations.

There are two spots where I feel this draft still needs some work. First, the sentence beginning with “The persistence…” includes some vague language that should be improved. Second, the phrase “facilitating the validation of annotation data” adds very little value to the abstract and should be replaced with something more informative.

I don’t think it will take me too long to implement VAnG once I have some time to dedicate to it. I have already implemented the schema format and corresponding data structures for parsing schemas as part of the AEGeAn Toolkit. When it comes time to publishing this work, I hope this abstract will provide a good starting point to communicating the need for this tool and the benefit it provides.

GFF3 101: multi-line features and multiple parents

Although GFF3 is no doubt the richest and most flexible of the popular tab-delimited text-based genome annotation formats, it unfortunately comes with some baggage. Some of this has to do with the fact that GFF3 looks a lot like a variety of other tab-delimited formats that have more permissive or less flexible formatting rules/conventions (indeed, it was the limitations of these formats that led to the GFF3 specification). Some of this has to do with the fact that most scientists learn what they know about GFF3, at least initially, from examples rather than from the specification, which can be problematic if someone’s first exposure to GFF3 is an incorrect file. Some of this has to do with the fact that while there are several available tools that will validate the syntax of your GFF3 file, there are no generalized tools for checking the content of your GFF3 file (see this previous post for a related discussion).

Despite this “baggage,” I still think GFF3 is superior to its tab-based alternatives. There are a couple of concepts, however, that seemingly are not widely understood and have not been uniformly adopted (based on my experience with different data sources and tools). I have a couple of selfish reasons for posting this thread: first, to clarify my own thoughts on the matter, and second, so that maybe some people will learn something by reading this and make my life easier in the future! But really, a better and broader understanding of these points would benefit the whole genome informatics community.

The first concept that needs work is that of multi-line features. Here is an illustrative example (you may have to scroll over to see the 9th column).

chr8    CpGAT   gene    72      5081    .       -       .       ID=chr8.g1;Name=chr8.g1
chr8    CpGAT   mRNA    72      5081    .       -       .       ID=chr8.g1.m1;Parent=chr8.g1;index=1;Name=chr8.g1.t1
chr8    CpGAT   exon    72      167     .       -       .       ID=chr8.g1.m1.exon1;Parent=chr8.g1.m1
chr8    CpGAT   exon    349     522     .       -       .       ID=chr8.g1.m1.exon2;Parent=chr8.g1.m1
chr8    CpGAT   exon    611     702     .       -       .       ID=chr8.g1.m1.exon3;Parent=chr8.g1.m1
chr8    CpGAT   exon    4916    5081    .       -       .       ID=chr8.g1.m1.exon4;Parent=chr8.g1.m1
chr8    CpGAT   CDS     72      167     .       -       0       ID=chr8.g1.m1.cds1;Parent=chr8.g1.m1
chr8    CpGAT   CDS     349     522     .       -       0       ID=chr8.g1.m1.cds2;Parent=chr8.g1.m1
chr8    CpGAT   CDS     611     702     .       -       2       ID=chr8.g1.m1.cds3;Parent=chr8.g1.m1
chr8    CpGAT   CDS     4916    5081    .       -       0       ID=chr8.g1.m1.cds4;Parent=chr8.g1.m1

So what’s the matter with this example? Not much really, just that it’s incorrect. It essentially annotates four distinct coding sequences (check the ID attributes) corresponding to a single transcript—not unheard of with prokaryotic polycistrons, but I see this all the time in files annotating eukaryotic genomes. No, this mRNA does not contain four coding sequences, but a single coding sequence consisting of four segments that are discontinuous along the genome. The correct way to encode discontinuous features is to give each corresponding line the same ID attribute, like so.

chr8    CpGAT   gene    72      5081    .       -       .       ID=chr8.g1;Name=chr8.g1
chr8    CpGAT   mRNA    72      5081    .       -       .       ID=chr8.g1.m1;Parent=chr8.g1;index=1;Name=chr8.g1.t1
chr8    CpGAT   exon    72      167     .       -       .       ID=chr8.g1.m1.exon1;Parent=chr8.g1.m1
chr8    CpGAT   exon    349     522     .       -       .       ID=chr8.g1.m1.exon2;Parent=chr8.g1.m1
chr8    CpGAT   exon    611     702     .       -       .       ID=chr8.g1.m1.exon3;Parent=chr8.g1.m1
chr8    CpGAT   exon    4916    5081    .       -       .       ID=chr8.g1.m1.exon4;Parent=chr8.g1.m1
chr8    CpGAT   CDS     72      167     .       -       0       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8    CpGAT   CDS     349     522     .       -       0       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8    CpGAT   CDS     611     702     .       -       2       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1
chr8    CpGAT   CDS     4916    5081    .       -       0       ID=chr8.g1.m1.cds;Parent=chr8.g1.m1

Now, because the four CDS lines share the same attribute, they collectively represent a single feature.

Perhaps the confusion is due in part to when we use the term “feature” to refer to a single line of a GFF3 file—the GFF3 spec does this, and I am sometimes guilty of it as well. While there are certainly some features that can and frequently are represented by a single line, many are not. Coding sequences as just described are one example, but even for mRNAs and genes, a single line describing just the start and end coordinates would be considered incomplete for most analysis purposes. The complete description of a gene feature is, collectively, the line describing it and all of the lines describing its subfeatures. I suggest that a lot of confusion could be avoided in the community if we restricted use of the term “feature” to referencing the complete feature descriptions (however many lines may be involved), and use an alternative term (such as “entry”) to refer to individual lines in a GFF3 file, which may or may not represent a complete feature.

The second concept relates to the assignment of multiple parents to a feature. The GFF3 spec explicitly allows this, although I don’t see it used frequently. By way of example, consider the case of alternative splicing, in which exons that are shared by different isoforms are commonly repeated in the GFF3 once per isoform. While I’m not sure this is necessarily wrong (and many tools support and/or expect this), including each exon once and assigning it to all associated mRNA features is definitely the canonical convention. It would be interesting to see what percentage of bioinformatics tools could handle the canonical case correctly.

What are your experiences working with GFF3? Is there something with regards to GFF3 that you wish everyone else would know or do (or stop doing)?

Validating sequence annotations

Anyone possessing more than a passing familiarity with gene annotation will understand the frustration I frequently experience when trying to analyze or compare GFF3 files obtained from multiple sources. While the syntactical rules established by the GFF3 specification are clear and consistently observed, there is an enormous amount of flexibility possible when deciding which exact terms and relationships to utilize when formatting data with GFF3. For example, consider the following gene structure annotation.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   exon    22053   22550   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22167   22550   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22651   23022   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   exon    22651   23448   .       +       .       Parent=chr8.g3.t1

This annotation makes it clear that we have a gene with a single transcription product. There are 2 exons, and although UTRs are not explicitly defined, they can be inferred from the exonic coordinates that do not overlap with the specified coding sequence. Now consider an alternative representation.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   five_prime_UTR  22053   22166   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22167   22550   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   CDS     22651   23022   .       +       0       Parent=chr8.g3.t1
chr8    CpGAT   three_prime_UTR 23023   23448   .       +       .       Parent=chr8.g3.t1

In this representation, the UTRs have been explicitly defined, and although the exons have not, their coordinates can be inferred from the UTRs and coding sequence. Of course, there are additional alternative representations.

chr8    CpGAT   gene    22053   23448   .       +       .       ID=chr8.g3;Name=chr8.g3
chr8    CpGAT   mRNA    22053   23448   .       +       .       ID=chr8.g3.t1;Parent=chr8.g3;index=1;Name=chr8.g3.t1
chr8    CpGAT   exon    22053   22550   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   start_codon     22167   22169   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   exon    22651   23448   .       +       .       Parent=chr8.g3.t1
chr8    CpGAT   stop_codon      23020   23022   .       +       .       Parent=chr8.g3.t1

So which of these representations is correct? As far as the GFF3 spec is concerned, they all are! Depending on the exact data you are interested in extracting, one of the representations may be more convenient than the others, but they all encode the same information.

One benefit of the GFF3 format is that it leverages the Sequence Ontology, which provides a very detailed and comprehensive description of entities within the realm of biological sequences and the semantics of the relationships between these entities. For example, the relationship between an mRNA and its associated coding sequence is captured in the SO by three terms (nodes) and two relationships (edges): CDS is_a mRNA_region and mRNA_region part_of mRNA. In the GFF3 samples above, the association of each mRNA feature with its corresponding CDS feature is valid, as the graph representing the ontology includes a path from the term CDS to the term mRNA.

What seems to be lacking here is a mechanism for enforcing additional constraints on terms and relationships for specific contexts or use cases. To continue with the previous example, the SO can validate the relationship between CDS and mRNA, but it is completely silent as to whether an mRNA feature is valid without also explicitly defining its corresponding coding sequence, or whether a CDS feature is valid on its own without associating it with an mRNA feature. While I think the ability to encode so many types of data using GFF3+SO is one of its strengths, the lack of a mechanism for this additional layer of validation is a weakness, and the cause of a lot of the frustrations I’m documenting here.

In a previous life I worked quite extensively with XML data, and I can draw some pretty useful parallels from that experience to my experience with gene annotation data. XML can be used to store data for anything from financial records to cooking recipes to multimedia metadata (and genome annotations of course!). As one might expect, however, an XML document containing financial records will by necessity look quite different from one used to store recipes. The XML specification dictates the syntax that XML files must use, but there is an infinite amount of flexibility with regards to the data types used and the structure and organization of the data.

Declaring an XML document “correct” requires two things: verifying that the data are well-formed, and verifying that the data are valid. Verifying that an XML document is well-formed is as simple as checking it against the official XML specification and ensuring that it obeys all of the syntax rules defined therein. Indeed, most software used to process XML data have well-formedness check built into the XML parser and will fail gracefully if the document is not well formed. However, verifying that an XML file is valid requires checking it against a schema that defines which data types are valid and how the data should be structured. XML schemas are very context-specific and cannot be found in the main specification. Rather, they are typically produced by individual communities of practice for use within that community. Using this mechanism, bank IT personnel do not need to worry about what makes an XML-encoded recipe valid, and recipe bloggers do not need to worry about what makes XML-encoded financial records valid.

A mechanism analogous to validating XML files via a schema, applied to sequence annotations, could mitigate a lot of the difficulty and frustration associated with sharing annotation data. The flexibility provided by GFF3 is definitely a good thing–surely it’s better than coming up with new half-baked formats for each different type of data as we continue to find new useful ways to annotate sequences. But there needs to be a way to formally place additional constraints on a GFF3 file for use in a particular context. Rather than trying to anticipate every input contingency under the sun (which has recently become my approach), developers of analysis tools could instead provide a schema with which a user could validate their input data. Scientists interested only in TE annotations could safely ignore the formatting conventions that others are using to annotate protein-coding genes or epigenetic marks or a variety of other data types. Of course, solving the technical issue of developing such a validation scheme (or leveraging an existing one) is quite different from achieving wide adoption within the community, but until we address this issue we can continue to expect the same types of inconsistency and frustration with annotation data.

Command-line magic for your gene annotations

In the GFF3 specification, Lincoln Stein claims that the primary reason tab-delimited annotation formats have persisted for so many years (despite the existence a variety of worthy alternative formats) is the ease with which tab-delimited data can be edited and processed on the command line. In this post I wanted to see what kinds of information are easily extracted from a GFF3 file using basic command line tools (i.e. no reusable parsers as provided by BioPerl, BioPython, et al).

Determine all annotated sequences

The GFF3 spec includes the ##sequence-region pragma whose purpose is to specify boundary coordinates for each annotated sequence in the file. However, I see this pragma neglected more often than not, both by data files and by tools. Plus, there is no strict requirement that the file must include features corresponding to each ##sequence-region entry. So if you want to determine which sequences for which a given GFF3 really provides annotations, it’s best to look instead at the first column of each feature.

Here is a command that will print out all of the sequences annotated by a given GFF3 file. A detailed explanation is provided below.

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq -c | sort -rn | head
  • The first cut command will attempt to extract and print the 1st and 9th column of each line in the file. We ultimately don’t need the 9th column, but it’s helpful at this point since it allows us to distinguish our entries of interest (features) from other entries (directives, pragmas, comments, and sequence data). This command will output two fields for each feature, while it will only output a single field for all the other entry types. This enables filtering out of non-features with subsequent commands.
  • The grep command will check each line of the previous command to see whether it contains a tab character (separating two fields). Only lines that contain a tab are printed. The output of this command is all of the features from the GFF3 file.
  • The second cut command will cut out the first field from each feature and ignore the other, since we weren’t really interested in it in the first place. The output of this command is the sequence corresponding to each feature.
  • The first sort command will (surprise!) sort the output of the previous command.
  • The uniq command will collapse identical adjacent lines and print out the number of times each line is seen. The output of this command is what we’re looking for and really we could stop here. The next two commands are simply for convenience.
  • The second sort command will sort the sequence IDs according to the number of corresponding annotated features.
  • The head command will simply ensure that you only see the first 10 lines of output (useful if your GFF3 file includes thousands of sequences, such as scaffolds from a draft assembly).

Determine annotation types

Perhaps the most common use for the GFF3 format is encoding protein-coding gene structures. However, the format is very flexible and can leverage the Sequence Ontology to provide annotations for just about anything you could imagine related to biological sequences. So when working with annotations from an external source, it’s always a good idea to verify your assumptions about precisely what type of annotations are provided.

Here is a command that will print out all of the feature types contained in a given GFF3 file, along with a detailed explanation.

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq -c | sort -rn | head
  • We’re interested only in features, so the grep command will print out all lines that do not start with a pound symbol.
  • The cut command will cut out the third column of each feature and print it. This command will automatically ignore any unwanted data that may not have been filtered out by the previous command (such as sequence data), since those data do not contain tabs.
  • The purpose of the remaining commands is identical to the previous example: sort the output, collapse adjacent lines that are identical, and print out each line and the number of times it occurs. Again the last two command are provided only for convenience so that your terminal is not flooded with data.

Determine the number of genes

The previous command gives the number of occurrences of each feature type in the GFF3 file. However, if you are only interested in the number of occurrences of a single feature type (say, genes), then there is a much simpler solution.

grep -c $'\tgene\t' yourannots.gff3

Replace gene with your desired feature type and that’s that!

Extract all gene IDs

The previous example can easily be extended to pull out all the IDs for a give feature type. Here is a command that will print out all gene IDS for a given GFF3 file, with a corresponding explanation. This command is easily extended to other feature types.

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)' | head
  • As in the previous example, the grep command will print only lines that contain gene features.
  • The Perl one-liner uses a regular expression to identify the ID attribute and print it.
  • The head command is only for convenience and simply makes sure you don’t flood your terminal with IDs.

Print length of each gene

The previous examples can also be extended quite easily to print out the length of all features of a given type. This is very helpful if you want to analyze and/or visualize a distribution of feature lengths. Here is a command that will print out the length of each gene in a given GFF3 file. Again, this is easily adapted to other feature types.

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | /
                                   perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)' | /
                                   sort -rn | head
  • As in the previous example, the grep command will print only lines that contain gene features.
  • The cut command will extract and print the 4th and 5th field of each feature, which corresponds to its start and end coordinates.
  • The Perl one-liner will print out the length of each feature given the coordinates.
  • As in previous examples, the remaining commands are solely for aesthetic convenience.

This command would typically be entered on a single line in your terminal. I only broke it up into multiple lines for readability.

Example

I downloaded gene annotations from TAIR to verify that these commands work as expected. Below is my terminal output.

standage@iMint ~ $ wget ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
--2012-08-10 15:16:45--  ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
           => `TAIR10_GFF3_genes.gff'
Resolving ftp.arabidopsis.org (ftp.arabidopsis.org)... 171.66.71.56
Connecting to ftp.arabidopsis.org (ftp.arabidopsis.org)|171.66.71.56|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Genes/TAIR10_genome_release/TAIR10_gff3 ... done.
==> SIZE TAIR10_GFF3_genes.gff ... 44139005
==> PASV ... done.    ==> RETR TAIR10_GFF3_genes.gff ... done.
Length: 44139005 (42M) (unauthoritative)

100%[=====================================================================================================================>] 44,139,005  6.67M/s   in 7.2s    

2012-08-10 15:16:54 (5.87 MB/s) - `TAIR10_GFF3_genes.gff' saved [44139005]

standage@iMint ~ $ cut -s -f 1,9 TAIR10_GFF3_genes.gff | grep $'\t' | cut -f 1 | sort | uniq -c | sort -rn | head
 157712 Chr1
 135017 Chr5
 113968 Chr3
  91857 Chr2
  90371 Chr4
    723 ChrM
    616 ChrC
standage@iMint ~ $ grep -v '^#' TAIR10_GFF3_genes.gff | cut -s -f 3 | sort | uniq -c | sort -rn | head
 215909 exon
 197160 CDS
  35386 protein
  35386 mRNA
  34621 five_prime_UTR
  30634 three_prime_UTR
  28775 gene
   3911 mRNA_TE_gene
   3903 transposable_element_gene
   1274 pseudogenic_exon
standage@iMint ~ $ grep -c $'\tgene\t' TAIR10_GFF3_genes.gff
28775
standage@iMint ~ $ grep $'\tgene\t' TAIR10_GFF3_genes.gff | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)' | head
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01046
AT1G01050
AT1G01060
AT1G01070
AT1G01073
AT1G01080
standage@iMint ~ $ grep $'\tgene\t' TAIR10_GFF3_genes.gff | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)' | sort -rn | head
31258
26435
25965
23544
19753
19352
18492
18184
17943
17555
standage@iMint ~ $

Training the SNAP ab initio gene predictor

I’m in the process of annotating the genome of a non-model insect species (using the Maker annotation pipeline). Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by Maker are no exception. However, a reliable training set is not always easily accessible, especially for non-model species. In this situation the Maker documentation suggests running Maker to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (Maker has an automated process for iterative training and re-annotation). I may try this approach some time, but intuitively the models should be more accurate if they are trained on a training set in which we have a high level of confidence. Luckily, my advisor has been working on identifying genes conserved across insects (including my species of interest), so he was able to furnish a high-quality data set for training the gene finders: a GFF3 file containing spliced alignments of conserved gene models against the genome of interest (for which I already had the Fasta file).

To train the SNAP gene predictor, I followed the procedure briefly described in the software’s README file. However, this did require me to pre-process the data a bit–and there were a few gotchas along the way. This post is a record of the process.

Converting from GFF3 to ZFF

SNAP uses a non-standard format dubbed ZFF (also described briefly in the README) for model training. I wrote the following script (gff3_to_zff.pl) to convert the gene models my advisor had given me in GFF3 format to the ZFF format.

#!/usr/bin/env perl
use strict;

my @exons;
my $gene_count = 0;
my $current_seq = "";
while(my $line = <STDIN>)
{
  if($line =~ m/^###/)
  {
    flush(\@exons);
    next;
  }
  my @fields = split(/\t/, $line);
  if($fields[2] eq "mRNA")
  {
    flush(\@exons);
  }
  elsif($fields[2] eq "exon")
  {
    if ($fields[0] ne $current_seq)
    {
      $current_seq = $fields[0];
      printf(">%s\n", $current_seq);
    }
    push(@exons, \@fields);
  }
}
flush();

sub flush
{
  my $num_exons = scalar(@exons);
  return if($num_exons == 0);
  
  my $group = sprintf("%s.%d", $exons[0]->[0], $gene_count);
  $gene_count++;
  
  if($num_exons == 1)
  {
    my($start, $end) = ($exons[0]->[3], $exons[0]->[4]);
    if($exons[0]->[6] eq "-")
    {
      ($start, $end) = ($exons[0]->[4], $exons[0]->[3]);
    }
    printf("Esngl\t%lu\t%lu\t%s\n", $start, $end, $group);
  }
  else
  {
    @exons = reverse(@exons) if($exons[0]->[6] eq "-");
    for(my $i = 0; $i < $num_exons; $i++)
    {
      my $exon_type = "Exon";
      if($i == 0)
      {
        $exon_type = "Einit";
      }
      elsif($i == $num_exons - 1)
      {
        $exon_type = "Eterm";
      }
      
      my($start, $end) = ($exons[$i]->[3], $exons[$i]->[4]);
      if($exons[0]->[6] eq "-")
      {
        ($start, $end) = ($exons[$i]->[4], $exons[$i]->[3]);
      }
      
      printf("%s\t%lu\t%lu\t%s\n", $exon_type, $start, $end, $group);
    }
  }
  @exons = ();
}

Trimming and sorting the genome sequences

With the gene models in ZFF format and the genome in Fasta format, I was ready to begin the training procedure…or so I thought. I was getting some strange unexpected errors, and after a bit of digging it appeared that the errors were due to sorting differences in the two files. It looks like the SNAP training program expects all of the gene annotations corresponding to the first sequence in the genome file to come first, and then follow the same order throughout the rest of the file. I hadn’t realized before, but the sequences in the genome Fasta file were sorted numerically (scaffold_1, scaffold_2, scaffold_3, etc) whereas the gene models in the annotation file were sorted lexicographically by sequence (scaffold_1, scaffold_10, scaffold_100, etc). I wrote a quick Perl script to sort the Fasta file, but still had problems. This time, the issue was that the order was still not correct since the genome contained some sequences for which no gene models were provided in the annotation file. To preserve the same order in both files, I had to remove these sequences from the genome file. The following script (fasta_sort.pl) took care of both the sorting and filtering.

#!/usr/bin/env perl
use strict;
use Bio::SeqIO;

my $filterfile = $ARGV[0];
my $seqs_to_keep = {};
open(my $FILTER, "<", $filterfile) or die("Error: could not open '$filterfile' $!");
while(<$FILTER>)
{
  chomp();
  $seqs_to_keep->{$_} = 1;
}
close($FILTER);

my $seqs = {};
my $loader = Bio::SeqIO->new(-fh => \*STDIN, -format => 'Fasta');
while(my $seq = $loader->next_seq)
{
  $seqs->{$seq->id} = $seq;
}

my @keys = sort(keys(%$seqs));
my $writer = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Fasta');
foreach my $seqid(@keys)
{
  $writer->write_seq($seqs->{$seqid}) if($seqs_to_keep->{$seqid});
}

Training

At this point, training proceeded as expected and I was able to train a SNAP model I can use for my genome of interest. The entire procedure is documented below.

perl gff3_to_zff.pl < myinsect-conserved-genes.gff3 > myinsect.ann
grep '^>' myinsect.ann | tr -d '>' > myinsect.seqs2keep
perl fasta_sort.pl myinsect.seqs2keep < myinsect-genome.fasta > myinsect.dna
fathom myinsect.ann myinsect.dna -gene-stats > gene-stats.log 2>&1
fathom myinsect.ann myinsect.dna -validate > validate.log 2>&1
fathom myinsect.ann myinsect.dna -categorize 1000 > categorize.log 2>&1
fathom uni.ann uni.dna -export 1000 -plus > uni-plus.log 2>&1
mkdir params
cd params
forge ../export.ann ../export.dna > ../forge.log 2>&1
cd ..
hmm-assembler.pl myinsect params/ > myinsect.hmm

Color coding for custom anonymous user tracks in GBrowse2

GBrowse2 provides a pretty nice feature that enables anonymous users to upload their own data and set up a custom, private track for visualization in the context of all the permanent, official tracks provided by the host. A wide variety of configuration options are available to customize the look, feel, and behavior of each individual data track. If you have administrative access to the genome browser, you can even adding Perl callback functions to configuration settings to enable dynamic calculation of each genomic feature’s color, shape (glyph), label, hover text, and click action. This is a powerful and flexible approach that enables a high level of customization and integration.

This week, I wanted to add feature-specific color-coding to my custom data track. My first attempt was to add callback functions to the fgcolor and bgcolor configuration settings to calculate the feature’s color based on a discriminative GFF3 attribute. Unfortunately, scriptable configurations are not available for anonymous user tracks. This is undoubtedly for good reason (since allowing these would introduce significant security vulnerabilities), but this means I had to figure out another approach to color-coding.

The GBrowse mailing list suggested I encode a discriminative value in the “score” column of the (GFF3) data file and then use the graded_segments glyph for rendering. This did indeed provide some color coding. However, the graded_segments glyph only allows you to specify one base color, and modulating a feature’s score will only change the intensity or fill of that color, not the hue itself. I decided to look around a bit more, and discovered the heat_map glyph. This glyph allows you to specify two colors. Then, the color for each feature is determined by the mapping of that feature’s score to the continuous spectrum between the two specified colors. I liked this approach much better, as it gave me a much larger palette with which to work.

For my final solution, I ended up setting each feature’s score to a discrete value between 0 and 1 (inclusive), and then adding the following settings to my track configuration.

glyph       = heat_map
min_score   = 0
max_score   = 1
start_color = red
end_color   = green

This turned out quite nicely. The value 0 maps to red, 0.25 maps to orange, 0.5 maps to yellow, 0.75 maps to a light green, and 1.0 maps to a darker true green.

Converting Entrez XML to GFF3

This last week represents the second time I have tried to access gene annotations for a particular genome of interest, but quickly lost interest since the data were only available in complicated, extremely verbose, and poorly documented XML or ASN.1 formats. Luckily, the first time this happened (a few months ago), I tried hopping on NCBI’s FTP site and was able to find annotations in a GFF-like tab-delimited format, which I was able to easily convert to GFF3. This last week, however, NCBI’s FTP site (and their help desk, for that matter) were no help for finding usable gene annotations in a tab-delimited format.

I finally decided to buck up, bite the bullet, and write a conversion script myself (if these conversion scripts counted toward advancement, I’d have tenure by now). Most of my experience with XML comes from my web development days where I primarily used PHP’s SimpleXML library for parsing and processing XML data. I’m sure Perl and C (and probably all the other common languages) have XML-processing libraries that are just fine, but I decided to implement this script in PHP so that I could work in a familiar environment and complete the task as quickly as possible so I can get back to my research.

Anyway, below is the latest draft of the conversion script I implemented.

#!/usr/bin/env php
<?php

/*
This program takes a single argument on the command line: the path of a file containing Entrez XML-formatted gene annotations.
GFF3-formatted annotations are printed to STDOUT.
*/

ini_set("memory_limit", -1);
assert_options(ASSERT_BAIL, false);
assert_options(ASSERT_WARNING, false);
$strands = array("plus" => "+", "minus" => "-");
$encode_search = array(';', '=', '%', '&', ',');
$encode_replace = array('%3B', '%3D', '%25', '%26', '%2C');

$xmlfile = $argv[1];
$xmldata = simplexml_load_file($xmlfile);

$genes = $xmldata->xpath('/Entrezgene-Set/Entrezgene');

function assertordie($condition, $message)
{
  assert($condition) or fprintf(STDERR, "Assert error: $message\n") and die();
}

foreach($genes as $gene)
{
  // Gene feature data
  $gene_ui   = $gene->{'Entrezgene_gene-source'}->{'Gene-source'}->{'Gene-source_src-int'};
  $gene_acc  = $gene->{'Entrezgene_gene'}->{'Gene-ref'}->{'Gene-ref_locus'};
  $gene_acc = str_replace($encode_search, $encode_replace, $gene_acc);
  $gene_desc = $gene->{'Entrezgene_gene'}->{'Gene-ref'}->{'Gene-ref_desc'};
  $gene_desc = str_replace($encode_search, $encode_replace, $gene_desc);
  $gene_comm = $gene->xpath('Entrezgene_locus/Gene-commentary');
  $gene_comm_count = 0;
  if(sizeof($gene_comm) > 1)
  {
    fprintf(STDERR, "Warning: assuming that locus '%s (%s)' contains %d genes\n", $gene_acc, $gene_ui, sizeof($gene_comm));
  }

  foreach($gene_comm as $comm)
  {
    $gene_comm_count++;
    $comm_ui = $gene_ui;
    if(sizeof($gene_comm) > 1)
    {
      $comm_ui = sprintf("%s.g%d", $gene_ui, $gene_comm_count);
    }
    $gene_seq = $comm->{'Gene-commentary_accession'};
    $gene_intervals = $comm->xpath('Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval');
    assertordie(sizeof($gene_intervals) == 0 or sizeof($gene_intervals) == 1, sprintf("number of intervals for gene '%s (%s)': expected=%s, actual=%d", $gene_acc, $comm_ui, "[0,1]", sizeof($gene_intervals)));
    if(sizeof($gene_intervals) == 0)
    {
      fprintf(STDERR, "Warning: gene '%s (%s)' contains no genomic intervals, assuming it's a deprecated gene, skipping\n", $gene_acc, $comm_ui);
      continue;
    }

    $gene_products = $comm->xpath('Gene-commentary_products/Gene-commentary[Gene-commentary_type/@value="mRNA"]');
    if(sizeof($gene_products) < 1)
    {
      fprintf(STDERR, "Warning: gene '%s (%s)' contains no mRNA products, assuming it's a duplicates gene, skipping\n", $gene_acc, $comm_ui);
      continue;
    }
    
    $strand_attributes = $gene_intervals[0]->{"Seq-interval_strand"}->{"Na-strand"}->attributes();
    $gstrand = $strands[(string)$strand_attributes["value"]];
    $gattributes = sprintf('ID=%s;Name=%s;Note="%s"', $comm_ui, $gene_acc, $gene_desc);
    $gstart = (int)$gene_intervals[0]->{"Seq-interval_from"} + 1;
    $gend   = (int)$gene_intervals[0]->{"Seq-interval_to"} + 1;
    printf("%s\t%s\tgene\t%d\t%d\t.\t%s\t.\t%s\n", $gene_seq, "Entrez", $gstart, $gend, $gstrand, $gattributes);

    // mRNA feature data
    foreach($gene_products as $mrna)
    {
      // mRNA and exon features
      $tui = $mrna->{"Gene-commentary_accession"};
      $exons = $mrna->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_mix/Seq-loc-mix/Seq-loc/Seq-loc_int/Seq-interval');
      if(sizeof($exons) == 0)
      {
        $exons = $mrna->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_int/Seq-interval');
        assertordie(sizeof($exons ==1), sprintf("number of exons for transcript '%s': expected=%d, actual=%d", $tui, 1, sizeof($exons)));
      }
      $tattributes = sprintf('ID=%s;Parent=%s', $tui, $comm_ui);

      $tcoords = array();
      foreach($exons as $exon)
      {
        $tcoords[] = $exon->{"Seq-interval_from"};
        $tcoords[] = $exon->{"Seq-interval_to"};
      }
      $tstart = min($tcoords) + 1;
      $tend   = max($tcoords) + 1;

      // protein and CDS features
      $transcript_products = $mrna->xpath('Gene-commentary_products/Gene-commentary[Gene-commentary_type/@value="peptide"]');
      assertordie(sizeof($transcript_products == 1), sprintf("number of products for transcript '%s': expected=%d, actual=%d", $tui, 1, sizeof($transcript_products)));
      $protein = $transcript_products[0];
      $pui = $protein->{"Gene-commentary_accession"};
      $cds_segments = $protein->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_mix/Seq-loc-mix/Seq-loc/Seq-loc_int/Seq-interval');
      if(sizeof($cds_segments) == 0)
      {
        $cds_segments = $protein->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_int/Seq-interval');      
      }
      $pcoords = array();
      foreach($cds_segments as $cds)
      {
        $pcoords[] = $cds->{"Seq-interval_from"};
        $pcoords[] = $cds->{"Seq-interval_to"};
      }
      $pstart = min($pcoords) + 1;
      $pend   = max($pcoords) + 1;
      $pattributes = sprintf("ID=%s;Parent=%s", $pui, $tui);

      // Print out all features
      printf("%s\t%s\tmRNA\t%d\t%d\t.\t%s\t.\t%s\n", $gene_seq, "Entrez", $tstart, $tend, $gstrand, $tattributes);
      printf("%s\t%s\tprotein\t%d\t%d\t.\t%s\t.\t%s\n", $gene_seq, "Entrez", $pstart, $pend, $gstrand, $pattributes);
      foreach($exons as $exon)
      {
        $estart = (int)$exon->{"Seq-interval_from"} + 1;
        $eend   = (int)$exon->{"Seq-interval_to"} + 1;
        printf("%s\t%s\texon\t%d\t%d\t.\t%s\t.\tParent=%s\n", $gene_seq, "Entrez", $estart, $eend, $gstrand, $tui);
      }
      foreach($cds_segments as $cds)
      {
        $cstart = (int)$cds->{"Seq-interval_from"} + 1;
        $cend   = (int)$cds->{"Seq-interval_to"} + 1;
        printf("%s\t%s\tCDS\t%d\t%d\t.\t%s\t.\tParent=%s\n", $gene_seq, "Entrez", $cstart, $cend, $gstrand, $tui);
      }
    }
  }
}
?>

GFF3 syntax highlighting for nano

Proficiency with a command-line text editor is a valuable skill for a biologist. Computers that have enough resources to tackle large-scale bioinformatics problems often don’t have a graphical interface–the only way to create and modify files is to use an editor like vim, emacs, or nano. Programmers have very strong feelings about which is the best, but it doesn’t really make a difference to the average biologist. The nano editor is probably the easiest for a beginner to learn.

When using any kind of text editor, I find that syntax highlighting is an extremely useful feature. It color-codes the text according to its function in the code, making it easy to browse through the code, understand what each statement does, and quickly identify errors. The text editors I mentioned above support syntax highlighting for common programming languages.

Using custom nano configuration files (.nanorc files), you can define a custom syntax highlighting scheme for a given file type. Since I do a lot of work with gene annotations, I created a custom .nanorc that defines syntax highlighting for GFF3 files. GFF3 files contain data (not code), but the syntax highlighting can still be useful. The color coding is still pretty simple at this point, but adds another dimension when viewing the file.

Here is the .nanorc file (to be placed in your home directory)…

## GFF3 files
##
syntax "gff3" "\.gff3$"

## Directives
color brightblue "^##[a-z].*"

## Resolution symbol
color red,white "^###$"

## Reserved attributes
color brightblack "(ID|Name|Alias|Parent|Target|Gap|Derives_from|Note|Dbxref|Ontology_term|Is_circular)"

## Comments
color white "^#[^#].*"

…and here is an example GFF3 file, with and without highlighting.

GFF3 file without syntax highlighting

GFF3 file without syntax highlighting

GFF3 file with syntax highlighting

GFF3 file with syntax highlighting

Multi-line features in GFF3

Out of the many plain text tab-delimited formats for storing genome annotations, the GFF3 format probably has the clearest specification, the least amount ambiguity, and the most flexibility. One of GFF3’s biggest problems, however, is that it builds on older, similar formats that are ambiguous and weakly defined. The result is that many people create and use data files that they think are in GFF3 format, but in reality are some hybrid format.

One thing that always bothered me was how coding sequences are handled. In GFF3 files, a separate line is required for each CDS segment, and yet the feature type “CDS” is almost used (as far as I’ve seen). The GFF3 spec states that feature types should adhere to the Sequence Ontology, and the SO definition for CDS relates to the entire coding sequence, not coding segments that are often interspersed with introns. Shouldn’t they be using “CDS_fragment” to refer to these disjoint regions of the coding sequence?

I reviewed parts of the GFF3 spec in depth yesterday and found something I had not noticed before. I had always considered each separate line in a GFF3 file as a distinct feature. Indeed that is typically the case, but the spec does allow for a single feature to be defined on multiple lines. Any lines that share the same ID attribute correspond to the same feature. This is GFF3’s mechanism for defining disjoint features. So the fact that the “CDS” so term is used is not a problem–assuming that all of the lines corresponding to a single CDS share the same “ID” attribute.

I don’t know how and why I missed this before. The GFF3 spec was revised as recently as 7 months ago, so perhaps this is a recent addition to the official spec. Regardless, this makes much more sense to me now and seems to be much more consistent. I just need to make sure the software I am writing to process GFF3 handles multi-line features correctly.

Story of my life right now

Yesterday’s XCKD comic could not have been more timely. This week I am trying to gather whole-genome annotations for a variety of model organisms–well, I am in fact gathering two sets of annotations for each organism (for comparison). My real trouble hasn’t been downloading the data to my local machine (although navigating a smorgasbord of genome browsers and FTP sites has been “fun”). My real trouble begins once I have the data in hand.

BED, GTF, GFF2, GFF3, XML; all too loosely defined (or too loosely adhered to) to enable any kind of reliable conversion utilities. So I’m stuck searching for conversion scripts on Google, hoping I find one that works for my particular data set…until I throw my hands up in the air and consign myself to writing yet another Perl script that will take 5 minutes to code and 2 hours to debug.

I’m glad I’m not the only one that feels this way. Take a look at the top answer to this thread in a bioinformatics Q&A forum.

At times I’ve felt that, given enough time, I could come up with a solution that suits everybody’s needs. But then that just puts us right back to where we started (refer again to the XKCD comic). The problem isn’t formats, the problem is people.