Training the SNAP ab initio gene predictor

I’m in the process of annotating the genome of a non-model insect species (using the Maker annotation pipeline). Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by Maker are no exception. However, a reliable training set is not always easily accessible, especially for non-model species. In this situation the Maker documentation suggests running Maker to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (Maker has an automated process for iterative training and re-annotation). I may try this approach some time, but intuitively the models should be more accurate if they are trained on a training set in which we have a high level of confidence. Luckily, my advisor has been working on identifying genes conserved across insects (including my species of interest), so he was able to furnish a high-quality data set for training the gene finders: a GFF3 file containing spliced alignments of conserved gene models against the genome of interest (for which I already had the Fasta file).

To train the SNAP gene predictor, I followed the procedure briefly described in the software’s README file. However, this did require me to pre-process the data a bit–and there were a few gotchas along the way. This post is a record of the process.

Converting from GFF3 to ZFF

SNAP uses a non-standard format dubbed ZFF (also described briefly in the README) for model training. I wrote the following script (gff3_to_zff.pl) to convert the gene models my advisor had given me in GFF3 format to the ZFF format.

#!/usr/bin/env perl
use strict;

my @exons;
my $gene_count = 0;
my $current_seq = "";
while(my $line = <STDIN>)
{
  if($line =~ m/^###/)
  {
    flush(\@exons);
    next;
  }
  my @fields = split(/\t/, $line);
  if($fields[2] eq "mRNA")
  {
    flush(\@exons);
  }
  elsif($fields[2] eq "exon")
  {
    if ($fields[0] ne $current_seq)
    {
      $current_seq = $fields[0];
      printf(">%s\n", $current_seq);
    }
    push(@exons, \@fields);
  }
}
flush();

sub flush
{
  my $num_exons = scalar(@exons);
  return if($num_exons == 0);
  
  my $group = sprintf("%s.%d", $exons[0]->[0], $gene_count);
  $gene_count++;
  
  if($num_exons == 1)
  {
    my($start, $end) = ($exons[0]->[3], $exons[0]->[4]);
    if($exons[0]->[6] eq "-")
    {
      ($start, $end) = ($exons[0]->[4], $exons[0]->[3]);
    }
    printf("Esngl\t%lu\t%lu\t%s\n", $start, $end, $group);
  }
  else
  {
    @exons = reverse(@exons) if($exons[0]->[6] eq "-");
    for(my $i = 0; $i < $num_exons; $i++)
    {
      my $exon_type = "Exon";
      if($i == 0)
      {
        $exon_type = "Einit";
      }
      elsif($i == $num_exons - 1)
      {
        $exon_type = "Eterm";
      }
      
      my($start, $end) = ($exons[$i]->[3], $exons[$i]->[4]);
      if($exons[0]->[6] eq "-")
      {
        ($start, $end) = ($exons[$i]->[4], $exons[$i]->[3]);
      }
      
      printf("%s\t%lu\t%lu\t%s\n", $exon_type, $start, $end, $group);
    }
  }
  @exons = ();
}

Trimming and sorting the genome sequences

With the gene models in ZFF format and the genome in Fasta format, I was ready to begin the training procedure…or so I thought. I was getting some strange unexpected errors, and after a bit of digging it appeared that the errors were due to sorting differences in the two files. It looks like the SNAP training program expects all of the gene annotations corresponding to the first sequence in the genome file to come first, and then follow the same order throughout the rest of the file. I hadn’t realized before, but the sequences in the genome Fasta file were sorted numerically (scaffold_1, scaffold_2, scaffold_3, etc) whereas the gene models in the annotation file were sorted lexicographically by sequence (scaffold_1, scaffold_10, scaffold_100, etc). I wrote a quick Perl script to sort the Fasta file, but still had problems. This time, the issue was that the order was still not correct since the genome contained some sequences for which no gene models were provided in the annotation file. To preserve the same order in both files, I had to remove these sequences from the genome file. The following script (fasta_sort.pl) took care of both the sorting and filtering.

#!/usr/bin/env perl
use strict;
use Bio::SeqIO;

my $filterfile = $ARGV[0];
my $seqs_to_keep = {};
open(my $FILTER, "<", $filterfile) or die("Error: could not open '$filterfile' $!");
while(<$FILTER>)
{
  chomp();
  $seqs_to_keep->{$_} = 1;
}
close($FILTER);

my $seqs = {};
my $loader = Bio::SeqIO->new(-fh => \*STDIN, -format => 'Fasta');
while(my $seq = $loader->next_seq)
{
  $seqs->{$seq->id} = $seq;
}

my @keys = sort(keys(%$seqs));
my $writer = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Fasta');
foreach my $seqid(@keys)
{
  $writer->write_seq($seqs->{$seqid}) if($seqs_to_keep->{$seqid});
}

Training

At this point, training proceeded as expected and I was able to train a SNAP model I can use for my genome of interest. The entire procedure is documented below.

perl gff3_to_zff.pl < myinsect-conserved-genes.gff3 > myinsect.ann
grep '^>' myinsect.ann | tr -d '>' > myinsect.seqs2keep
perl fasta_sort.pl myinsect.seqs2keep < myinsect-genome.fasta > myinsect.dna
fathom myinsect.ann myinsect.dna -gene-stats > gene-stats.log 2>&1
fathom myinsect.ann myinsect.dna -validate > validate.log 2>&1
fathom myinsect.ann myinsect.dna -categorize 1000 > categorize.log 2>&1
fathom uni.ann uni.dna -export 1000 -plus > uni-plus.log 2>&1
mkdir params
cd params
forge ../export.ann ../export.dna > ../forge.log 2>&1
cd ..
hmm-assembler.pl myinsect params/ > myinsect.hmm
Advertisements

5 comments

  1. Andrea Gonzalez

    Hi Daniel,

    I have a question regarding the gff3_to_zff.pl script.I have a directory with more than 1000 .gff files and I want to create a single .zff from all of these. I was going to use the maker2zff script, but it gives empty .ann and .dna files. Could you help me out with this? And after converting those files to .zff, why did you do this: grep ‘^>’ myinsect.ann | tr -d ‘>’ > myinsect.seqs2keep? Thanks in advance.

  2. Daniel Standage

    I don’t think it will be a problem to create a single .zff file from your 1000s of .gff3s. If my data were in multiple files, I would try something like this (in place of the first step).

    for file in *.gff
    do
      perl gff_to_zff.pl < $file >> myinsect.ann
    done
    

    I would then proceed as normal with the rest of the steps.

    The second line does two things. The grep command extracts all the lines from the file starting with a > character, and the tr command strips the > character out, just leaving the sequence ID. I did this because I had a file containing many gene sequences, and I needed to extract only the sequences to be used for training SNAP.

    Hope this helps.

  3. Kumar

    Hi Daniel, I tried your gff3_to_Zff.pl script and it returns an empty *.ann file. My gff3 is generated from a maker run. Any ideas?

  4. Colin D

    thanks for the great info. I tried training this on a maker gff and it seemed like SNAP was only predicting exons instead of other features. If I ran snap with “minimal.hmm” then it did output the other features though. Wondering maybe what might have caused that?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s