Category: Genomes

Basic summary stats for all genomes at NCBI

I’m wrapping up a manuscript this week (woo hoo!), and at one point was looking for a reference for the claim that a genome-wide GC content of ≈30% is one of the lowest observed in any animal genome. I looked at a couple of recently published animal genomes, and couldn’t even find a mention of the genome’s overall nucleotide composition. I know all of the data is accessible from NCBI, but there are hundreds of published animal genomes, and I was not keen on downloading all those genome sequences and computing GC content. Plus, I mean come on, NCBI should have this type of information easily accessible somewhere, right?

It turns out my intuition was correct. There is a “Browse by organism” link on NCBI’s genomes page that brings up a tabular readout of basic summary stats for all genome assemblies submitted to NCBI. Finding the information I needed was as simple as clicking the “Eukaryotes” tab, filtering by group (“Animals”), and sorting by %GC content (they also provide info on genome size and gene content). After skipping the entries with missing values for %GC content, I was able to confirm the claim, so I referenced this resource in the paper.
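
Of course, for any single assembly the computation itself is trivial; a minimal Perl sketch (my own, not an NCBI tool) that reports overall GC content for a Fasta file read from standard input might look like this.

```perl
#!/usr/bin/env perl
# Report overall GC content for a Fasta file read from stdin.
use strict;
use warnings;

# GC percentage of a nucleotide string; Ns and gaps are excluded
# from the denominator, as %GC is usually reported.
sub gc_content {
  my ($seq) = @_;
  my $gc   = () = $seq =~ m/[GCgc]/g;
  my $acgt = () = $seq =~ m/[ACGTacgt]/g;
  return $acgt ? 100 * $gc / $acgt : 0;
}

my $seq = "";
while (my $line = <STDIN>) {
  next if $line =~ m/^>/;  # skip deflines
  chomp($line);
  $seq .= $line;
}
printf("GC content: %.2f%%\n", gc_content($seq));
```

Usage would be something like `perl gc.pl < genome.fasta`; running it over hundreds of downloaded genomes is exactly the chore the NCBI table saves you from.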

New visualization tools

I just saw two Bioinformatics applications notes in my RSS feed this week that caught my attention, both web-based tools for visualizing genome annotation.

ChromoZoom is a genome browser that offloads the bulk of the processing to the browser client, in contrast to the more common bulky, Perl-heavy server-side genome browsers like GBrowse and xGDB. Unlike other browsers that focus on client-side processing, ChromoZoom can load flat files without the need for preprocessing, and it provides a bit of newfangled UI swag in the form of smooth zooming and inertial scrolling.

Scribl is a JavaScript-based graphics library designed to facilitate the development of dynamic web-based tools for visualizing genomic features. The examples provided by Scribl all look very rudimentary (almost unprofessionally so), although the Rover genome browser powered by Scribl looks significantly better. As far as APIs for genome feature visualization go, my only experience is with the GenomeTools AnnotationSketch API (with which I’ve been quite pleased, I must say), so it is nice to see this type of functionality being extended into the domain of web-based languages (who doesn’t like programming in JavaScript?).

I’m not ready to invest too much time in either of these tools yet. Both of them have some significant limitations. However, I am excited about the direction things are going, and I could definitely see myself utilizing them (or their descendants or related tools) some time in the near future as the need arises.

Trimmomatic for cleaning NGS reads

Trimming and cleaning are important first steps in any analysis involving Illumina short read data. I’m somewhat familiar with the experimental protocols and why trimming, clipping, and filtering need to be done, but honestly I haven’t invested too much effort in familiarizing myself with the details of various implementations. I’ve essentially just used an R script from a colleague thus far for all my NGS preprocessing needs.

This week I came across a link for the Trimmomatic tool, which seems to be a well-implemented and reliable tool. I’m interested in comparing the performance of Trimmomatic versus my R script. I’m assuming Trimmomatic will be much faster and produce results of at least the same quality (if not better), in which case I would be happy to scrap the R script for good!
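
For reference, paired-end invocations follow the pattern shown in the Trimmomatic documentation; the sketch below is illustrative only (the jar version, read file names, and adapter file are placeholders, and the step parameters are the ones I recall from the tool's quick-start example).

```
java -jar trimmomatic-0.30.jar PE -phred33 \
    reads_1.fastq reads_2.fastq \
    reads_1P.fastq reads_1U.fastq reads_2P.fastq reads_2U.fastq \
    ILLUMINACLIP:adapters.fa:2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```

The steps run in the order listed: adapter clipping first, then quality trimming from each end, then the sliding-window scan, and finally a minimum-length filter; each paired input produces a “paired” and an “unpaired” output file.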

Error correction with AllPaths-LG

My experience with the AllPaths-LG genome assembler has been great, and anecdotal evidence as well as some rigorous studies seem to corroborate that impression. Earlier this week, I came across a link to this post on the AllPaths-LG blog.

The ALLPATHS-LG fragment read error correction modules are now available for stand alone use outside the full assembly pipeline. You can now easily error correct your own data using our method without running an assembly.

I’m glad I saw this when I did. I recently read over Steven Salzberg et al.’s GAGE evaluation, and remembered the paper mentioning that for many (most?) assemblers, using AllPaths-LG’s error correction module as opposed to Quake gave better results. I’m definitely interested to try this out.

Training the SNAP ab initio gene predictor

I’m in the process of annotating the genome of a non-model insect species (using the Maker annotation pipeline). Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by Maker are no exception. However, a reliable training set is not always easily accessible, especially for non-model species. In this situation the Maker documentation suggests running Maker to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (Maker has an automated process for iterative training and re-annotation). I may try this approach some time, but intuitively the models should be more accurate if they are trained on a training set in which we have a high level of confidence. Luckily, my advisor has been working on identifying genes conserved across insects (including my species of interest), so he was able to furnish a high-quality data set for training the gene finders: a GFF3 file containing spliced alignments of conserved gene models against the genome of interest (for which I already had the Fasta file).

To train the SNAP gene predictor, I followed the procedure briefly described in the software’s README file. However, this did require me to pre-process the data a bit–and there were a few gotchas along the way. This post is a record of the process.

Converting from GFF3 to ZFF

SNAP uses a non-standard format dubbed ZFF (also described briefly in the README) for model training. I wrote the following script to convert the gene models my advisor had given me from GFF3 to ZFF.

#!/usr/bin/env perl
use strict;
use warnings;

my @exons;
my $gene_count = 0;
my $current_seq = "";
while(my $line = <STDIN>) {
  if($line =~ m/^###/) {
    flush();
    next;
  }
  next if($line =~ m/^#/);
  my @fields = split(/\t/, $line);
  if($fields[2] eq "mRNA") {
    flush();
    $gene_count++;
  }
  elsif($fields[2] eq "exon") {
    if($fields[0] ne $current_seq) {
      $current_seq = $fields[0];
      printf(">%s\n", $current_seq);
    }
    push(@exons, \@fields);
  }
}
flush();

sub flush {
  my $num_exons = scalar(@exons);
  return if($num_exons == 0);
  my $group = sprintf("%s.%d", $exons[0]->[0], $gene_count);
  if($num_exons == 1) {
    my($start, $end) = ($exons[0]->[3], $exons[0]->[4]);
    ($start, $end) = ($end, $start) if($exons[0]->[6] eq "-");
    printf("Esngl\t%lu\t%lu\t%s\n", $start, $end, $group);
  }
  else {
    @exons = reverse(@exons) if($exons[0]->[6] eq "-");
    for(my $i = 0; $i < $num_exons; $i++) {
      my $exon_type = "Exon";
      $exon_type = "Einit" if($i == 0);
      $exon_type = "Eterm" if($i == $num_exons - 1);
      my($start, $end) = ($exons[$i]->[3], $exons[$i]->[4]);
      ($start, $end) = ($end, $start) if($exons[$i]->[6] eq "-");
      printf("%s\t%lu\t%lu\t%s\n", $exon_type, $start, $end, $group);
    }
  }
  @exons = ();
}
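
For reference, the ZFF this produces looks something like the snippet below (coordinates invented): one defline per sequence, then one feature per line, with Einit/Exon/Eterm labels for multi-exon genes, Esngl for single-exon genes, start and end swapped on the minus strand, and a group label in the final column.

```
>scaffold_1
Einit	201	325	scaffold_1.1
Exon	902	1071	scaffold_1.1
Eterm	2305	2516	scaffold_1.1
Esngl	7362	4990	scaffold_1.2
```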

Trimming and sorting the genome sequences

With the gene models in ZFF format and the genome in Fasta format, I was ready to begin the training procedure…or so I thought. I was getting some unexpected errors, and after a bit of digging it appeared they were due to sorting differences between the two files. The SNAP training program apparently expects all of the gene annotations corresponding to the first sequence in the genome file to come first, and then to follow the same sequence order throughout the rest of the file. I hadn’t realized it before, but the sequences in the genome Fasta file were sorted numerically (scaffold_1, scaffold_2, scaffold_3, etc.), whereas the gene models in the annotation file were sorted lexicographically by sequence (scaffold_1, scaffold_10, scaffold_100, etc.). I wrote a quick Perl script to sort the Fasta file, but still had problems: the genome contained some sequences for which no gene models were provided in the annotation file, which again threw off the ordering. To preserve the same order in both files, I had to remove these sequences from the genome file. The following script took care of both the sorting and the filtering.

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

my $filterfile = $ARGV[0];
my $seqs_to_keep = {};
open(my $FILTER, "<", $filterfile) or die("Error: could not open '$filterfile' $!");
while(my $seqid = <$FILTER>) {
  chomp($seqid);
  $seqs_to_keep->{$seqid} = 1;
}
close($FILTER);

my $seqs = {};
my $loader = Bio::SeqIO->new(-fh => \*STDIN, -format => 'Fasta');
while(my $seq = $loader->next_seq) {
  $seqs->{$seq->id} = $seq;
}

my @keys = sort(keys(%$seqs));
my $writer = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'Fasta');
foreach my $seqid (@keys) {
  $writer->write_seq($seqs->{$seqid}) if($seqs_to_keep->{$seqid});
}


At this point, training proceeded as expected and I was able to train a SNAP model I can use for my genome of interest. The entire procedure is documented below.

perl < myinsect-conserved-genes.gff3 > myinsect.ann
grep '^>' myinsect.ann | tr -d '>' > myinsect.seqs2keep
perl myinsect.seqs2keep < myinsect-genome.fasta > myinsect.dna
fathom myinsect.ann myinsect.dna -gene-stats > gene-stats.log 2>&1
fathom myinsect.ann myinsect.dna -validate > validate.log 2>&1
fathom myinsect.ann myinsect.dna -categorize 1000 > categorize.log 2>&1
fathom uni.ann uni.dna -export 1000 -plus > uni-plus.log 2>&1
mkdir params
cd params
forge ../export.ann ../export.dna > ../forge.log 2>&1
cd ..
hmm-assembler.pl myinsect params/ > myinsect.hmm

Reference transcript data set for insects

I participate in several science- and programming-related online forums (as I discussed recently), and recently I’ve seen quite a lot of requests for pointers to some set of DNA sequences that can be used for testing this or that (for example, see this thread). A lot of these requests seem to come from programmers with little biological intuition who simply want/need to play around with some biomolecular sequence data. My first instinct is to tell them to just go to GenBank or RefSeq and download some sequences, but I suppose the raw amount of data can be pretty intimidating for anyone who does not know precisely what they are looking for, biologist or not.

This morning I decided to take some time to create a small reference data set. I was planning on making it span most of eukaryotic diversity, but after realizing how long that would take, I decided to simply focus on insects (maybe I’ll do plants later, and then mammals, and so on—it’s much less of a drain if it can be broken up into smaller tasks). The database I created contains transcript sequences for 8 model insect genomes: Aedes aegypti, Atta cephalotes, Apis mellifera, Acyrthosiphon pisum, Bombus impatiens, Drosophila melanogaster, Harpegnathos saltator, and Nasonia vitripennis.

Rather than posting the dataset itself, I’ll go ahead and post the Makefile I used to put together the dataset. Enjoy!


DBNAME=insect-transcripts.fa

# Full database
$(DBNAME):		aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa
				cat aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa > $(DBNAME)
				rm aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa

# Download and decompress transcripts for each genome

# Aedes aegypti
aaeg-trans.fa:
				curl -o aaeg-trans.fa.gz
				gunzip aaeg-trans.fa.gz

# Atta cephalotes
acep-trans.fa:
				curl -o acep-trans.fa.gz
				gunzip acep-trans.fa.gz

# Apis mellifera
amel-trans.fa:
				curl -o amel-trans.fa.gz
				gunzip amel-trans.fa.gz

# Acyrthosiphon pisum
apis-trans.fa:
				curl -o apis-trans.fa.gz
				gunzip apis-trans.fa.gz

# Bombus impatiens
bimp-trans.fa:
				curl -o bimp-trans.fa.gz
				gunzip bimp-trans.fa.gz

# Drosophila melanogaster
dmel-trans.fa:
				curl -o dmel-trans.fa.gz
				gunzip dmel-trans.fa.gz

# Harpegnathos saltator
hsal-trans.fa:
				curl -o hsal-trans.fa.gz
				gunzip hsal-trans.fa.gz

# Nasonia vitripennis
nvit-trans.fa:
				curl -o nvit-trans.fa.gz
				gunzip nvit-trans.fa.gz


I later ran the following commands to standardize the Fasta deflines (headers).

perl -ne 'if(m/^>(AAEL\S+)/){printf(">gnl|Aedes_aegypti|%s\n", $1)}else{print}' < aaeg-trans.fa >> insect-transcripts.fa
perl -ne 's/Acep_1\.0/Atta_cephalotes/; print' < acep-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Apis mellifera/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Apis_mellifera|%s\n", $1)}else{print}' < amel-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Acyrthosiphon pisum/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Acyrthosiphon_pisum|%s\n", $1)}else{print}' < apis-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Bombus impatiens/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Bombus_impatiens|%s\n", $1)}else{print}' < bimp-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/^>(\S+)/){printf(">gnl|Drosophila_melanogaster|%s\n", $1)}else{print}' < dmel-trans.fa >> insect-transcripts.fa 
perl -ne 's/Hsal_3\.3/Harpegnathos_saltator/; print' < hsal-trans.fa >> insect-transcripts.fa
perl -ne 'if(m/Nasonia vitripennis/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Nasonia_vitripennis|%s\n", $1)}else{print}' < nvit-trans.fa >> insect-transcripts.fa

Transcriptome diversity

The abstract for this paper came up in my RSS feed recently, and I immediately took interest. Upon further reading, I noticed that the study was limited to data from human cerebella, but the punchline is still fascinating: alternative transcription mechanisms are responsible for a greater portion of transcriptome diversity than splicing mechanisms!

Multi-line features in GFF3

Of the many plain-text tab-delimited formats for storing genome annotations, the GFF3 format probably has the clearest specification, the least ambiguity, and the most flexibility. One of GFF3’s biggest problems, however, is that it builds on older, similar formats that are ambiguous and weakly defined. The result is that many people create and use data files that they think are in GFF3 format but are in reality some hybrid format.

One thing that has always bothered me is how coding sequences are handled. In GFF3 files, a separate line is required for each CDS segment, and yet the feature type “CDS” is almost always used (as far as I’ve seen). The GFF3 spec states that feature types should adhere to the Sequence Ontology, and the SO definition for CDS refers to the entire coding sequence, not the coding segments that are often interspersed with introns. Shouldn’t they be using “CDS_fragment” to refer to these disjoint regions of the coding sequence?

I reviewed parts of the GFF3 spec in depth yesterday and found something I had not noticed before. I had always considered each separate line in a GFF3 file to be a distinct feature. Indeed that is typically the case, but the spec does allow a single feature to be defined on multiple lines: any lines that share the same ID attribute correspond to the same feature. This is GFF3’s mechanism for defining disjoint features. So the use of the “CDS” SO term is not a problem, assuming that all of the lines corresponding to a single CDS share the same “ID” attribute.
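
To illustrate (coordinates invented), the three CDS lines below describe a single disjoint CDS feature, because they share an ID:

```
##gff-version 3
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA0001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds0001;Parent=mRNA0001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds0001;Parent=mRNA0001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds0001;Parent=mRNA0001
```

A parser that naively treats each line as a distinct feature will miscount here; the three rows must be merged on their shared ID.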

I don’t know how and why I missed this before. The GFF3 spec was revised as recently as 7 months ago, so perhaps this is a recent addition to the official spec. Regardless, this makes much more sense to me now and seems to be much more consistent. I just need to make sure the software I am writing to process GFF3 handles multi-line features correctly.

Define: locus

In biology, the term locus is pretty loosely defined. In the most general sense, it refers to a specific position on a chromosome or other large sequence of genomic DNA. A locus can refer to a single nucleotide, or it can even refer to a large genomic neighborhood. The most common interpretation of the term locus is the location of a gene on the chromosome.

Recently I’ve been working a lot with gene structure annotation: specifically, comparing one set of annotations to another set. For this work, I needed a much more precise definition for a locus. If one set of annotations has overlapping gene models, should I treat each gene model as a separate locus or should I include them together in an aggregate locus? When comparing one set of annotations to the other, should I use gene models from one set to determine the “true” loci, or should I include gene models from both sets in this process?

Given two sets of gene annotations, the simplest approach to determine loci is probably to treat one set as the reference and to say that each gene in that set corresponds to a distinct locus. However, this approach requires us to compare annotations not only at each locus but also in the “intergenic” space between loci, since gene annotations from the second set will not line up perfectly with the gene annotations from the first (reference) set.

After pondering this issue for a while, I decided it makes much more sense to use both sets of gene models simultaneously to determine the loci. This provides several benefits, including the fact that we make no assumptions as to which set of annotations is higher quality (which is more likely to be representative of the “true” loci), and the fact that we do not need to analyze the (possibly) large intergenic spaces between loci (since every gene corresponds to a locus). With these considerations, I currently use the following precise definition of locus when working with gene structure annotations.

Given a set of gene structure annotations, a locus is a maximal region of the genomic sequence containing either a single gene annotation or a set of gene annotations in which every gene in the locus overlaps with at least one other gene in the locus.
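
A hedged sketch of how such loci can be computed (the subroutine name and coordinates are my own): pool the gene intervals from both annotation sets, sort them by start coordinate, and greedily extend the current locus while genes overlap it.

```perl
#!/usr/bin/env perl
# Merge gene intervals (from both annotation sources) into loci:
# each locus is a maximal run of chained overlaps.
use strict;
use warnings;

sub compute_loci {
  my @genes = sort { $a->[0] <=> $b->[0] } @_;  # sort by start
  my @loci;
  foreach my $gene (@genes) {
    if (@loci and $gene->[0] <= $loci[-1][1]) {
      # overlaps the current locus: extend its end if necessary
      $loci[-1][1] = $gene->[1] if $gene->[1] > $loci[-1][1];
    }
    else {
      push(@loci, [ @$gene ]);  # start a new locus
    }
  }
  return @loci;
}

# genes from both sets, as [start, end] pairs (coordinates invented)
my @loci = compute_loci([100,500], [450,900], [2000,2600], [2100,2300]);
printf("%d-%d\n", @$_) foreach @loci;  # two loci: 100-900 and 2000-2600
```

Note the sketch is strand-agnostic and merges by coordinate overlap only, which matches the definition above: every gene in a locus overlaps at least one other gene in that locus.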

Maize genome annotations

This week I needed to get the latest maize gene annotations (from MaizeGDB) for comparison against a different set, and the experience was very similar to the one I had a couple of weeks ago with human genome annotations. Again, I feel the need to document this experience for future reference.

MaizeGDB vs. MaizeSequence.org

The first thing to consider is that MaizeGDB gets all of its gene annotations from MaizeSequence.org. After learning this fact, I decided to go directly to the source of the data rather than download it second-hand.

Assembly version

MaizeGDB has downloadable annotations for two assemblies: B73 RefGen v1 and B73 RefGen v2. You need to make sure you get annotations for the appropriate assembly. This is complicated by the fact that MaizeSequence.org uses different identifiers for the assemblies: “release 4” refers to version 1, and “release 5” refers to version 2.

Gene sets

For both assembly versions, there are two sets of gene structure annotations: a “working” gene set (annotations with relaxed quality control) and a “filtered” gene set (annotations with stricter quality control). You need to decide whether you want to have more genes and risk including possible false positives, or whether you want to reduce false positives at the risk of missing some real genes.

MaizeGDB uses the abbreviations “WGS” and “FGS” to refer to the working and filtered gene sets (respectively). MaizeSequence.org also uses these abbreviations for release 4, but for release 5 it uses separate identifiers for the two gene sets: release “5a” is the working set, and release “5b” is the filtered set.