Comparing human genome annotation builds

I have spent most of my time, both as an undergraduate and graduate student, working with data from plant genomes. I feel fairly comfortable finding and navigating most publicly available plant data resources. This week I decided to test some annotation comparison software I’ve written recently with the two latest builds of the human genome. This required a lot more work than I expected, so here I have documented the issues I encountered and the solutions I came up with.

The first complication is that there is no “definitive” set of gene structure annotations for the human genome. UCSC releases annotations with every new assembly of the human genome (with labels like “hg18” and “hg19”). Ensembl releases annotations much more frequently, resulting in multiple annotation sets per assembly (they use labels like “NCBI36.54” or “GRCh37.55”, where NCBI36 = hg18, GRCh37 = hg19, and the last number refers to the Enseml build number). Considering that UCSC and Ensembl have different gene annotation pipelines/methodologies, choosing a representative pair of genome annotations isn’t necessarily an objective process. I didn’t delve too deeply into this: for simplicity’s sake, I simply downloaded the latest two builds (GRCh37.61 and GRCh37.62) from Ensembl’s FTP site.

The other complication is that I’ve worked almost entirely with GFF3-formatted annotations in the past, while it seems that human genome annotations are only available in GTF (legacy) or BED (newfangled) formats. I don’t have anything against either of these formats, it’s just that my processing and analysis tools expect GFF3. It’s always a pain when there are 6 different data formats to deal with. To make things even more interesting, there are slight differences between how UCSC uses GTF and how Ensembl uses GTF, so some software will not work with one or the other. Whatever…

Today I hacked together a pipeline to convert the data to the format I needed. Below are the commands I used, corresponding explanations for each command, and relevant source code.

perl ensembl_gtf_to_gff.pl < Homo_sapiens.GRCh37.61.gtf > Homo-sapiens-unfiltered-noutrs-nointrons.gff3
grep '^chr[0-9][0-9]*\|^chr[XY]' Homo-sapiens-unfiltered-noutrs-nointrons.gff3 > Homo-sapiens-noutrs-nointrons.gff3
perl add_utrs.pl < Homo-sapiens-noutrs-nointrons.gff3 > Homo-sapiens-nointrons.gff3
gt gff3 -addintrons -retainids -sort -tidy Homo-sapiens-nointrons.gff3 > Homo-sapiens.gff3
  1. First, I tracked down an existing GTF-to-GFF3 converter (from this thread). I modified it slightly to filter out non-protein-coding genes and to extract CDS fragments in addition to exons (see ensembl_gtf_to_gff.pl below) and converted the Ensembl GTF data to GFF3 format.
  2. Next I used grep to filter out any genes that were not mapped to a chromosome in the nuclear genome.
  3. Then, I wrote a script (see add_utrs.pl below) to add explicit UTR locations (as inferred from the exon and CDS features) and ran it on the filtered GFF3 data.
  4. Finally, I used GenomeTools to add explicit intron locations and sort/tidy the data.

I repeated this process for both sets of annotations to get the two GFF3 files I need. It took most of the morning to figure this all out, but I finally have data in the format I need!

ensembl_gtf_to_gff.pl

use strict;

my @v;
my ($g,$tr,$last_tr,$last_g);
my @exons=();
my @cdss=();
my @trs=();
my $counter = 0;

while (<>) {
	@v = split /\t/;
	next unless $v[1] eq "protein_coding" and ($v[2] eq 'exon' or $v[2] eq 'CDS');
	$v[0] = 'chr' . $v[0] unless index($v[0], 'chr') == 0;
	($g,$tr) = $v[8] =~ /gene_id "(ENSG.{11})"; transcript_id "(ENST.{11})/;
	if ($tr ne $last_tr) {
		push @trs, [@exons];
		if ($g ne $last_g) {
			process(@trs) if defined $last_g;
			@trs=();
			$last_g = $g;
			print STDERR '.' if $counter % 10000 == 0;
		}
		$last_tr = $tr;
		@exons = [@v];
	} else {
		push @exons, [@v];
	}
}
process(@trs);

sub process {
	my @trs = @_;
	return unless scalar @trs;
	my @all_starts = ();
	my @all_ends = ();
	my @out = ();
	my @tr_here;
	my ($gid, $tid, $exn, $gname);
	my (@starts, @ends, $start, $end);
	my (@transcript, @gene);
	my $exon_info_string;


	($gid) = $trs[0]->[0]->[8] =~ /gene_id "([^"]+)"/;
	($gname) = $trs[0]->[0]->[8] =~ /gene_name "([^"]+)"/;
	$gname =~ tr/;/./;

	foreach my $exons (@trs) {
		@starts = sort {$a<=>$b} map { $_->[3] } @$exons;
		@ends = sort {$b<=>$a} map { $_->[4] } @$exons;
		$start = $starts[0];
		$end = $ends[0];

		($tid) = $exons->[0]->[8] =~ /transcript_id "([^"]+)"/;
		$exon_info_string = "Parent=$tid\n";

		@transcript = @{$exons->[0]};
		$transcript[2] = 'mRNA';
		$transcript[3] = $start;
		$transcript[4] = $end;
		$transcript[8] =  "ID=$tid;Name=$gname;Parent=$gid\n";

		push @all_starts, $start;
		push @all_ends, $end;
		foreach (reverse @$exons) {
			($exn) = $_->[8] =~ /exon_number "(\d+)"/;
			$_->[8] = $exon_info_string;
			push @out, join("\t", @$_);
		}
		push @out, join("\t", @transcript);
	}
	@all_starts = sort {$a<=>$b} @all_starts;
	@all_ends = sort {$b<=>$a} @all_ends;
	@gene = @transcript;
	$gene[2] = 'gene';
	$gene[3] = $all_starts[0];
	$gene[4] = $all_ends[0];
	$gene[8] = "ID=$gid;Name=$gname\n";
	push @out, join("\t", @gene);
	print reverse @out;
}

add_utrs.pl

use strict;
use Bio::Tools::GFF;

my $loader = Bio::Tools::GFF->new( -fh => \*STDIN,  -gff_version => 3 );
my $writer = Bio::Tools::GFF->new( -fh => \*STDOUT, -gff_version => 3 );

my $current_trns;
my $current_trns_id = "";
my @current_exons = ();
my @current_cdss = ();

while(my $feature = $loader->next_feature())
{
  my $type = $feature->primary_tag();
  if($type eq "gene")
  {
    $writer->write_feature($feature);
  }
  elsif($type eq "CDS")
  {
    push(@current_cdss, $feature);
  }
  elsif($type eq "exon")
  {
    push(@current_exons, $feature);
  }
  elsif($type eq "mRNA")
  {
    if($current_trns_id ne "" and scalar(@current_cdss) > 0)
    {
      process();
    }
    ($current_trns_id) = $feature->get_tag_values("ID");
    $current_trns = $feature;
    @current_exons = ();
    @current_cdss = ();
  }
  else
  {
    die("Unknown feature type '$type' $!");
  }
}
if(scalar(@current_cdss) > 0)
{
  process();
}

sub process
{
  my $cds_start = -1;
  my $cds_end = -1;
  foreach my $cds(@current_cdss)
  {
    if($cds_start == -1 or $cds->location->start < $cds_start)
    {
      $cds_start = $cds->location->start;
    }
    if($cds_end == -1 or $cds->location->end > $cds_end)
    {
      $cds_end = $cds->location->end;
    }
  }
  my @utrs = ();
  my $front_type = "five_prime_UTR";
  my $back_type = "three_prime_UTR";
  if($current_trns->location->strand == -1)
  {
    $front_type = "three_prime_UTR";
    $back_type = "five_prime_UTR";
  }   
  foreach my $exon(@current_exons)
  {   
    if($exon->location->start < $cds_start)
    {   
      my $utr_start = $exon->location->start;
      my $utr_end = $exon->location->end;
      if($exon->location->end >= $cds_start)
      {   
        $utr_end = $cds_start - 1;
      }   
      my $utr = Bio::SeqFeature::Generic->new
      (   
        -seq_id => $exon->seq_id,
        -source_tag => $exon->source_tag,
        -primary => $front_type,
        -start => $utr_start,
        -end => $utr_end,
        -strand => $exon->location->strand,
        -frame => ".",
      );  
      $utr->add_tag_value("Parent", $current_trns_id);
      push(@utrs, $utr);
    }   
    if($exon->location->end > $cds_end)
    {   
      my $utr_start = $exon->location->start;
      my $utr_end = $exon->location->end;
      if($exon->location->start <= $cds_end)
      {   
        $utr_start = $cds_end + 1;
      }
      my $utr = Bio::SeqFeature::Generic->new
      (
        -seq_id => $exon->seq_id,
        -source_tag => $exon->source_tag,
        -primary => $back_type,
        -start => $utr_start,
        -end => $utr_end,
        -strand => $exon->location->strand,
        -frame => ".",
      );
      $utr->add_tag_value("Parent", $current_trns_id);
      push(@utrs, $utr);
    }
  }
  $writer->write_feature($current_trns);
  foreach my $feat((@current_exons, @current_cdss, @utrs))
  {
    $writer->write_feature($feat);
  }
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s