Permutation algorithm

In a previous post, I described how I am using model vectors to represent gene structure along a genomic sequence. I have also been working on a program that will compare two model vectors and print out summary statistics of the comparison. This is particularly handy if you want to compare two sets of gene annotations (from separate gene prediction tools) for the same genomic region. Believe me, you do not want to compare GFF3 files manually…

My program needed reusable tools to read in annotations in GFF3 format and then generate a model vector for any given region along the genomic sequence. The algorithm required for this was surprisingly complex, and there are two reasons. First, we don’t specify how many gene models are in that region. Second, if there are indeed multiple gene models in that region and those gene models are alternatively spliced, then a single model vector is insufficient to represent all possible gene structures for that region. The algorithm needed to determine how many gene models there are, how many splice variants there are per gene model, and then generate model vectors representing every possible way to combine one transcript from each gene model.

Imagine we have a tool that will give us all of the gene models in a particular region with the following data structure.

# This is an (over)simplistic example. Real gene models would of course be much longer.
my $gene_models =
{
  gene_1 =>
  {
    gene_1_transcript_3 =>
    {
      start => 1,
      end => 21,
      vector => 'FFFCCCCCCCCCIIICCCTTT',
    },
    gene_1_transcript_1 =>
    {
      start => 1,
      end => 21,
      vector => 'FFFCCCIIICCCIIICCCTTT',
    },
    gene_1_transcript_2 =>
    {
      start => 1,
      end => 21,
      vector => 'GGGCCCIIICCCIIICCCTTT',
    },
  },
  gene_2 =>
  {
    gene_2_transcript_1 =>
    {
      start => 30,
      end => 42,
      vector => 'CCCIIICCCCTTT',
    },
    gene_2_transcript_2 =>
    {
      start => 30,
      end => 42,
      vector => 'CCCCCCCCCCTTT',
    },
  },
};

Given model vectors in this format and the start and end positions of the region, the following subroutine will generate vectors representing every possible combination of vectors from the gene models.

sub get_vectors_by_region
{
  my($gene_models, $region_start, $region_end) = @_;

  # Determine how many vectors are required to represent this region
  my $num_combinations = 1;
  my $transcript_models = {};
  while(my($gene_id, $gene_model) = each(%$gene_models))
  {
    my $num_transcripts =()= keys(%$gene_model);
    $num_combinations *= $num_transcripts;
  }

  # Generate the vectors
  my $region_vectors = {};
  for(my $v = 0; $v < $num_combinations; $v++)
  {
    # Generate the next combination of transcripts
    my $current_combination = {};
    while(my($gene_id, $gene_model) = each(%$gene_models))
    {
      my @vector_ids = keys(%$gene_model);
      my $M = scalar(@vector_ids);
      my $transcript_id = $vector_ids[$v % $M];
      $current_combination->{$transcript_id} = $gene_model->{$transcript_id};
    }

    # Place the transcript vectors appropriately in the region vector
    my $region_length = ($region_end - $region_start + 1);
    my $region_vector = "G"x$region_length;
    while(my($transcript_id, $transcript_model) = each(%$current_combination))
    {
      my $subvector = $transcript_model->{vector};
      my $front_trim = ($region_start - $transcript_model->{start});
      if($front_trim > 0)
      {
        $subvector = substr($subvector, $front_trim);
      }
      my $end_trim = ($region_end - $transcript_model->{end});
      if($end_trim < 0)
      {
        $subvector = substr($subvector, 0, $end_trim);
      }
      my $subvector_start = $transcript_model->{start} - $region_start;
      $subvector_start = 0 if($subvector_start < 1);
      substr($region_vector, $subvector_start, length($subvector)) = $subvector;
    }
    my $region_vector_id = join(":", keys(%$current_combination));
    $region_vectors->{$region_vector_id} = $region_vector;
  }

  # Congrats, here are the vectors required to represent all combinations of splice
  # variants in the region.
  return $region_vectors;
}

For the example above, this algorithm would return the following vectors.

$region_vectors =
{
  'gene_1_transcript_1:gene_2_transcript_1' => 'FFFCCCIIICCCIIICCCTTTGGGGGGGGCCCIIICCCCTTT',
  'gene_1_transcript_1:gene_2_transcript_2' => 'FFFCCCIIICCCIIICCCTTTGGGGGGGGCCCCCCCCCCTTT',
  'gene_1_transcript_2:gene_2_transcript_1' => 'GGGCCCIIICCCIIICCCTTTGGGGGGGGCCCIIICCCCTTT',
  'gene_1_transcript_2:gene_2_transcript_2' => 'GGGCCCIIICCCIIICCCTTTGGGGGGGGCCCCCCCCCCTTT',
  'gene_1_transcript_3:gene_2_transcript_1' => 'FFFCCCCCCCCCIIICCCTTTGGGGGGGGCCCIIICCCCTTT',
  'gene_1_transcript_3:gene_2_transcript_2' => 'FFFCCCCCCCCCIIICCCTTTGGGGGGGGCCCCCCCCCCTTT',
};

Here is source code for the entire test case if you want to run it yourself.

Of course, I wasn’t smart enough to come up with this all by myself. This solution is adapted from the non-recursive solution posted by polygenelubricants in this StackOverflow thread. My Perl code is a bit more hairy than the Java code he provides, but then again the problem is a bit more hairy with associative arrays than it is with simple character arrays!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s