Category: Algorithms

New visualization tools

I just saw two Bioinformatics applications notes in my RSS feed this week that caught my attention, both web-based tools for visualizing genome annotation.

ChromoZoom is a genome browser that offloads the bulk of the processing to the browser client, as opposed to the more common bulky Perl-heavy server-side genome browsers like GBrowse and xGDB. Unlike other browsers that focus on client side processing, ChromoZoom can load flat files without the need for preprocessing, and it provides a bit of newfangled UI swag in the form of smooth zooming and intertial scrolling.

Scribl is a JavaScript-based graphics library designed to facilitate the development of dynamic web-based tools for visualizing genomic features. The examples provided by Scribl all look very rudimentary (almost unprofessionally so), although the Rover genome browser powered by Scribl looks significantly better. As far as APIs for genome feature visualization go, my only experience is with the GenomeTools AnnotationSketch API (with which I’ve been quite pleased, I must say), so it is nice to see this type of functionality being extended into the domain of web-based languages (who doesn’t like programming in JavaScript?).

I’m not ready to invest too much time in either of these tools yet. Both of them have some significant limitations. However, I am excited about the direction things are going, and I could definitely see myself utilizing them (or their descendants or related tools) some time in the near future as the need arises.

DELIMINATE: comparison with gzip, bzip2, and zip

Yesterday, I saw an abstract in my RSS feed for a new compression algorithm called DELIMINATE. Two things from the abstract caught my attention. First, it claimed to be lossless—I’ve seen abstracts for lossy compression algorithms before, but honestly I have no interest in introducing any additional error into any Illumina data, whose quality is already somewhat crappy. It also claimed improvements in compression speed and compression ratio.

I figured I would give the software a try. The download process was nearly painless (I did have to provide my name and email), and the software “just worked” as advertised out-of-the-box. I was able to confirm that the compression is indeed lossless, and that the compression speed and ratio is indeed superior to all the other commonly-used compression programs/algorithms (I recorded a terminal session of the tests in this asciicast). The decompression time, however, was quite a bit slower than all the alternatives. This, in addition to the practical consideration of installing yet another compression program on every single computer you use frequently (and forcing my colleagues to do the same), is the only hesitation I have regarding DELIMINATE.

Improving performance of evolutionary algorithms with an ancestral cache

Today our department hosted a lunch for the graduate students to meet with the scientist who will be presenting during this afternoon’s department seminar, and I’m glad I went (not just for the pizza). This semester’s first speaker is actually a computer scientist–I’m pleasantly surprised since I’m in a genetics department. Anyway, it was a great chalk talk and I look forward to the formal presentation.

During lunch we chatted with the speaker about evolutionary algorithms. I first learned about evolutionary algorithms as an undergrad. Despite the term’s connotation, evolutionary algorithms have a lot more to do with computer science and problem solving in general than biology. The algorithms borrow ideas from evolutionary theory (such as mutation, fitness, and selection) to try to guess better solutions to a problem. If you have a population of potential answers to a solution, you can mutate that population, select potential solutions with desirable traits, and repeat the process. Over time, these algorithms have the capability to generate nearly-optimal solutions to problems that are otherwise very difficult to solve.

The main gist of the chalk talk is that you can improve the performance of an evolutionary algorithm by drawing from an ancestral cache. Essentially, the idea is that every once in a while, you mutate a solution by reaching back several generations and copying characteristics of an ancestral solution. When he first tried this approach, he got better results when pulling from grandparent- and greatgrandparent-solutions than from parent solutions. However, when he tried reaching back even further (hundreds or even thousands of generations back), the performance improvement was even more pronounced. We discussed possible analogs to evolutionary theory as it applies to living organisms (as opposed to problems formed in mathematical terms), one of which is a putative repair mechanism for Arabidopsis. Overall, it was a very interesting and fun discussion.

Model vectors revisited: gene structure comparison, maximal transcript cliques, and the Bron-Kerbosch algorithm

While a string of As, Cs, Gs, and Ts is a gross simplification of the DNA molecule’s chemical complexity, knowing which nucleotides occur in which sequence is sufficient for many (if not most) applications in genomics and bioinformatics. Additionally, representing DNA sequences as character strings simplifies the process of implementing algorithms for sequence analysis. Thus the Fasta format has become nearly ubiquitous in the life sciences.

There is far less consensus, however, when it comes to formats for genome annotations. This area is dominated mostly by tab-delimited plain text formats such as GFF3, BED, and GTF. These formats are very flexible and relatively simple to parse using command-line tools. Unfortunately, this flexibility comes with several limitations, one of which is the fact that it is not always straightforward to compare two sets of annotations stored in these formats. If genome annotations could be represented by a single character string, however, then comparison and analysis of annotations becomes much more simple.

This is the basic idea behind gene model vectors, a concept I discussed previously. Where the Fasta format uses characters to classify each nucleotide in terms of nitrogenous base content, a gene model vector uses characters to classify each nucleotide in terms of gene structure (i.e., whether the nucleotide is part of an intron, the coding sequence, untranslated region, etc). This representation makes analysis and comparison of gene structure annotations much simpler.

However, there is a limitation to this approach as well. For a locus that contains an alternatively spliced gene or multiple overlapping gene models, any given nucleotide may be associated with multiple transcripts and will thus have more than one annotation. In this case, a single model vector is not sufficient to represent all annotated information at that locus (see my previous post on locus annotation). Creating a single model vector for every single transcript may not make sense though, since non-overlapping transcripts should be represented simultaneously in the same model vector when possible. Basically, we want to represent transcript annotations for the locus using the fewest possible model vectors, packing as much information as we can into each vector without overlapping transcripts.

After a bit of research, I found that this essentially reduces to a graph theory problem. First we treat each transcript annotation as a node in a graph, and then we add an edge between two nodes if they do not overlap. We want to group the transcripts into the smallest number of groups such that no transcripts in a group overlap and all transcripts are part of at least one group. This is equivalent to enumerating all maximal cliques in the transcript graph just described (in graph theory, this is called the maximal clique enumeration problem). If we can find all of the maximal cliques in the graph, then we can represent each clique with a single model vector and the number of cliques has been minimized.

Unfortunately, the maximal clique enumeration problem is NP-complete, meaning that there is no efficient algorithm for solving it. This shouldn’t be too much of a concern, however, since overlapping gene models are pretty rare and the number of splice variants for alternatively spliced genes is typically low. The standard Bron-Kerbosch algorithm, with an exponential runtime in the worst case, is more than sufficient.

When comparing two sets of gene annotations, I use this algorithm in the following workflow.

  • First, I determine the position of each locus (see my previous post on locus annotation).
  • Next, for each locus, I consider the two sets of gene anotations separately and use the Bron-Kerbosch algorithm to enumerate all maximal transcript cliques for each set.
  • Then, I generate a model vector for each maximal clique using the transcript(s) belonging to that clique.
  • Finally, I compare each each model vector from the first set with each model vector from the second set to calculate comparison statistics for the locus.

Permutation algorithm

In a previous post, I described how I am using model vectors to represent gene structure along a genomic sequence. I have also been working on a program that will compare two model vectors and print out summary statistics of the comparison. This is particularly handy if you want to compare two sets of gene annotations (from separate gene prediction tools) for the same genomic region. Believe me, you do not want to compare GFF3 files manually…

My program needed reusable tools to read in annotations in GFF3 format and then generate a model vector for any given region along the genomic sequence. The algorithm required for this was surprisingly complex, and there are two reasons. First, we don’t specify how many gene models are in that region. Second, if there are indeed multiple gene models in that region and those gene models are alternatively spliced, then a single model vector is insufficient to represent all possible gene structures for that region. The algorithm needed to determine how many gene models there are, how many splice variants there are per gene model, and then generate model vectors representing every possible way to combine one transcript from each gene model.

Imagine we have a tool that will give us all of the gene models in a particular region with the following data structure.

# This is an (over)simplistic example. Real gene models would of course be much longer.
my $gene_models =
{
  gene_1 =>
  {
    gene_1_transcript_3 =>
    {
      start => 1,
      end => 21,
      vector => 'FFFCCCCCCCCCIIICCCTTT',
    },
    gene_1_transcript_1 =>
    {
      start => 1,
      end => 21,
      vector => 'FFFCCCIIICCCIIICCCTTT',
    },
    gene_1_transcript_2 =>
    {
      start => 1,
      end => 21,
      vector => 'GGGCCCIIICCCIIICCCTTT',
    },
  },
  gene_2 =>
  {
    gene_2_transcript_1 =>
    {
      start => 30,
      end => 42,
      vector => 'CCCIIICCCCTTT',
    },
    gene_2_transcript_2 =>
    {
      start => 30,
      end => 42,
      vector => 'CCCCCCCCCCTTT',
    },
  },
};

Given model vectors in this format and the start and end positions of the region, the following subroutine will generate vectors representing every possible combination of vectors from the gene models.

sub get_vectors_by_region
{
  my($gene_models, $region_start, $region_end) = @_;

  # Determine how many vectors are required to represent this region
  my $num_combinations = 1;
  my $transcript_models = {};
  while(my($gene_id, $gene_model) = each(%$gene_models))
  {
    my $num_transcripts =()= keys(%$gene_model);
    $num_combinations *= $num_transcripts;
  }

  # Generate the vectors
  my $region_vectors = {};
  for(my $v = 0; $v < $num_combinations; $v++)
  {
    # Generate the next combination of transcripts
    my $current_combination = {};
    while(my($gene_id, $gene_model) = each(%$gene_models))
    {
      my @vector_ids = keys(%$gene_model);
      my $M = scalar(@vector_ids);
      my $transcript_id = $vector_ids[$v % $M];
      $current_combination->{$transcript_id} = $gene_model->{$transcript_id};
    }

    # Place the transcript vectors appropriately in the region vector
    my $region_length = ($region_end - $region_start + 1);
    my $region_vector = "G"x$region_length;
    while(my($transcript_id, $transcript_model) = each(%$current_combination))
    {
      my $subvector = $transcript_model->{vector};
      my $front_trim = ($region_start - $transcript_model->{start});
      if($front_trim > 0)
      {
        $subvector = substr($subvector, $front_trim);
      }
      my $end_trim = ($region_end - $transcript_model->{end});
      if($end_trim < 0)
      {
        $subvector = substr($subvector, 0, $end_trim);
      }
      my $subvector_start = $transcript_model->{start} - $region_start;
      $subvector_start = 0 if($subvector_start < 1);
      substr($region_vector, $subvector_start, length($subvector)) = $subvector;
    }
    my $region_vector_id = join(":", keys(%$current_combination));
    $region_vectors->{$region_vector_id} = $region_vector;
  }

  # Congrats, here are the vectors required to represent all combinations of splice
  # variants in the region.
  return $region_vectors;
}

For the example above, this algorithm would return the following vectors.

$region_vectors =
{
  'gene_1_transcript_1:gene_2_transcript_1' => 'FFFCCCIIICCCIIICCCTTTGGGGGGGGCCCIIICCCCTTT',
  'gene_1_transcript_1:gene_2_transcript_2' => 'FFFCCCIIICCCIIICCCTTTGGGGGGGGCCCCCCCCCCTTT',
  'gene_1_transcript_2:gene_2_transcript_1' => 'GGGCCCIIICCCIIICCCTTTGGGGGGGGCCCIIICCCCTTT',
  'gene_1_transcript_2:gene_2_transcript_2' => 'GGGCCCIIICCCIIICCCTTTGGGGGGGGCCCCCCCCCCTTT',
  'gene_1_transcript_3:gene_2_transcript_1' => 'FFFCCCCCCCCCIIICCCTTTGGGGGGGGCCCIIICCCCTTT',
  'gene_1_transcript_3:gene_2_transcript_2' => 'FFFCCCCCCCCCIIICCCTTTGGGGGGGGCCCCCCCCCCTTT',
};

Here is source code for the entire test case if you want to run it yourself.

Of course, I wasn’t smart enough to come up with this all by myself. This solution is adapted from the non-recursive solution posted by polygenelubricants in this StackOverflow thread. My Perl code is a bit more hairy than the Java code he provides, but then again the problem is a bit more hairy with associative arrays than it is with simple character arrays!