# R: plot histogram as a curve

I typically use R for visualization, and I am familiar with a few basic visualization types. The simplest way to plot the distribution of a 1-dimensional data set is the histogram, available via the function `hist`. However, sometimes I prefer plotting a curve rather than the bars that `hist` creates. So given some 1-dimensional data, I typically end up using some combination of `sort` and `uniq -c` to convert the data to X and Y values that can be used with R’s `plot` function.

However, in the last couple of days I learned a nifty little shortcut that will save me the trouble in the future. Rather than pre-processing the data, I can store the histogram as an object and retrieve the X and Y values from that object. For example, this week I wanted to look at the distribution of read lengths in some Illumina data I had just cleaned. I used the following commands to plot the distribution as a curve rather than a histogram. No pre-processing required!

```lengths <- read.table("lengthdist.dat")
h <- hist(lengths\$V1, plot=FALSE)
plot(h\$mids, h\$density, type="l", col="red", main="", xlab="Read length", ylab="Frequency")
```

# Plotting log-scale axes in R

Wow, it feels like a long time since I have blogged, but it’s only been a few weeks. I’m teaching a class on computational genome science this semester, and taking another one on the evolution of genes and genomes, so yeah, coursework has been kicking me in the butt the last couple of months. But anyway, enough meta-blogging…

This week I was reading about codon usage bias in Mike Lynch’s excellent book The Origins of Genome Architecture. In the last section of Chapter 6, he highlights the perils of automatically attributing elevated usage of “non-optimal” codons to selection and ignoring the stochastic effects of population genetic processes. Figure 6.9 plots the equilibrium frequency of “optimal” codons given different mutation rates and different values of $2N_gs$ (the power of selection on a gene).

I wanted to use this figure for an assignment. Since it is not readily available in electronic form, and since scanning it in would be lame, and since the equations used to generate the plots are clearly laid out in the book, and since I’m a curious and headstrong scientist…yeah, I had no reason not to generate the plot myself using R. I quickly threw together an R script to plot the graphic.

```# The n variable represents different values of 2N_{g}s
n <- 1

# The x variable is the parameter of the equation and represents u/v, the
# mutation rate from "optimal" to "non-optimal" codon
x <- seq(0, 100, by=.0001)

y <- exp(n)/(exp(n)+x)
plot(x, y, type="l",
xlab="Mutation rate toward non-optimal codon (u/v)",
ylab="Equilibrium frequency of optimal codon (p̃)")

n <- 3
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="blue")

n <- 0
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="red")
```

This script worked, although it was immediately clear why the graphic in the book had used a log scale for the x-axis: my plot didn’t look that great. So I started trying to work out in my head whether I should transform the x values or the y values, and how I was going to properly label the x-axis, when I thought “there has got to be a better way to do this; I bet this is built in to R.” The help page for the `plot` function didn’t give any clues, but a bit more searching helped me find what I was looking for. The `log` option of the `plot` function lets you specify whether to plot the x-axis or the y-axis or both on a log scale (using “x”, “y”, and “xy” as arguments, respectively). I then found a cool trick to label the axis using exponents rather than decimals or the lame default scientific notation. Anyway, here is the updated code…

```# The n variable represents different values of 2N_{g}s
n <- 1

# The x variable is the parameter of the equation and represents u/v, the
# mutation rate from "optimal" to "non-optimal" codon
x <- seq(0, 100, by=.0001)

y <- exp(n)/(exp(n)+x)
plot(x, y, log="x", type="l", xaxt="n", xlim=c(0.01, 100),
xlab="Mutation rate toward non-optimal codon (u/v)",
ylab="Equilibrium frequency of optimal codon (p̃)")
ticks <- seq(-2, 2, by=1)
labels <- sapply(ticks, function(i) as.expression(bquote(10^ .(i))))
axis(1, at=c(0.01, 0.1, 1, 10, 100), labels=labels)

n <- 3
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="blue")

n <- 0
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="red")
```

…and the final product.

Many thanks to the following StackOverflow threads for pointing me in the right direction.

• On this one, I saw the `log` parameter for the `plot` function.
• On this one, I learned the cool trick for the axis labels.

# New visualization tools

I just saw two Bioinformatics applications notes in my RSS feed this week that caught my attention, both web-based tools for visualizing genome annotation.

ChromoZoom is a genome browser that offloads the bulk of the processing to the browser client, as opposed to the more common bulky Perl-heavy server-side genome browsers like GBrowse and xGDB. Unlike other browsers that focus on client side processing, ChromoZoom can load flat files without the need for preprocessing, and it provides a bit of newfangled UI swag in the form of smooth zooming and intertial scrolling.

Scribl is a JavaScript-based graphics library designed to facilitate the development of dynamic web-based tools for visualizing genomic features. The examples provided by Scribl all look very rudimentary (almost unprofessionally so), although the Rover genome browser powered by Scribl looks significantly better. As far as APIs for genome feature visualization go, my only experience is with the GenomeTools AnnotationSketch API (with which I’ve been quite pleased, I must say), so it is nice to see this type of functionality being extended into the domain of web-based languages (who doesn’t like programming in JavaScript?).

I’m not ready to invest too much time in either of these tools yet. Both of them have some significant limitations. However, I am excited about the direction things are going, and I could definitely see myself utilizing them (or their descendants or related tools) some time in the near future as the need arises.

# Color coding for custom anonymous user tracks in GBrowse2

GBrowse2 provides a pretty nice feature that enables anonymous users to upload their own data and set up a custom, private track for visualization in the context of all the permanent, official tracks provided by the host. A wide variety of configuration options are available to customize the look, feel, and behavior of each individual data track. If you have administrative access to the genome browser, you can even adding Perl callback functions to configuration settings to enable dynamic calculation of each genomic feature’s color, shape (glyph), label, hover text, and click action. This is a powerful and flexible approach that enables a high level of customization and integration.

This week, I wanted to add feature-specific color-coding to my custom data track. My first attempt was to add callback functions to the `fgcolor` and `bgcolor` configuration settings to calculate the feature’s color based on a discriminative GFF3 attribute. Unfortunately, scriptable configurations are not available for anonymous user tracks. This is undoubtedly for good reason (since allowing these would introduce significant security vulnerabilities), but this means I had to figure out another approach to color-coding.

The GBrowse mailing list suggested I encode a discriminative value in the “score” column of the (GFF3) data file and then use the `graded_segments` glyph for rendering. This did indeed provide some color coding. However, the `graded_segments` glyph only allows you to specify one base color, and modulating a feature’s score will only change the intensity or fill of that color, not the hue itself. I decided to look around a bit more, and discovered the `heat_map` glyph. This glyph allows you to specify two colors. Then, the color for each feature is determined by the mapping of that feature’s score to the continuous spectrum between the two specified colors. I liked this approach much better, as it gave me a much larger palette with which to work.

For my final solution, I ended up setting each feature’s score to a discrete value between 0 and 1 (inclusive), and then adding the following settings to my track configuration.

```glyph       = heat_map
min_score   = 0
max_score   = 1
start_color = red
end_color   = green
```

This turned out quite nicely. The value 0 maps to red, 0.25 maps to orange, 0.5 maps to yellow, 0.75 maps to a light green, and 1.0 maps to a darker true green.

# Text wrapping with dot (graphviz)

Recently I have been studying a software application that introduces some novel algorithms for correcting errors in next-generation sequencing data. To better understand exactly what one algorithm was doing, I wrote down the logic (flowchart style) on a piece of paper. This was a very helpful process. However, I wanted to copy a preserve electronically. Since my pseudo-flowchart was pretty much a binary tree, I decided I would create a graphic with the `graphviz` tool.

I learned the bare essentials of graphviz a few years ago, so after looking at some of my old examples I created this `.dot` file.

```digraph
{
node [color=Blue,shape=box]

1.1 [label="Frequency of t exceeds upper threshold"]
2.1 [label="t has d-mutant tiles"]
2.2 [label="Valid"]
3.1 [label="Frequency of t exceeds lower threshold"]
3.2 [label="Frequency of t exceeds lower threshold"]
4.1 [label="Insufficient evidence"]
4.2 [label="Valid"]
4.3 [label="t has only one d-mutant that exceeds lower threshold"]
4.4 [label="Are there any d-mutant tiles with significantly higher frequencies?"]
5.1 [label="Insufficient evidence"]
node [color=Green] 5.2 [label="Correct t to t'"] node [color=Blue]
5.3 [label="t has a d-mutant tile t' that is closer than all other d-mutant tiles and for which a corrected base has a higher quality score"]
5.4 [label="Valid"]
6.1 [label="Insufficient evidence"]
6.2 [label="t' is unique"]
7.1 [label="Insufficient evidence"]
node [color=Green] 7.2 [label="Correct t to t'"] node [color=Blue]

1.1 -> 2.1 [label="no"]
1.1 -> 2.2 [label="yes"]
2.1 -> 3.1 [label="no"]
2.1 -> 3.2 [label="yes"]
3.1 -> 4.1 [label="no"]
3.1 -> 4.2 [label="yes"]
3.2 -> 4.3 [label="no"]
3.2 -> 4.4 [label="yes"]
4.3 -> 5.1 [label="no"]
4.3 -> 5.2 [label="yes"]
4.4 -> 5.3 [label="no"]
4.4 -> 5.4 [label="yes"]
5.3 -> 6.1 [label="no"]
5.3 -> 6.2 [label="yes"]
6.2 -> 7.1 [label="no"]
6.2 -> 7.2 [label="yes"]
}
```

The command to create a `.png` graphic from this file was easy.

`dot -Tpng -o test.png test.dot`

The result was almost what I was looking for, but not quite (click on the image for full-size version).

The problem is that the boxes containing a lot of text are are too wide. Unfortunately, `graphviz` does not allow automatic text wrapping, so you have to control box width by manually inserting `\n` characters into the box labels. Anyone that has done something like this before knows how tedious this process can be, and how frustrating it is to redo when you need to change the label text.

Instead, I wrote a simple Perl script that will automatically add line breaks to labels so that the label width does not exceed some given number.

```#!/usr/bin/perl
use strict;

my \$usage = "setdotlabelwidth [char-width] < [dotfile]";
my \$width = shift() or die("Usage: \$usage \$!");

while(<STDIN>)
{
if(m/label="(.*?)"/)
{
my \$labeltext = \$1;
my @words = split(/ /, \$labeltext);
my @newtext = ();
my \$newline = "";
foreach my \$word(@words)
{
if( length(\$newline) > 0 and
length(\$newline) + length(\$word) > \$width )
{
push(@newtext, \$newline);
\$newline = "";
}
\$newline .= " " if( length(\$newline) > 0 );
\$newline .= \$word;
}
push(@newtext, \$newline) if( length(\$newline) > 0 );
my \$newlabel = join("\\n", @newtext);
s/label=".*?"/label="\$newlabel"/;
}
print;
}
```

I saved this program as `setdotlabelwidth`, ran it, and simply piped the output into `graphviz`. In this case, I set the width to 35 characters.

`./setdotlabelwidth 35 < tile-error-correction.dot | dot -Tpng -o tile-error-correction.png`

Voilà, the graphic looks much nicer now! Again, click on the image for the full-size version.

# Drawing genome annotations with AnnotationSketch

In a previous post, I mentioned the GenomeTools package and, in particular, the AnnotationSketch tool. Here I give a few examples of how easy it is to use and customize.

Let’s say I have three sets of genome annotations for the same genomic region. Running AnnotationSketch with the defaults will give the following result.

```\$ gt sketch maize_default.png zm.chr5.0.yrgate.gff3 zm.chr5.1.cpgat.gff3 zm.chr5.2.msdo.gff3
```

Looking at the graphic sure beats trying to compare 3 GFF3 files, but unfortunately the graphic is still pretty busy. It took me a few days to figure out how to clean it up, but the AnnotationSketch tool is very customizable. Here are the adjustments I made.

• I made the graphic a bit wider using the command-line option `-width`.
• Since there are annotations from three different sources, I thought it would be nice to have a specific color scheme for each source. I copied the default style file from `\$GT_INSTALL_DIR/gtdata/sketch/default.style` and edited it. The style files allow you to apply styles conditionally using callback functions. You can see in the style file where I’ve used the callback functions to apply colors conditionally based on the file from which the feature came. Additionally, I collapsed all features related to protein-coding transcripts into the mRNA tracks and disabled the display of unrelated feature types by setting the respective `max_show_width` properties to 0. This drastically reduced the visual clutter. When I run AnnotationSketch again, I must remember to specify this style file with the `-style` command-line option.
• Finally, I used the `-flattenfiles` to collapse all 3 mRNA tracks into a single track. By default, AnnotationSketch will create a separate track for each feature type from each different file, but since I’ve applied a custom color scheme, the separate tracks are unnecessary.

With all these adjustments, I can rerun AnnotationSketch with the following command, producing a much more pleasing graphic.

```\$ gt sketch -flattenfiles -width 1200 -style gt.three.style maize_improved.png zm.chr5.0.yrgate.gff3 zm.chr5.1.cpgat.gff3 zm.chr5.2.msdo.gff3
```

The style files are quite easy to work with once you’ve done it once. GenomeTools provides some pretty in-depth documentation, including developer tools for C, Lua, Python, and Ruby if extensive customization is required.

# AnnotationSketch and GenomeTools

I have been developing a tool recently that requires visualization of gene structure annotations. My advisor suggested I look at the AnnotationSketch tool from the GenomeTools package, and I have to say that I have been very pleased with the software. GenomeTools is not just for visualization though–it has a variety of built-in genome informatics tools, as well as a development kit with support for several languages. The libraries are very well designed, flexible, well-documented, and efficient (coded in C). I also got prompt and helpful responses from developers through both the bug tracker and the mailing list.

All the functionality I needed for my tool is already built-in to the `gt sketch` tool created when GenomeTools is installed, and I’ll post some simple usage examples soon. However, very extensive customization is possible with the included developer APIs. After the dust settles a bit from some recent submission deadlines, I plan on taking a deeper look into some of the awesome features this software provides.