Literate programming, RStudio, and IPython Notebook

A substantial portion of scientific software is written by scientists with little or no formal training in software engineering. As a result, many scripts and programs are untested, unreliable, a huge pain to install or configure, and unlikely to be accessible for more than a few months or years. One does not need to be a software engineer to write good, useful software. For example, simply adding a reasonably descriptive command-line interface to a script or program does not take much work, and can go a long way in making your program easy to use. Placing the software under version control and adding automated tests also provide a huge benefit without a significant amount of additional effort. An aside: I was recently certified as an instructor for Software Carpentry, which provides excellent 2-day workshops introducing these topics at a beginner level to scientists.

My point is that if you’re writing software to be distributed amongst collaborators and colleagues, there are well established software engineering practices to make this a painless experience for everyone involved, and learning the basics doesn’t require much work.

However, a lot of scientific computing does not involve writing scripts or programs to be used by a wide audience. Often, you need to explore the data. You want to compute a variety of statistics and visualize data distributions. You need to download other data sets for comparison, and you feverishly take notes to make sure you can repeat the analysis later if needed—and let’s just be honest here, of course you’re going to have to redo it later! Exploratory data analysis is a tedious and frustratingly messy process. We often have to throw everything we know about sound software engineering out the window in the name of simply getting crap done. And even if you are able to organize your data and scripts and notes just so, sharing this with colleagues and disclosing details for publication can also be very messy.

For over a year I had been hearing a lot about RStudio and IPython Notebook, but it wasn’t until I finally tried them a couple of months ago that I realized how big of a game changer these tools are for data analysis. Both facilitate a kind of literate programming where natural language (like English prose) can be interspersed among source code and the output produced by executing that code. This facilitates placing an entire analysis in a single easy-to-edit document, which renders beautifully and seamlessly with formatted text, blocks of syntax-highlighted source code, and code output, including both text and graphics. These individual documents are easily placed under version control and shared with colleagues via email, GitHub, or online hosting services like RPubs and IPyNb Viewer.

So how do RStudio and IPython Notebook work? RStudio provides an integrated development environment (IDE) for R programming. In that environment, you can create an Rmd (R Markdown) file for your analysis. You can add text to describe your analysis, and Markdown allows you to format the text with headers, bold, italics, hyperlinks, and so on. Any code that you want to execute is placed in code blocks, and then when you want to execute that code you click the “knit HTML” button. This will create an HTML document with your formatted text, your syntax-highlighted code, and any output produced by the code (whether it be text or a graphic).

The IPython Notebook is pretty similar, except that the notebook runs in your web browser instead of an IDE. Perhaps the biggest functional difference between the two is that in IPython Notebook, you can run and re-run individual code blocks independently and see the results. This is helpful if you want to rapidly test small changes without re-running time intensive steps. In RStudio you must re-knit the entire document if you want to see any changes.

So if it’s not clear already how this solves the problem of messy exploratory data analysis, here’s my perspective. Using IPython notebook, on a whim I can launch into a computing task, and my notebook will keep a faithful record of all that I did. At any point, I can save the notebook and close it, and then come back to it later. I can go back through later and add notes, or clean up any existing notes. I can easily publish it online for my collaborators to see, and if they’re so inclined, they can easily download and edit the notebook themselves!

So whether you prefer R or Python for your data analysis needs, we now have excellent free tools at our disposal to take a lot of the pain out of your scientific computing.

Quotation marks and R’s read.table function

Today I lost about 90 minutes of my life struggling with some strange behavior in R. I had a data file with 209675 rows and 30 columns. Using shell tools I checked and re-checked the number of lines in the file and the number of values in each line, and got the expected 209675 x 30. But when I tried to load the data into R using `read.table`, it created a 105330 x 30 data frame without any explanation, warning, or error.

I tried filling in blank fields with sentinel values, and a variety of other things, but still I could not get a data frame with the correct dimensions.

Finally, I tracked down the source of the issue to several single quotation marks in the file. One of the fields in the data file contained a functional prediction for a transcript, and occasionally these values used the single quote to refer to sequence orientation (such as PREDICTED: bis(5′-nucleosyl)-tetraphosphatase [asymmetrical]-like [Apis florea]). Any time R encountered a single quote, it would consider everything beginning from that quote to the next quote as a single string value.

I have run into problems with quotes before–primarily with SQL files where quotation marks in strings had not been properly escaped. The solution in this case, however, was not to escape the single quotation marks, but just to instruct R to treat them as normal characters.

```data <- read.table("pdom.expression.data.txt", header=TRUE, sep="\t", quote="")
dim(data)
```

Here, the `quote=""` argument is the key. By providing an empty string, I’m essentially telling R that I want to disable the default behavior altogether. If instead I had wanted to preserve the default behavior and just use a different set of quotation delimiters, I would just provide those delimiters as the argument.

```# This would preserve default behavior, but only for double quotation marks

# This would preserve default behavior for double quotation marks and and asterisks
```

R: plot histogram as a curve

I typically use R for visualization, and I am familiar with a few basic visualization types. The simplest way to plot the distribution of a 1-dimensional data set is the histogram, available via the function `hist`. However, sometimes I prefer plotting a curve rather than the bars that `hist` creates. So given some 1-dimensional data, I typically end up using some combination of `sort` and `uniq -c` to convert the data to X and Y values that can be used with R’s `plot` function.

However, in the last couple of days I learned a nifty little shortcut that will save me the trouble in the future. Rather than pre-processing the data, I can store the histogram as an object and retrieve the X and Y values from that object. For example, this week I wanted to look at the distribution of read lengths in some Illumina data I had just cleaned. I used the following commands to plot the distribution as a curve rather than a histogram. No pre-processing required!

```lengths <- read.table("lengthdist.dat")
h <- hist(lengths\$V1, plot=FALSE)
plot(h\$mids, h\$density, type="l", col="red", main="", xlab="Read length", ylab="Frequency")
```

Batch installation of R packages

Recently, I was writing a shell script to automate the installation of some bioinformatics tools on a particular platform. One of the tools depended on the R package “VGAM”. Installing packages manually is trivial—you simply fire up R, type `install.packages("YourPackageName")`, select your closest mirror, and then BAM! you’re done.

I’ve used the shebang-able Rscript before to run R one-liners on the command line (try `Rscript -e 'rnorm(5)'` on your prompt), so I figured batch installation of packages would be just as simple, right? Well, yes and no. If you try to just run the `install.packages` function as is using `Rscript`, the command will fail since no mirror is specified.

```[standage@lappy ~] Rscript -e 'install.packages("VGAM")'
Error in contrib.url(repos, type) :
trying to use CRAN without setting a mirror
Calls: install.packages -> .install.macbinary -> contrib.url
Execution halted
```

A bit of Google searching provided a couple of solutions to this problem: either use the `chooseCRANmirror` function…

```# Use the getCRANmirrors() function for a list of all possible mirrors
chooseCRANmirror(76)
install.packages("VGAM")
```

…or simply pass the mirror URL as the `repos` argument.

```install.packages("VGAM", repos="http://ftp.ussg.iu.edu/CRAN/")
```

I ended up going with the second one, and it worked beautifully!

Plotting log-scale axes in R

Wow, it feels like a long time since I have blogged, but it’s only been a few weeks. I’m teaching a class on computational genome science this semester, and taking another one on the evolution of genes and genomes, so yeah, coursework has been kicking me in the butt the last couple of months. But anyway, enough meta-blogging…

This week I was reading about codon usage bias in Mike Lynch’s excellent book The Origins of Genome Architecture. In the last section of Chapter 6, he highlights the perils of automatically attributing elevated usage of “non-optimal” codons to selection and ignoring the stochastic effects of population genetic processes. Figure 6.9 plots the equilibrium frequency of “optimal” codons given different mutation rates and different values of $2N_gs$ (the power of selection on a gene).

I wanted to use this figure for an assignment. Since it is not readily available in electronic form, and since scanning it in would be lame, and since the equations used to generate the plots are clearly laid out in the book, and since I’m a curious and headstrong scientist…yeah, I had no reason not to generate the plot myself using R. I quickly threw together an R script to plot the graphic.

```# The n variable represents different values of 2N_{g}s
n <- 1

# The x variable is the parameter of the equation and represents u/v, the
# mutation rate from "optimal" to "non-optimal" codon
x <- seq(0, 100, by=.0001)

y <- exp(n)/(exp(n)+x)
plot(x, y, type="l",
xlab="Mutation rate toward non-optimal codon (u/v)",
ylab="Equilibrium frequency of optimal codon (p̃)")

n <- 3
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="blue")

n <- 0
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="red")
```

This script worked, although it was immediately clear why the graphic in the book had used a log scale for the x-axis: my plot didn’t look that great. So I started trying to work out in my head whether I should transform the x values or the y values, and how I was going to properly label the x-axis, when I thought “there has got to be a better way to do this; I bet this is built in to R.” The help page for the `plot` function didn’t give any clues, but a bit more searching helped me find what I was looking for. The `log` option of the `plot` function lets you specify whether to plot the x-axis or the y-axis or both on a log scale (using “x”, “y”, and “xy” as arguments, respectively). I then found a cool trick to label the axis using exponents rather than decimals or the lame default scientific notation. Anyway, here is the updated code…

```# The n variable represents different values of 2N_{g}s
n <- 1

# The x variable is the parameter of the equation and represents u/v, the
# mutation rate from "optimal" to "non-optimal" codon
x <- seq(0, 100, by=.0001)

y <- exp(n)/(exp(n)+x)
plot(x, y, log="x", type="l", xaxt="n", xlim=c(0.01, 100),
xlab="Mutation rate toward non-optimal codon (u/v)",
ylab="Equilibrium frequency of optimal codon (p̃)")
ticks <- seq(-2, 2, by=1)
labels <- sapply(ticks, function(i) as.expression(bquote(10^ .(i))))
axis(1, at=c(0.01, 0.1, 1, 10, 100), labels=labels)

n <- 3
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="blue")

n <- 0
y <- exp(n)/(exp(n)+x)
lines(x, y=y, col="red")
```

…and the final product.

Many thanks to the following StackOverflow threads for pointing me in the right direction.

• On this one, I saw the `log` parameter for the `plot` function.
• On this one, I learned the cool trick for the axis labels.

Exon distribution in the human genome

I have recently been developing some software for comparing gene structure annotations from different sources. Up until last week, I had only tested my software on plant data, and decided that a test on human genome annotations would be good. After implementing and testing a few new features on a small human data set, I ran the software on the entire human genome. About a minute into execution, the program crashed with a segfault. I was justifiably disappointed, since the program had tested well with other data. I had no idea where to begin looking for a bug.

After troubleshooting for some time, I started reviewing the assumptions I make at each stage of the program. As I was doing this, I realized that I was assuming that a gene would have no more than 256 exons. Of course, most genes will have well under 100 exons, but I thought 256 would be a safe upper bound. Incorrect! After a bit of digging, I found that in the GRCh37.61 release, there are 3 genes with more than 256 predicted exons! When my program tried to analyze these genes, it caused memory issues and the program crashed. When I reluctantly increased the limit to 512, the program ran fine.

This got me interested how many genes in the human genome have an extreme number of exons. Of course, at this point it’s not plausible (possible?) to determine whether a gene actually has hundreds of exons or whether such an annotation is the artifact of the gene prediction workflow, but I thought this would be informative nonetheless. I began by firing up R and grabbing some summary statistics of the two latest gene builds for the human genome…

```> counts.61 <- scan("hg61-numexons.txt", integer(0))
> length(counts.61)
[1] 16165
> length(counts.61[counts.61 > 100])
[1] 30
> length(counts.61[counts.61 > 256])
[1] 3
> mean(counts.61)
[1] 12.84955
> median(counts.61)
[1] 9
> var(counts.61)
[1] 172.7238
> quantile(counts.61)
0%  25%  50%  75% 100%
1    5    9   17  291
>
>
> counts.62 <- scan("hg62-numexons.txt", integer(0))
> length(counts.62)
[1] 16165
> length(counts.62[counts.62 > 100])
[1] 11
> length(counts.62[counts.62 > 256])
[1] 0
> mean(counts.62)
[1] 10.73764
> median(counts.62)
[1] 8
> var(counts.62)
[1] 101.2633
> quantile(counts.62)
0%  25%  50%  75% 100%
1    4    8   14  248
```

…and plotting some histograms.

```> png("hg61-numexons.png")
> hist(counts.61[counts.61 < 100], border="red", col="#ffeeee", main="# Exons per Gene (GRCh37.61)", xlab="# Exons", ylab="Frequency")
> dev.off()
null device
1
> png("hg62-numexons.png")
> hist(counts.62[counts.62 < 100], border="blue", col="#eeeeff", main="# Exons per Gene (GRCh37.62)", xlab="# Exons", ylab="Frequency")
> dev.off()
null device
1
```

So it seems that the predicted number of exons per gene is decreasing with each release. I assume this means that the gene prediction workflows are becoming more accurate at predicting true exon splice sites and more capable of discerning true exons from unwanted artifacts (spurious exons, exons from repetitive or transposable sequences, etc).

Axis Labels in R

I use R for generating most of my scientific graphics, and it generally does a pretty good job with the axes by default. However, you are given a lot of flexibility if you want to tweak things and get them “just right.”

Recently, I was plotting a distribution with a very large mean (in the hundreds of thousands range). Printing out something like `350000` every X-axis tick mark would have cluttered the graphic, so R by default printed the tick mark labels in scientific notation. It was definitely less cluttered, but also not very readable. I decided I would prefer labels like `350k` rather than `3.5e5`.

The `axis` command is what I needed. First, I set the `xaxt` parameter to `n` in the initial plot command using the to leave the X axis blank. I then used the axis command to specify where I wanted the axis printed (bottom), the size of intervals between tick marks, and how I wanted to print the labels. This was my final code.

```plot(x, y, type='l', xaxt='n')
axis(1, at = x <- seq(3.5e5, 8e5, by = 5e4),  labels = paste(x/1000, "k", sep = "" ))
```