# Understanding RSEM: raw read counts vs expected counts

By the time I had my first real exposure to computational genomics, RNA-Seq was well on its way to supplanting microarrays as the preferred method for measuring mRNA expression at the genome scale. The fundamental idea is pretty simple: short sequence reads obtained from an mRNA sample can be mapped (by sequence similarity) back to a catalog of gene or mRNA sequences, and the number of reads mapping to each molecule then acts as a proxy for that molecule’s expression level in the sample. My first RNA-Seq project involved aligning ~75bp Illumina reads to a comprehensive collection of cotton mRNAs constructed from a sampling of cultivars/accessions and tissues. The subsequent differential expression analysis for this particular project was pretty non-standard and involved statistics that were (at least at the time) over my head, so my involvement in that facet of the project was primarily technical.

More recently, I’ve been spearheading the analysis of RNA-Seq data from paper wasps. We have 12 samples, 6 from queen wasps and 6 from workers, and the preliminary RNA-Seq analysis to be included in our soon-to-be-submitted genome paper will be a very basic investigation of expression differences between queens and workers. By recommendation of a colleague, I’ve been using RSEM (and Bowtie) for estimating expression levels and EBSeq for identifying differentially expressed genes. I am much more involved in the interpretation of results now than I had been for my undergrad project(s), and accordingly I’ve taken some time to really dig down, understand what these programs are doing under the hood, and check my assumptions.

The first point to address is why it is even necessary to run a program like RSEM to “estimate” expression values. If we’re using the number of reads mapped to a transcript as a proxy for the transcript’s expression level, why not just feed these values directly into EBSeq (or one of a dozen alternative programs)? The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads (multireads for short) entirely. Another involves partitioning and distributing portions of a multiread’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
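To make the EM idea concrete, here is a toy sketch in Python. This is emphatically not RSEM's actual model (which also accounts for fragment length distributions, sequencing errors, alignment positions, and more); each read is reduced to nothing but the set of transcripts it maps to, and the names are mine.

```python
# Toy sketch of the EM idea behind RSEM (NOT RSEM's actual model). Each read
# is reduced to the set of transcripts it maps to; EM alternates between
# fractionally assigning reads and re-estimating relative abundances.
def em_expected_counts(read_compat, n_iter=100):
    transcripts = sorted({t for ts in read_compat for t in ts})
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    counts = {}
    for _ in range(n_iter):
        # E-step: split each read among its compatible transcripts in
        # proportion to the current abundance estimates
        counts = {t: 0.0 for t in transcripts}
        for ts in read_compat:
            denom = sum(theta[t] for t in ts)
            for t in ts:
                counts[t] += theta[t] / denom
        # M-step: new abundances proportional to the expected counts
        theta = {t: counts[t] / len(read_compat) for t in transcripts}
    return counts

# 3 reads unique to transcript A, 1 unique to B, 2 multireads (A or B)
reads = [{"A"}, {"A"}, {"A"}, {"B"}, {"A", "B"}, {"A", "B"}]
counts = em_expected_counts(reads)
```

Because the unique reads suggest A is more abundant than B, the EM iterations push most of the multiread mass toward A rather than splitting it 50/50 the way a naive rescue method would, while the expected counts still sum to the total number of reads.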

With this basic understanding of how RSEM works (or how it is intended to work), the next step is to investigate whether the software actually works as advertised. How can we verify this? In the ideal case, the expected count estimated by RSEM for a transcript will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped to each transcript, multireads get counted multiple times (once for each transcript to which they map), so we can expect the raw count to be slightly larger than the expected count for many transcripts.

Expected count values are trivial to obtain from the RSEM output (5th column of the *.[genes|isoforms].results files), and I’ve written a simple program to count the number of reads mapping to each gene/mRNA given a .bam file produced by RSEM. Selecting a random subset of mRNAs in my sample and manually checking these values is a quick way to increase my confidence in the reliability of the experiment.
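The counting logic itself is simple. Here is a sketch of the idea (my real program parses RSEM's .bam output; for illustration, alignments are reduced to hypothetical (read ID, transcript ID) pairs):

```python
from collections import defaultdict

# Sketch of the raw-count logic: each read is counted once per transcript it
# maps to, so a multiread contributes to every transcript it hits.
def raw_read_counts(alignments):
    hits = defaultdict(set)
    for read_id, transcript_id in alignments:
        hits[transcript_id].add(read_id)
    return {t: len(read_ids) for t, read_ids in hits.items()}

alignments = [("r1", "txA"), ("r2", "txA"), ("r3", "txB"),
              ("r4", "txA"), ("r4", "txB")]  # r4 is a multiread
print(raw_read_counts(alignments))  # {'txA': 3, 'txB': 2}
```

Note that the four reads produce a total raw count of five: the multiread r4 is counted for both txA and txB, which is exactly why raw counts run slightly higher than RSEM's expected counts.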

The first time I compared raw read counts to RSEM’s expected counts, I encountered an unexpected trend: the expected counts were not slightly lower than the raw counts, they were consistently lower by a factor of 2. After thinking about this a bit, I considered the possibility that RSEM treats each pair of reads as a single unit given paired-end data. To confirm this, I selected a small subset of 10 million read pairs from one of my samples and ran RSEM twice: once in paired-end mode, and once in single-end mode disregarding the read pairing information. Consistent with my hypothesis, the expected counts produced in single-end mode were almost exactly 2 times the expected counts produced in paired-end mode.

Having taken the time to clearly grasp what this program is doing and to verify that it is performing as advertised, I am much more confident in the results of my subsequent analyses moving forward.

# Normalization for differential expression analysis

My experience with RNA-Seq and differential expression analysis goes back to my last year or two as an undergraduate working in BYU’s plant genetics lab. However, I’ve been spearheading a project recently with a significant RNA-Seq component, which has given me much more practical experience than I had before and has forced me to address some gaps in my understanding of the analysis’ details. One of these details is how data are normalized so as to enable comparison across samples.

Due to a variety of technical factors, the number of RNA-Seq reads sequenced from different samples will not be the same. Therefore, if we want to use read counts as a measure of a gene’s (or transcript’s) expression in a sample, and then compare that measurement of expression across multiple samples, we must adjust each sample’s expression measurements so that we’re “comparing apples to apples” as it were.

A straightforward approach to normalizing the data would be to adjust each sample’s expression estimates by the total number of reads in that sample. Indeed, this was my first thought as to how to address the issue. However, this approach is sensitive to outliers, which may have a disproportionate influence on total read count. An alternative method, described in the DESeq paper[1], is the “median of ratios” approach: each sample’s size factor is the median, taken across genes, of the ratio of that gene’s count in the sample to the gene’s geometric mean count across all samples, and each sample’s expression measurements are then divided by its size factor.
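Here is a sketch of the median-of-ratios calculation as I understand it from the DESeq paper (the naming is mine, and genes with a zero count in any sample would need to be excluded before taking logs):

```python
import math

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2.0

def size_factors(counts):
    """counts[j][i] = read count for gene i in sample j (all counts > 0)."""
    n_samples, n_genes = len(counts), len(counts[0])
    # per-gene pseudo-reference: geometric mean of the gene across samples
    ref = [math.exp(sum(math.log(counts[j][i]) for j in range(n_samples)) / n_samples)
           for i in range(n_genes)]
    # each sample's size factor: median of its gene-wise ratios to the reference
    return [median([counts[j][i] / ref[i] for i in range(n_genes)])
            for j in range(n_samples)]

# sample 2 sequenced twice as deeply: its size factor should be 2x sample 1's
sf = size_factors([[10, 30, 50], [20, 60, 100]])
```

In this toy case the second sample's size factor comes out exactly twice the first sample's, so dividing by the size factors puts the two samples on the same scale.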

As it turns out, this “MedianNorm” approach is implemented in the DE analysis package I was using (EBSeq[2]) and is used by default. However, I wanted to confirm that normalization was working as expected, so I went ahead and implemented an R script to plot the distributions of expression values before and after normalization for my dozen or so samples. Here is what the distributions look like before normalization…

…and here is what they look like post-normalization.

Needless to say, I had a lot more confidence going forward once I had visually confirmed that the normalization did what I expected it to do.

The R script I used to generate the before and after plots is available on GitHub, and is part of a hodge-podge RNA-Seq analysis toolkit I’ve been cobbling together for the last few months. Having a script like this on hand will be really helpful as I work on similar projects in the future.

# Permutation testing for differential expression analysis

Permutation testing is a pretty common non-parametric approach for assessing how unlikely a particular outcome is. It involves calculating a statistic of interest from your data, followed by several rounds of randomly shuffling sample labels and recalculating the test statistic. The proportion of shuffles that yield an outcome as extreme as the original one gives a sense for how likely it is that the original outcome is the result of random chance.

I’ve recently been involved in an RNA-Seq analysis, particularly looking at differential expression in paper wasps. With 6 biological replicates each from 2 different castes (queens and workers), I measured expression values and used these to identify genes expressed differentially between the castes. There were a few concerns with the preliminary analyses, so in addition to exploring the data in a bit more depth we also wanted to test the reliability of the analysis I had just done. Enter the permutation test.

The idea here is simple. The first step is to run the differential expression analysis with the correct sample labels (which I already did). Next, randomly shuffle the sample labels so that some “queen” samples are labeled “worker” and vice versa. Now, run the differential expression analysis again and note the number of genes designated as differentially expressed. Then simply repeat the label shuffling and re-analysis enough times to build up a picture of what happens under random labelings.

The expectation is that if I label some of my “queens” as “workers” and some of my “workers” as “queens”, and then try to look for consistent queen-vs-worker differences in gene expression, the mislabeled samples will cancel each other out to a large extent and confound the analysis. Due to technical and biological variation, we expect to still identify some differentially expressed genes even with shuffled labels, but nowhere near the number we identify when using the correct labeling. If random permutations of the sample labels frequently produce a similar number of differentially expressed genes as the correct labeling, there may be issues with the data that need additional attention. This was the case with my analysis, and I wonder how many scientists out there conduct RNA-Seq experiments without doing such basic quality assessment as this.
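The shuffling scheme can be sketched in a few lines of Python. Here a crude t-statistic cutoff stands in for a real DE caller like EBSeq, and the function names and threshold are mine, purely for illustration:

```python
import random
import statistics

# A crude two-sample t statistic stands in for a real DE test.
def t_stat(a, b):
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

def count_de_genes(expr, labels, cutoff=3.0):
    """expr: gene -> per-sample values; labels: 'Q' (queen) or 'W' (worker)."""
    n_de = 0
    for values in expr.values():
        queens = [v for v, lab in zip(values, labels) if lab == "Q"]
        workers = [v for v, lab in zip(values, labels) if lab == "W"]
        if abs(t_stat(queens, workers)) > cutoff:
            n_de += 1
    return n_de

def permutation_counts(expr, labels, n_perm=100, seed=42):
    rng = random.Random(seed)
    counts = []
    for _ in range(n_perm):
        shuffled = labels[:]       # shuffle the caste labels...
        rng.shuffle(shuffled)
        counts.append(count_de_genes(expr, shuffled))  # ...and re-run the DE call
    return counts
```

If the true labeling yields far more DE genes than the vast majority of permutations do, that is reassuring; if the permuted counts keep up with the real one, something in the data deserves a closer look.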

The script I implemented to facilitate permutation testing for my RNA-Seq analysis is available at https://github.com/standage/dept. The good news is that many of the most popular differential expression analysis packages are freely available from R/Bioconductor and use a similar input format, so if your data are already formatted for differential expression analysis there’s a good chance this script will work out-of-the-box for you.

# Random sampling with SQLite3

I’m working on a post about efficient random sampling with C, but today I came up with a pretty straightforward solution using SQLite3. Since the dataset I had in mind may be too big to fit into a reasonable amount of memory, and SQLite3 has a documented C API, that other post may never come to fruition. We’ll see…

Anyway, imagine you have some data in a SQLite3 database.

```
standage@localhost: ~$ sqlite3 temp.db
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> CREATE TABLE ids (id int primary key);
sqlite> .import data.txt ids
sqlite> SELECT * FROM ids;
1
2
3
4
5
6
7
8
9
10
```


Create an additional column to hold a flag indicating whether each row has been sampled.

```
sqlite> ALTER TABLE ids ADD COLUMN subset int not null default 0;
sqlite> SELECT * FROM ids;
1|0
2|0
3|0
4|0
5|0
6|0
7|0
8|0
9|0
10|0
```


Now here’s the magic. This example samples 5 values at random. If you want to sample more or fewer, just change that number and voilà!

```
sqlite> UPDATE ids SET subset = 1 WHERE id IN (SELECT id FROM ids ORDER BY RANDOM() LIMIT 5);
sqlite> SELECT * FROM ids;
1|0
2|0
3|1
4|1
5|0
6|1
7|0
8|1
9|1
10|0
```


Actually, there’s no magic at all, it’s quite simple.
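The same trick works just as well from code. Here is a minimal Python sketch using the standard library's sqlite3 module, with an in-memory database standing in for temp.db:

```python
import sqlite3

# Recreate the table from the example and sample 5 ids at random.
conn = sqlite3.connect(":memory:")  # stands in for temp.db
cur = conn.cursor()
cur.execute("CREATE TABLE ids (id int primary key)")
cur.executemany("INSERT INTO ids VALUES (?)", [(i,) for i in range(1, 11)])
cur.execute("ALTER TABLE ids ADD COLUMN subset int not null default 0")
cur.execute("UPDATE ids SET subset = 1 "
            "WHERE id IN (SELECT id FROM ids ORDER BY RANDOM() LIMIT 5)")
sampled = [row[0] for row in cur.execute("SELECT id FROM ids WHERE subset = 1")]
print(len(sampled))  # 5
```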

# Notation for HMM algorithms

I was exposed briefly to Hidden Markov Models (HMMs) as an undergraduate, but now as a graduate student I have revisited HMMs in much more detail (I need to know it “forward” and “backward”, if you catch my drift). Our instructor taught us with very simple examples (two hidden states, very short sequences), and the notation he used was specific to each example. This made it easy to grasp the concepts initially, but I’m not sure I would have been able to implement more complicated HMMs with only that notation to help me. In preparing for a recent exam, however, I’ve finally come across some notation that is more generalizable. I’ll use it here to describe the forward, backward, and Viterbi algorithms.

## Terminology

First let us provide some formal definitions. We have an observed sequence $O = (O_1 O_2 ... O_N)$ where $O_i \in A = \{ a_1, a_2, ..., a_r \}$ for all $i \in \{1, 2, ..., N\}$. For each $O_i$, there is an unobserved state $Q_i \in S = \{ S_1, S_2, ..., S_s \}$ forming the corresponding hidden state sequence $Q = (Q_1 Q_2 ... Q_N)$. Additionally we define initial state probabilities $\pi$ (where $\pi_{S_i}$ is the probability of starting a sequence in state $S_i$), transition probabilities $\tau$ (where $\tau_{S_i S_j}$ is the probability of moving from state $S_i$ to state $S_j$), and emission (or output) probabilities $\theta$ (where $\theta_{a_i | S_j}$ is the probability of observing the symbol $a_i$ given that we know the corresponding hidden state is $S_j$).

## Forward algorithm

The forward algorithm provides the most straightforward way to determine the probability of an observed sequence given the HMM (i.e. $Prob\{O_1 O_2 ... O_N\}$). This algorithm introduces a probability $F_{i,j} = Prob\{O_1 O_2 ... O_i, Q_i = S_j\}$, which we can conceptualize as an $s \times N$ matrix. We initialize the first column of this matrix like so.

$F_{1,j} = \pi_{S_j}\theta_{O_1 | S_j}$

We can then continue to fill out the rest of the matrix using the following recurrence.

$F_{i,j} = \sum_{k=1}^{s} F_{i-1,k} \tau_{S_k S_j} \theta_{O_i | S_j}$

Once the matrix is filled out, we can get our desired probability $Prob\{O\} = Prob\{O_1 O_2 ... O_N\}$ by summing over the values in the last column of the matrix.

$Prob\{O\} = \sum_{k=1}^{s} F_{N, k}$
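In this notation the forward recurrence translates almost directly into code. Here is a sketch in Python with a made-up two-state HMM (the probabilities are arbitrary, chosen only for illustration):

```python
# Forward algorithm in the notation above: F[i][j] corresponds to F_{i+1,j}
# (0-based indexing on the sequence).
def forward(O, S, pi, tau, theta):
    F = [{j: pi[j] * theta[j][O[0]] for j in S}]  # initialize the first column
    for i in range(1, len(O)):
        F.append({j: sum(F[i - 1][k] * tau[k][j] for k in S) * theta[j][O[i]]
                  for j in S})
    return sum(F[-1][k] for k in S)  # Prob{O}: sum over the last column

# toy 2-state HMM over the alphabet {a, b}
S = ["S1", "S2"]
pi = {"S1": 0.6, "S2": 0.4}
tau = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
theta = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}
p = forward("aab", S, pi, tau, theta)
```

For a sequence this short, summing the probabilities of all $2^3 = 8$ hidden state paths by brute force gives the same answer, which is a handy way to convince yourself the recurrence is right.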

## Backward algorithm

The backward algorithm can also be used to find the probability of an observed sequence, but it is most useful in combination with the forward algorithm to determine the probability of a state at a specific position given an output sequence.

The backward algorithm introduces a probability $B_{i,j} = Prob\{O_{i+1} O_{i+2} ...O_N | Q_i = S_j\}$, which we can also conceptualize as an $s \times N$ matrix. We first initialize the last column of the matrix like so.

$B_{N, j} = 1$

Then, as the algorithm’s name suggests, we work backwards to fill out the rest of the matrix using the following recurrence.

$B_{i,j} = \sum_{k=1}^{s} B_{i+1,k}\tau_{S_j S_k}\theta_{O_{i+1}|S_k}$
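A matching sketch of the backward recurrence, again with made-up probabilities for a toy two-state HMM; recovering $Prob\{O\}$ from the first column is a convenient sanity check against the forward algorithm:

```python
# Backward algorithm in the notation above: B[i][j] corresponds to B_{i+1,j}.
def backward(O, S, pi, tau, theta):
    N = len(O)
    B = [None] * N
    B[N - 1] = {j: 1.0 for j in S}  # B_{N,j} = 1
    for i in range(N - 2, -1, -1):
        B[i] = {j: sum(B[i + 1][k] * tau[j][k] * theta[k][O[i + 1]] for k in S)
                for j in S}
    # Prob{O} = sum_j pi_{S_j} * theta_{O_1 | S_j} * B_{1,j}
    return sum(pi[j] * theta[j][O[0]] * B[0][j] for j in S)

# same style of toy 2-state HMM as above, numbers made up for illustration
S = ["S1", "S2"]
pi = {"S1": 0.6, "S2": 0.4}
tau = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
theta = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}
prob = backward("aab", S, pi, tau, theta)
```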

## Viterbi algorithm

The Viterbi algorithm is a slight variation of the forward algorithm that answers a slightly different question: what is the most probable hidden state sequence given an observed sequence (in formal terms, which hidden state sequence $Q$ maximizes the probability $Prob\{Q | O\}$)?

Here we introduce another probability which we can conceptualize as an $s \times N$ matrix.

$V_{i,j} = \max_{Q_1...Q_{i-1}}Prob\{O_1 ... O_i, Q_1 ... Q_{i-1}, Q_i = S_j\}$

We initialize this matrix exactly the same way we do in the forward algorithm.

$V_{1,j} = \pi_{S_j}\theta_{O_1 | S_j}$

We then continue to fill out the rest of the matrix using the following recurrence.

$V_{i,j} = \theta_{O_i | S_j} \max_{k=\{1 ... s\}}V_{i-1,k}\tau_{S_k S_j}$

We also maintain a matrix of pointers $V'$ where $V'_{i,j}$ indicates the value of $k$ which maximizes $V_{i-1,k}\tau_{S_k S_j}$ (used for backtracing). To find the most probable hidden state sequence, identify the highest value in the last column of the matrix (or $\max_{j=1 ... s} V_{N,j}$) and use the matrix of pointers $V'$ to trace backward and identify the sequence of states that produced the maximal probability.
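Putting the recurrence and the backtrace together, here is a sketch in Python with a toy two-state HMM (probabilities made up for illustration; `back` plays the role of $V'$):

```python
# Viterbi algorithm with backtracing, in the notation above.
def viterbi(O, S, pi, tau, theta):
    V = [{j: pi[j] * theta[j][O[0]] for j in S}]  # same init as forward
    back = [{}]
    for i in range(1, len(O)):
        V.append({})
        back.append({})
        for j in S:
            best_k = max(S, key=lambda k: V[i - 1][k] * tau[k][j])
            back[i][j] = best_k  # pointer V'_{i,j}
            V[i][j] = theta[j][O[i]] * V[i - 1][best_k] * tau[best_k][j]
    state = max(S, key=lambda j: V[-1][j])  # best state in the last column
    path = [state]
    for i in range(len(O) - 1, 0, -1):  # follow the pointers backward
        state = back[i][state]
        path.append(state)
    return list(reversed(path))

# toy 2-state HMM over the alphabet {a, b}, numbers made up for illustration
S = ["S1", "S2"]
pi = {"S1": 0.6, "S2": 0.4}
tau = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
theta = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}
path = viterbi("aab", S, pi, tau, theta)
```

For a sequence this short you can enumerate all eight hidden paths by hand and confirm that the path returned really does have the highest joint probability.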

# Random sequence generator

One way to test the probability of any given feature of a genomic sequence is to determine how often that feature occurs in sequences randomly sampled from some given “sequence space.” Determining an appropriate sequence space from which to sample isn’t a trivial task, but a common approach is to match the nucleotide composition of a given sequence (or set of sequences). Alternatively, one can use a model in which the probability of seeing a specific nucleotide at a specific position is dependent on one or more preceding nucleotides. This statistical model is called a Markov model, and it has applications to many areas of research.

I wrote a random sequence generator that uses Markov models to generate any number of random sequences. The program takes a model file (containing initial state probabilities and transition probabilities) as input and generates random sequences (the user can control the number of sequences and the length of each sequence). I also wrote a program to train a kth-order Markov model given one or more sequences (in Fasta format). The output of this program can then be used as input for the random sequence generator. Source code for these programs can be found at https://github.com/standage/Statistics-568/tree/master/sequence_spacer.
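A first-order version of the generator can be sketched in a few lines (the actual program supports kth-order models and reads its probabilities from a model file; the function names and probabilities below are made up for illustration):

```python
import random

# Sketch of a first-order Markov sequence generator: the next base is drawn
# from a distribution conditioned on the previous base.
def generate_sequence(length, init_probs, trans_probs, rng):
    symbols = list(init_probs)
    seq = rng.choices(symbols, weights=[init_probs[s] for s in symbols])[0]
    while len(seq) < length:
        probs = trans_probs[seq[-1]]  # condition on the preceding base
        seq += rng.choices(symbols, weights=[probs[s] for s in symbols])[0]
    return seq

# a slightly GC-rich model; here every row is the same, which reduces to the
# i.i.d. special case, but any row-specific probabilities would work
init = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
trans = {nt: dict(init) for nt in "ACGT"}
seq = generate_sequence(100, init, trans, random.Random(42))
```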

In my genome informatics class, we discussed the concept of waiting time: how long you expect to proceed along a sequence until you find a given feature. We derived an analytical solution for the expected waiting time for the feature “ATG” in a sequence where As, Cs, Gs, and Ts are evenly distributed. I then used my random sequence generator to create 5000 random sequences (each of length 1000) whose expected base composition was equal for As, Cs, Gs, and Ts. I then determined the position of the first occurrence of “ATG” in each random sequence. I found that the average waiting time was 63.41, very close to the analytical solution of 64. This is just one example of how computation and simulation can help address an important biological question. This type of approach is especially useful when analytical solutions are not readily available.
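The experiment is easy to reproduce in miniature (a sketch, not the original code; here the waiting time is taken as the position at which the motif completes, and “ATG” has no self-overlap, so the analytical expectation under uniform base composition is $4^3 = 64$):

```python
import random

# Draw uniform random bases until "ATG" appears; return the position at
# which the motif completes.
def waiting_time(rng, motif="ATG"):
    seq = []
    while True:
        seq.append(rng.choice("ACGT"))
        if len(seq) >= len(motif) and "".join(seq[-len(motif):]) == motif:
            return len(seq)

rng = random.Random(1)
mean_wait = sum(waiting_time(rng) for _ in range(5000)) / 5000.0
```

Averaged over 5000 trials, the simulated waiting time lands close to the analytical value of 64, just as in the experiment described above.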