Category: RNA-Seq

Understanding RSEM: raw read counts vs expected counts

By the time I had my first real exposure to computational genomics, RNA-Seq was well on its way to supplanting microarrays as the preferred method for measuring mRNA expression at the genome scale. The fundamental idea is pretty simple: short sequence reads obtained from an mRNA sample can be mapped (by sequence similarity) back to a catalog of gene or mRNA sequences, and the number of reads mapping to each molecule then acts as a proxy for that molecule’s expression level in the sample. My first RNA-Seq project involved aligning ~75bp Illumina reads to a comprehensive collection of cotton mRNAs constructed from a sampling of various different cultivars/accessions and tissues. The subsequent differential expression analysis for this particular project was pretty non-standard and involved statistics that are (or at least at the time were) over my head, so my involvement in that facet of the project was primarily technical.

More recently, I’ve been spearheading the analysis of RNA-Seq data from paper wasps. We have 12 samples, 6 from queen wasps and 6 from workers, and the preliminary RNA-Seq analysis to be included in our soon-to-be-submitted genome paper will be a very basic investigation of expression differences between queens and workers. By recommendation of a colleague, I’ve been using RSEM (and Bowtie) for estimating expression levels and EBSeq for identifying differentially expressed genes. I am much more involved in the interpretation of results now than I had been for my undergrad project(s), and accordingly I’ve taken some time to really dig down, understand what these programs are doing under the hood, and check my assumptions.

The first point to address is why it is even necessary to run a program like RSEM to “estimate” expression values. If we’re using the number of reads mapped to a transcript as a proxy for the transcript’s expression level, why not just feed these values directly into EBSeq (or one of a dozen alternative programs)? The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads (multireads for short) entirely. Another involves partitioning and distributing portions of a multiread’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.

With this basic understanding of how RSEM works (or how it is intended to work), the next point is to investigate whether the software actually works as advertised. How can we verify this? In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.

Expected count values are trivial to obtain from the RSEM output (5th column of the *.[genes|isoforms].results files), and I’ve written a simple program to count the number of reads mapping to each gene/mRNA given a .bam file produced by RSEM. Selecting a random subset of mRNAs in my sample and manually checking these values is a quick way to increase my confidence in the reliability of the experiment.

The first time I compared raw reads counts to RSEM’s expected counts, I encountered an unexpected trend: the expected counts were not slightly lower than the raw counts, they were consistently lower by a factor of 2. After thinking about this a bit, I considered the possibility that RSEM treats each pair of reads as a single unit given paired-end data. To confirm this, I selected a small subset of 10 million read pairs from one of my samples and ran RSEM twice: once in paired-end mode, and once in single-end mode disregarding the read pairing information. Consistent with my hypothesis, the expected counts produced in single-end mode were almost exactly 2 times to the expected counts produced in paired-end mode.

Having taken the time to clearly grasp what this program is doing and to verify that it is performing as advertised, I am much more confident in the results of my subsequent analyses moving forward.


Normalization for differential expression analysis

My experience with RNA-Seq and differential expression analysis goes back to my last year or two as an undergraduate working in BYU’s plant genetics lab. However, I’ve been spearheading a project recently with a significant RNA-Seq component, which has given me much more practical experience than I had before and has forced me to address some gaps in my understanding of the analysis’ details. One of these details is how data are normalized so as to enable comparison across samples.

Due to a variety of technical factors, the number of RNA-Seq reads sequenced from different samples will not be the same. Therefore, if we want to use read counts as a measure of a gene’s (or transcript’s) expression in a sample, and then compare that measurement of expression across multiple samples, we must adjust each sample’s expression measurements so that we’re “comparing apples to apples” as it were.

A straightforward approach to normalizing the data would be to adjust each sample’s expression estimates by the total number of reads in that sample. Indeed, this was my first thought as to how to address the issue. However, this approach is sensitive to outliers which may have a disproportionate influence on total read count. An alternative method, described in the DESeq paper[1], is to adjust expression measurements so that the medians of the different samples’ distributions of expression levels line up.

As it turns out, this “MedianNorm” approach is implemented in the DE analysis package I was using (EBSeq[2]) and is used by default. However, I wanted to confirm that normalization was working as expected, so I went ahead and implemented an R script to plot the distributions of expression values before and after normalization for my dozen or so samples. Here is what the distributions look like before normalization…

…and here is what they look like post-normalization.

Needless to say, I had a lot more confidence going forward once I had visually confirmed that the normalization did what I expected it to do.

The R script I used to generate the before and after plots is available on GitHub, and is part of a hodge-podge RNA-Seq analysis toolkit I’ve been cobbling together for the last few months. Having a script like this on hand will be really helpful as I work on similar projects in the future.


  1. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 2010, 11:R106, doi:10.1186/gb-2010-11-10-r106.
  2. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013) EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics, 29(8): 1035-1043, doi:10.1093/bioinformatics/btt087.

Permutation testing for differential expression analysis

Permutation testing is a pretty common non-parametric statistical test used to test how unlikely a particular outcome is. It involves calculating a statistic of interest from your data, followed by several rounds of randomly shuffling sample labels and recalculating the test statistic. Observing the proportion of times a random outcome is as extreme as the originally calculated outcome helps to get a sense for how likely that outcome is the result of random chance.

I’ve recently been involved in an RNA-Seq analysis, particularly looking at differential expression in paper wasps. With 6 biological replicates each from 2 different castes (queens and workers), I measured expression values and used these to identify genes expressed differentially between the castes. There were a few concerns with the preliminary analyses, so in addition to exploring the data in a bit more depth we also wanted to test the reliability of the analysis I had just done. Enter the permutation test.

The idea here is simple. The first step is to run the differential expression analysis with the correct sample labels (which I already did). Next, randomly shuffle the sample labels so that some “queen” samples are labeled “worker” and vice versa. Now, run the differential expression analysis again and note the number of genes designated as differentially expressed. Then simply repeat the label shuffling and re-analysis enough times to get a good idea of what’s going on.

The expectation is that if I label some of my “queens” as “workers” and some of my “workers” as “queens”, and then try to look for consistent queen-vs-worker differences in gene expression, the mislabeled samples will cancel each other out to a large extent and confound the analysis. Due to technical and biological variation, we expect to still identify some differentially expressed genes even with shuffled labels, but nowhere near the number we identify when using the correct labeling. If random permutations of the sample labels frequently produce similar number of differentially expressed genes as the correct labeling, there may be issues with the data that need additional attention. This was the case with my analysis, and I wonder how many scientists out there conduct RNA-Seq experiments without doing such basic quality assessment as this.

The script I implemented to facilitate permutation testing for my RNA-Seq analysis is available at The good news is that many of the most popular differential expression analysis packages are freely available from R/Bioconductor and use a similar input format, so if your data are already formatted for differential expression analysis there’s a good chance this script will work out-of-the-box for you.