Understanding RSEM: raw read counts vs expected counts

By the time I had my first real exposure to computational genomics, RNA-Seq was well on its way to supplanting microarrays as the preferred method for measuring mRNA expression at the genome scale. The fundamental idea is pretty simple: short sequence reads obtained from an mRNA sample can be mapped (by sequence similarity) back to a catalog of gene or mRNA sequences, and the number of reads mapping to each molecule then acts as a proxy for that molecule’s expression level in the sample. My first RNA-Seq project involved aligning ~75bp Illumina reads to a comprehensive collection of cotton mRNAs constructed from a sampling of various different cultivars/accessions and tissues. The subsequent differential expression analysis for this particular project was pretty non-standard and involved statistics that are (or at least at the time were) over my head, so my involvement in that facet of the project was primarily technical.

More recently, I’ve been spearheading the analysis of RNA-Seq data from paper wasps. We have 12 samples, 6 from queen wasps and 6 from workers, and the preliminary RNA-Seq analysis to be included in our soon-to-be-submitted genome paper will be a very basic investigation of expression differences between queens and workers. By recommendation of a colleague, I’ve been using RSEM (and Bowtie) for estimating expression levels and EBSeq for identifying differentially expressed genes. I am much more involved in the interpretation of results now than I had been for my undergrad project(s), and accordingly I’ve taken some time to really dig down, understand what these programs are doing under the hood, and check my assumptions.

The first point to address is why it is even necessary to run a program like RSEM to “estimate” expression values. If we’re using the number of reads mapped to a transcript as a proxy for the transcript’s expression level, why not just feed these values directly into EBSeq (or one of a dozen alternative programs)? The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads (multireads for short) entirely. Another involves partitioning and distributing portions of a multiread’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.

With this basic understanding of how RSEM works (or how it is intended to work), the next point is to investigate whether the software actually works as advertised. How can we verify this? In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.

Expected count values are trivial to obtain from the RSEM output (5th column of the *.[genes|isoforms].results files), and I’ve written a simple program to count the number of reads mapping to each gene/mRNA given a .bam file produced by RSEM. Selecting a random subset of mRNAs in my sample and manually checking these values is a quick way to increase my confidence in the reliability of the experiment.

The first time I compared raw reads counts to RSEM’s expected counts, I encountered an unexpected trend: the expected counts were not slightly lower than the raw counts, they were consistently lower by a factor of 2. After thinking about this a bit, I considered the possibility that RSEM treats each pair of reads as a single unit given paired-end data. To confirm this, I selected a small subset of 10 million read pairs from one of my samples and ran RSEM twice: once in paired-end mode, and once in single-end mode disregarding the read pairing information. Consistent with my hypothesis, the expected counts produced in single-end mode were almost exactly 2 times to the expected counts produced in paired-end mode.

Having taken the time to clearly grasp what this program is doing and to verify that it is performing as advertised, I am much more confident in the results of my subsequent analyses moving forward.

Advertisements

5 comments

  1. Mafalda Ferreira

    Hi Daniel! I really like your blog and your thoughts have helped me shed some light into what I’m trying to do, that is differential expression analysis with RNA-seq. I’m exactly at the step of using RSEM output as input that into edgeR and DESeq (at least, that was what I was planning). I’ve read somewhere that edgeR/DESeq don’t work well, or at least they expect that the user provides raw read counts instead of expected counts, such as the ones given by RSEM. If the user wants to do it anyway, the expected counts have to be rounded into integer values. Since you’ve used EBSeq I was wondering if you have ever tried edgeR or DESeq and what where your thoughts on rounding the values and on their performance compared to EBSeq.
    Thanks a lot!

  2. Daniel Standage

    I have run edgeR and DESeq, and the amount of overlap between the differentially expressed genes labeled by the 3 different programs is was depressing. However, I did this at an earlier stage before I had really taken the time to understand what the programs are doing, so I’m not sure how the expected vs raw counts will affect DESeq or edgeR.

  3. Leonardo - Unicamp/BR

    Hi Daniel, thanks for the explanation!
    Could you help me about the smr tool? I installed (make) and run ./smr /path_to_sam_file/file.sam -o test .
    It just showed me “segmentation fault” and didn’t produce any result. What could be happened?
    Very thanks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s