By the time I had my first real exposure to computational genomics, RNA-Seq was well on its way to supplanting microarrays as the preferred method for measuring mRNA expression at the genome scale. The fundamental idea is pretty simple: short sequence reads obtained from an mRNA sample can be mapped (by sequence similarity) back to a catalog of gene or mRNA sequences, and the number of reads mapping to each molecule then acts as a proxy for that molecule’s expression level in the sample. My first RNA-Seq project involved aligning ~75bp Illumina reads to a comprehensive collection of cotton mRNAs constructed from a sampling of various different cultivars/accessions and tissues. The subsequent differential expression analysis for this particular project was pretty non-standard and involved statistics that are (or at least at the time were) over my head, so my involvement in that facet of the project was primarily technical.
More recently, I’ve been spearheading the analysis of RNA-Seq data from paper wasps. We have 12 samples, 6 from queen wasps and 6 from workers, and the preliminary RNA-Seq analysis to be included in our soon-to-be-submitted genome paper will be a very basic investigation of expression differences between queens and workers. By recommendation of a colleague, I’ve been using RSEM (and Bowtie) for estimating expression levels and EBSeq for identifying differentially expressed genes. I am much more involved in the interpretation of results now than I had been for my undergrad project(s), and accordingly I’ve taken some time to really dig down, understand what these programs are doing under the hood, and check my assumptions.
The first point to address is why it is even necessary to run a program like RSEM to “estimate” expression values. If we’re using the number of reads mapped to a transcript as a proxy for the transcript’s expression level, why not just feed these values directly into EBSeq (or one of a dozen alternative programs)? The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads (multireads for short) entirely. Another involves partitioning and distributing portions of a multiread’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes.
With this basic understanding of how RSEM works (or how it is intended to work), the next point is to investigate whether the software actually works as advertised. How can we verify this? In the ideal case, the expected count estimated by RSEM will be precisely the number of reads mapping to that transcript. However, when counting the number of reads mapped for all transcripts, multireads get counted multiple times, so we can expect that this number will be slightly larger than the expected count for many transcripts.
Expected count values are trivial to obtain from the RSEM output (5th column of the
*.[genes|isoforms].results files), and I’ve written a simple program to count the number of reads mapping to each gene/mRNA given a
.bam file produced by RSEM. Selecting a random subset of mRNAs in my sample and manually checking these values is a quick way to increase my confidence in the reliability of the experiment.
The first time I compared raw reads counts to RSEM’s expected counts, I encountered an unexpected trend: the expected counts were not slightly lower than the raw counts, they were consistently lower by a factor of 2. After thinking about this a bit, I considered the possibility that RSEM treats each pair of reads as a single unit given paired-end data. To confirm this, I selected a small subset of 10 million read pairs from one of my samples and ran RSEM twice: once in paired-end mode, and once in single-end mode disregarding the read pairing information. Consistent with my hypothesis, the expected counts produced in single-end mode were almost exactly 2 times to the expected counts produced in paired-end mode.
Having taken the time to clearly grasp what this program is doing and to verify that it is performing as advertised, I am much more confident in the results of my subsequent analyses moving forward.