My experience with RNA-Seq and differential expression analysis goes back to my last year or two as an undergraduate working in BYU’s plant genetics lab. However, I’ve been spearheading a project recently with a significant RNA-Seq component, which has given me much more practical experience than I had before and has forced me to address some gaps in my understanding of the analysis’ details. One of these details is how data are normalized so as to enable comparison across samples.
Due to a variety of technical factors, the number of RNA-Seq reads sequenced from different samples will not be the same. Therefore, if we want to use read counts as a measure of a gene’s (or transcript’s) expression in a sample, and then compare that measurement of expression across multiple samples, we must adjust each sample’s expression measurements so that we’re “comparing apples to apples” as it were.
A straightforward approach to normalizing the data would be to adjust each sample’s expression estimates by the total number of reads in that sample. Indeed, this was my first thought as to how to address the issue. However, this approach is sensitive to outliers which may have a disproportionate influence on total read count. An alternative method, described in the DESeq paper, is to adjust expression measurements so that the medians of the different samples’ distributions of expression levels line up.
As it turns out, this “MedianNorm” approach is implemented in the DE analysis package I was using (EBSeq) and is used by default. However, I wanted to confirm that normalization was working as expected, so I went ahead and implemented an R script to plot the distributions of expression values before and after normalization for my dozen or so samples. Here is what the distributions look like before normalization…
…and here is what they look like post-normalization.
Needless to say, I had a lot more confidence going forward once I had visually confirmed that the normalization did what I expected it to do.
The R script I used to generate the before and after plots is available on GitHub, and is part of a hodge-podge RNA-Seq analysis toolkit I’ve been cobbling together for the last few months. Having a script like this on hand will be really helpful as I work on similar projects in the future.
- Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 2010, 11:R106, doi:10.1186/gb-2010-11-10-r106.
- Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C (2013) EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics, 29(8): 1035-1043, doi:10.1093/bioinformatics/btt087.