I have recently been developing some software for comparing gene structure annotations from different sources. Up until last week, I had only tested my software on plant data, and decided that a test on human genome annotations would be good. After implementing and testing a few new features on a small human data set, I ran the software on the entire human genome. About a minute into execution, the program crashed with a segfault. I was justifiably disappointed, since the program had tested well with other data. I had no idea where to begin looking for a bug.
After troubleshooting for some time, I started reviewing the assumptions I make at each stage of the program. As I was doing this, I realized that I was assuming that a gene would have no more than 256 exons. Of course, most genes will have well under 100 exons, but I thought 256 would be a safe upper bound. Incorrect! After a bit of digging, I found that in the GRCh37.61 release, there are 3 genes with more than 256 predicted exons! When my program tried to analyze these genes, it caused memory issues and the program crashed. When I reluctantly increased the limit to 512, the program ran fine.
This got me interested how many genes in the human genome have an extreme number of exons. Of course, at this point it’s not plausible (possible?) to determine whether a gene actually has hundreds of exons or whether such an annotation is the artifact of the gene prediction workflow, but I thought this would be informative nonetheless. I began by firing up R and grabbing some summary statistics of the two latest gene builds for the human genome…
> counts.61 <- scan("hg61-numexons.txt", integer(0)) Read 16165 items > length(counts.61)  16165 > length(counts.61[counts.61 > 100])  30 > length(counts.61[counts.61 > 256])  3 > mean(counts.61)  12.84955 > median(counts.61)  9 > var(counts.61)  172.7238 > quantile(counts.61) 0% 25% 50% 75% 100% 1 5 9 17 291 > > > counts.62 <- scan("hg62-numexons.txt", integer(0)) Read 16165 items > length(counts.62)  16165 > length(counts.62[counts.62 > 100])  11 > length(counts.62[counts.62 > 256])  0 > mean(counts.62)  10.73764 > median(counts.62)  8 > var(counts.62)  101.2633 > quantile(counts.62) 0% 25% 50% 75% 100% 1 4 8 14 248
…and plotting some histograms.
> png("hg61-numexons.png") > hist(counts.61[counts.61 < 100], border="red", col="#ffeeee", main="# Exons per Gene (GRCh37.61)", xlab="# Exons", ylab="Frequency") > dev.off() null device 1 > png("hg62-numexons.png") > hist(counts.62[counts.62 < 100], border="blue", col="#eeeeff", main="# Exons per Gene (GRCh37.62)", xlab="# Exons", ylab="Frequency") > dev.off() null device 1
So it seems that the predicted number of exons per gene is decreasing with each release. I assume this means that the gene prediction workflows are becoming more accurate at predicting true exon splice sites and more capable of discerning true exons from unwanted artifacts (spurious exons, exons from repetitive or transposable sequences, etc).