In silico empirical estimation of paired-end insert size

I’m no expert in library preparation for DNA sequencing, but as I understand it, protocols for paired-end and mate-pair sequencing involve a size selection step that selects DNA fragments within a tight target size range. By controlling the size of sequenced DNA fragments (or “inserts”, borrowing terminology from the days when DNA fragments were actually inserted into vectors and clonally amplified), assembly and alignment software are better able to place reads in their correct locations.

Size selection is not perfect though, so even if you select for, say, 500 bp inserts, you’ll end up with a range of insert sizes. To accommodate this, assemblers and read aligners typically ask users to provide a mean and standard deviation for the insert size. What values do you provide? How wide is the length distribution?

I’ll be frankly honest and admit that for most of my career, I’ve simply provided the target insert size as the mean and a “reasonable” value for the standard deviation (please don’t ask me to define “reasonable” 🙂 ). There are, however, better ways to do this. Hopefully your sequencing center used gel electrophoresis to confirm that the size selection step didn’t fail miserably, and the position of the band in that gel, relative to a ladder, should make the mean insert size pretty clear.

But what if your sequencing center did not provide you with this information, or what if you want a second opinion? Are there any bioinformatics approaches that can help?

Sure! We can empirically determine the distribution of insert sizes by aligning the reads to the genome sequence and observing the outer distance of the paired read alignments. The protocol might look something like this.

# Align paired-end reads with BWA
bwa index genome.fa
bwa mem genome.fa reads1.fq reads2.fq > align.sam

# Convert the alignments to BAM format and sort with SAMtools
samtools view -bS align.sam | samtools sort -o align-sorted -

# Collect insert size metrics with Picard
java -jar /path/to/picard.jar CollectInsertSizeMetrics \
    HISTOGRAM_FILE=inserts-hist.pdf \
    INPUT=align-sorted.bam \

# Graphic of the read length distribution will be in `inserts-hist.pdf`.

But wait, you object, I’m doing a de novo genome assembly. You’re telling me that I have to have my genome sequence to determine what insert size I should use for reconstructing my genome sequence?!?!

Yeah, I know. Assembling the genome and mapping the reads to that genome assembly both require you to specify insert sizes, so it’s a bit of a chicken-and-egg problem. However, genome assembly is hard, and it’s unlikely that you’ll get it right on your first try. Assuming that your first assembly attempt isn’t a complete and utter failure, you should be able to use that as a reference to align your reads and determine the insert size distribution, which you can then use for subsequent assemblies.

In fact, I would recommend doing this as soon as possible after receiving your data to ensure there weren’t big issues with the sequencing. If your 10 kb mate pair library turns out to be mostly 300 bp inserts, and you report this to your sequencing center a month after they give you your data, chances are they can do something about it. However, if it takes you 2 years to assemble and annotate the genome, and only then you check this, there’s probably nothing that can be done.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s