Category: Sequences

In silico empirical estimation of paired-end insert size

I’m no expert in library preparation for DNA sequencing, but as I understand it, protocols for paired-end and mate-pair sequencing involve a size selection step that selects DNA fragments within a tight target size range. By controlling the size of sequenced DNA fragments (or “inserts”, borrowing terminology from the days when DNA fragments were actually inserted into vectors and clonally amplified), assembly and alignment software are better able to place reads in their correct locations.

Size selection is not perfect though, so even if you select for, say, 500 bp inserts, you’ll end up with a range of insert sizes. To accommodate this, assemblers and read aligners typically ask users to provide a mean and standard deviation for the insert size. What values do you provide? How wide is the length distribution?

I’ll be honest and admit that for most of my career, I’ve simply provided the target insert size as the mean and a “reasonable” value for the standard deviation (please don’t ask me to define “reasonable” 🙂 ). There are, however, better ways to do this. Hopefully your sequencing center used gel electrophoresis to confirm that the size selection step didn’t fail miserably; the position of the band in that gel, relative to a ladder, should make the mean insert size pretty clear.

But what if your sequencing center did not provide you with this information, or what if you want a second opinion? Are there any bioinformatics approaches that can help?

Sure! We can empirically determine the distribution of insert sizes by aligning the reads to the genome sequence and observing the outer distance of the paired read alignments. The protocol might look something like this.

# Align paired-end reads with BWA
bwa index genome.fa
bwa mem genome.fa reads1.fq reads2.fq > align.sam

# Convert the alignments to BAM format and sort with SAMtools
samtools view -bS align.sam | samtools sort -o align-sorted.bam -

# Collect insert size metrics with Picard
java -jar /path/to/picard.jar CollectInsertSizeMetrics \
    INPUT=align-sorted.bam \
    OUTPUT=insert-metrics.txt \
    HISTOGRAM_FILE=inserts-hist.pdf

# Graphic of the insert size distribution will be in `inserts-hist.pdf`.

But wait, you object, I’m doing a de novo genome assembly. You’re telling me that I have to have my genome sequence to determine what insert size I should use for reconstructing my genome sequence?!?!

Yeah, I know. Assembling the genome and mapping the reads to that genome assembly both require you to specify insert sizes, so it’s a bit of a chicken-and-egg problem. However, genome assembly is hard, and it’s unlikely that you’ll get it right on your first try. Assuming that your first assembly attempt isn’t a complete and utter failure, you should be able to use that as a reference to align your reads and determine the insert size distribution, which you can then use for subsequent assemblies.

In fact, I would recommend doing this as soon as possible after receiving your data to ensure there weren’t big issues with the sequencing. If your 10 kb mate pair library turns out to be mostly 300 bp inserts, and you report this to your sequencing center a month after they give you your data, chances are they can do something about it. However, if it takes you 2 years to assemble and annotate the genome, and only then you check this, there’s probably nothing that can be done.
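If you want a quick sanity check without waiting on Picard, the distribution can also be scraped straight from the alignments: for a properly paired, forward-strand alignment, the TLEN field (column 9 of the SAM record) is the outer distance. Here is a rough sketch; the `insert_stats` helper name and the exact filtering flags are my own choices, not part of any tool.

```shell
# insert_stats: mean and standard deviation of positive TLEN values
# (SAM column 9) read from standard input.
insert_stats() {
  awk '$9 > 0 { n++; sum += $9; sumsq += $9 * $9 }
       END { mean = sum / n
             sd = sqrt(sumsq / n - mean * mean)
             printf "n=%d mean=%.1f sd=%.1f\n", n, mean, sd }'
}

# Typical usage, with the sorted BAM file from above:
#   samtools view -f 0x2 -F 0x904 align-sorted.bam | insert_stats
```

The `-f 0x2` filter keeps properly paired alignments, `-F 0x904` drops unmapped, secondary, and supplementary records, and keeping only positive TLEN values counts each pair once.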

Decoding SAM flags

While the SAM file format has a lot in common with every other tab-delimited format for storing biological data, the bitwise flag column uses a nifty approach that I have rarely if ever seen elsewhere.

The idea behind the bitwise flag is compression. There are 11 boolean flags associated with each entry in a SAM file, but instead of creating a dedicated column for each of these flags, they are all stored in a single column. This not only reduces the amount of space the data occupy on the disk, it also makes the format less cluttered (I mean seriously, more than 8-10 columns for a data point and my brain can’t handle it any more).

Bits in the bitwise flag

The following table (directly from the SAM specification) provides a description of each of the 11 bits in the bitwise flag.

Bit    Description
0x1    template having multiple segments in sequencing
0x2    each segment properly aligned according to the aligner
0x4    segment unmapped
0x8    next segment in the template unmapped
0x10   SEQ being reverse complemented
0x20   SEQ of the next segment in the template being reversed
0x40   the first segment in the template
0x80   the last segment in the template
0x100  secondary alignment
0x200  not passing quality controls
0x400  PCR or optical duplicate

So how are these 11 values stored in a single column? The SAM format uses the concept of a binary string 11 characters long. Each character (bit) in the string corresponds to one of the 11 flags, and a value of “1” indicates that the flag is set. But rather than storing a binary string of length 11, the SAM format evaluates the string as a binary number and stores the corresponding decimal representation of that number. For example, the number ‘00001001101’ in binary encoding has the same value as the number ’77’ in decimal encoding, which is the value that would be stored in the second column of a SAM entry. Note also that (for some reason) each bit is described in the SAM specification using its hexadecimal representation (i.e., ‘0x10’ = 16 and ‘0x40’ = 64).
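To make this concrete, here is a small shell sketch that tests each bit with ‘bitwise AND’; the short flag names are my own shorthand for the descriptions in the table above.

```shell
# decode_flag: print a short name for each bit set in a SAM flag value.
decode_flag() {
  flag=$1
  [ $(( flag & 0x1   )) -ne 0 ] && echo "paired"
  [ $(( flag & 0x2   )) -ne 0 ] && echo "proper pair"
  [ $(( flag & 0x4   )) -ne 0 ] && echo "unmapped"
  [ $(( flag & 0x8   )) -ne 0 ] && echo "mate unmapped"
  [ $(( flag & 0x10  )) -ne 0 ] && echo "reverse strand"
  [ $(( flag & 0x20  )) -ne 0 ] && echo "mate reverse strand"
  [ $(( flag & 0x40  )) -ne 0 ] && echo "first in template"
  [ $(( flag & 0x80  )) -ne 0 ] && echo "last in template"
  [ $(( flag & 0x100 )) -ne 0 ] && echo "secondary alignment"
  [ $(( flag & 0x200 )) -ne 0 ] && echo "QC fail"
  [ $(( flag & 0x400 )) -ne 0 ] && echo "duplicate"
  return 0
}

decode_flag 77   # 77 = 0x1 + 0x4 + 0x8 + 0x40
```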


If you are examining a SAM/BAM file manually, then this little decimal-to-flag converter on the picard tools website is extremely useful. However, given the typically large size of these files, chances are that most SAM/BAM processing will be done with some kind of program or script.

For my first experiences processing SAM files, I was not accustomed to this type of encoding. It didn’t take long for me to wrap my head around the concept, but my initial approach to decoding bitwise flags definitely felt a bit kludgy. I was essentially converting the decimal numbers into their corresponding binary string representations, and then testing the value of the character (“0” or “1”) at each position of interest in that string.

However, there is a much better way to decode the flags in the bitwise column using bitwise operators (huh, imagine that). Specifically, the ‘bitwise AND’ operator can be used to test whether each flag is set. For example, if you want to test whether a given SAM entry passes quality controls, you would test the ‘0x200’ bit like this.

# This is Perl code, but a single '&' symbol represents 'bitwise AND' in many
# languages.

if($flag & hex("0x200")) {
  # fail!
}
else {
  # pass!
}

# If you're comfortable converting between hexadecimal and decimal in your head,
# you could just use the decimal representation of the same flag. '0x200' = 512

if($flag & 512) {
  # fail!
}
else {
  # pass!
}


This post by Damian Kao covers the same concept, and provides a bit more background with regards to converting between different number systems. Could have saved myself a lot of trouble if I had seen this earlier…

DELIMINATE: comparison with gzip, bzip2, and zip

Yesterday, I saw an abstract in my RSS feed for a new compression algorithm called DELIMINATE. Two things from the abstract caught my attention. First, it claimed to be lossless—I’ve seen abstracts for lossy compression algorithms before, but honestly I have no interest in introducing any additional error into any Illumina data, whose quality is already somewhat crappy. It also claimed improvements in compression speed and compression ratio.

I figured I would give the software a try. The download process was nearly painless (I did have to provide my name and email), and the software “just worked” as advertised out-of-the-box. I was able to confirm that the compression is indeed lossless, and that the compression speed and ratio is indeed superior to all the other commonly-used compression programs/algorithms (I recorded a terminal session of the tests in this asciicast). The decompression time, however, was quite a bit slower than all the alternatives. This, in addition to the practical consideration of installing yet another compression program on every single computer you use frequently (and forcing my colleagues to do the same), is the only hesitation I have regarding DELIMINATE.

Trimmomatic for cleaning NGS reads

Trimming and cleaning are important first steps in any analysis involving Illumina short read data. I’m somewhat familiar with the experimental protocols and why trimming, clipping, and filtering need to be done, but honestly I haven’t invested too much effort in familiarizing myself with the details of various implementations. I’ve essentially just used an R script from a colleague thus far for all my NGS preprocessing needs.

This week I came across a link for the Trimmomatic tool, which seems to be a well-implemented and reliable tool. I’m interested in comparing the performance of Trimmomatic versus my R script. I’m assuming Trimmomatic will be much faster and produce results of at least the same quality (if not better), in which case I would be happy to scrap the R script for good!

Error correction with AllPaths-LG

My experience with the AllPaths-LG genome assembler has been great. Anecdotal evidence as well as some rigorous studies both seem to corroborate this. Earlier this week, I came across a link to this post on the AllPaths-LG blog.

The ALLPATHS-LG fragment read error correction modules are now available for stand alone use outside the full assembly pipeline. You can now easily error correct your own data using our method without running an assembly.

I’m glad I saw this when I did. I recently read over Steven Salzberg et al.‘s GAGE evaluation, and remembered the paper mentioning that for many (most?) assemblers, using AllPaths-LG’s error correction module as opposed to QUAKE gave better results. I’m definitely interested to try this out.

Can I estimate genome size from the size of the Fasta file?

TL;DR: It’s fine for a very rough estimate.

There are at least two factors that complicate using the filesize of a Fasta file as a proxy for the genome size. First, there are extra characters in the file that do not represent nucleotides in the genome. A small number of these are found in the Fasta record headers, but most are in the form of invisible newline characters. Using the number of bytes in the file as an estimate of genome size will result in an inflated estimate unless the newlines (and other extra characters) are first removed. Second, when working with large Fasta files, it’s common to discuss things in terms of megabytes or gigabytes, not millions or billions of bytes. This is another complication, since 1 megabyte = 1024 kilobytes = 1024 * 1024 bytes, whereas 1 megabase = 1000 kilobases = 1000 * 1000 base pairs.
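Stripping those extra characters before counting is a one-pipeline job. A sketch, with `genome.fa` standing in for whatever assembly file you have:

```shell
# count_bases: number of sequence characters in a Fasta file, ignoring
# deflines and newline characters.
count_bases() {
  grep -v '^>' "$1" | tr -d '\n' | wc -c
}

# count_bases genome.fa
```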

All of this of course assumes that the Fasta file contains an accurate assembly of the genome. No assembly is perfectly accurate, but the bigger the problems with the assembly, the bigger the difference between the actual genome size and the size estimated from the Fasta file.

That being said, if you’re interested in just a rough estimate, looking at the filesize will give you a ballpark idea of the genome size. Consider the following.

[standage@gnomic Polistes_dominulus]$ perl -ne 'if(!m/^>/){chomp(); $total += length($_)} END{printf("length: %d\n", $total)}' < allpathslg05-final.assembly.fasta
length: 202344795
[standage@gnomic Polistes_dominulus]$ ls -l allpathslg05-final.assembly.fasta
-rw-------. 1 standage users 204905495 Apr 21 20:30 allpathslg05-final.assembly.fasta
[standage@gnomic Polistes_dominulus]$ ls -lh allpathslg05-final.assembly.fasta
-rw-------. 1 standage users 196M Apr 21 20:30 allpathslg05-final.assembly.fasta

The first command gives the actual size of the genome assembly: 202,344,795 base pairs. The second command gives the size of the Fasta file containing the assembly: 204,905,495 bytes. The third command gives the human-readable size of the Fasta file: 196 megabytes. In each case, as long as you do some generous rounding, you’ll end up with 200 megabases as your estimate.
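The unit bookkeeping can be double-checked with shell arithmetic, using the same byte count reported by `ls -l` above:

```shell
# Convert the raw byte count to binary units (what ls -lh reports) and to
# decimal units (what "megabase" bookkeeping implies).
bytes=204905495
echo "binary megabytes (MiB):  $(( bytes / 1024 / 1024 ))"   # 195; ls -lh rounds up to 196M
echo "decimal megabytes (MB):  $(( bytes / 1000 / 1000 ))"   # 204
```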

Reference transcript data set for insects

I participate in several science- and programming-related online forums (as I discussed recently), and recently I’ve seen quite a lot of requests for pointers to some set of DNA sequences that can be used for testing this or that (for example, see this thread). A lot of these requests seem to come from programmers with little biological intuition who simply want/need to play around with some biomolecular sequence data. My first instinct is to tell them to just go to GenBank or RefSeq and download some sequences, but I guess the raw amount of data can be pretty intimidating for anyone that does not know precisely what they are looking for, biologist or not.

This morning I decided to take some time to create a small reference data set. I was planning on making it span most of eukaryotic diversity, but after realizing how long that would take, I decided to simply focus on insects (maybe I’ll do plants later, and then mammals, and so on—it’s much less of a drain if it can be broken up into smaller tasks). The database I created contains transcript sequences for 8 model insect genomes: Aedes aegypti, Atta cephalotes, Apis mellifera, Acyrthosiphon pisum, Bombus impatiens, Drosophila melanogaster, Harpegnathos saltator, and Nasonia vitripennis.

Rather than posting the dataset itself, I’ll go ahead and post the Makefile I used to put together the dataset. Enjoy!


# Full database
$(DBNAME):		aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa
				cat  aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa > $(DBNAME)
				rm aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa

# Download and decompress transcripts for each genome

# Aedes aegypti
aaeg-trans.fa:
				curl -o aaeg-trans.fa.gz
				gunzip aaeg-trans.fa.gz

# Atta cephalotes
acep-trans.fa:
				curl -o acep-trans.fa.gz
				gunzip acep-trans.fa.gz

# Apis mellifera
amel-trans.fa:
				curl -o amel-trans.fa.gz
				gunzip amel-trans.fa.gz

# Acyrthosiphon pisum
apis-trans.fa:
				curl -o apis-trans.fa.gz
				gunzip apis-trans.fa.gz

# Bombus impatiens
bimp-trans.fa:
				curl -o bimp-trans.fa.gz
				gunzip bimp-trans.fa.gz

# Drosophila melanogaster
dmel-trans.fa:
				curl -o dmel-trans.fa.gz
				gunzip dmel-trans.fa.gz

# Harpegnathos saltator
hsal-trans.fa:
				curl -o hsal-trans.fa.gz
				gunzip hsal-trans.fa.gz

# Nasonia vitripennis
nvit-trans.fa:
				curl -o nvit-trans.fa.gz
				gunzip nvit-trans.fa.gz


I later ran the following commands to standardize the Fasta deflines (headers).

perl -ne 'if(m/^>(AAEL\S+)/){printf(">gnl|Aedes_aegypti|%s\n", $1)}else{print}' < aaeg-trans.fa >> insect-transcripts.fa
perl -ne 's/Acep_1\.0/Atta_cephalotes/; print' < acep-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Apis mellifera/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Apis_mellifera|%s\n", $1)}else{print}' < amel-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Acyrthosiphon pisum/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Acyrthosiphon_pisum|%s\n", $1)}else{print}' < apis-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Bombus impatiens/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Bombus_impatiens|%s\n", $1)}else{print}' < bimp-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/^>(\S+)/){printf(">gnl|Drosophila_melanogaster|%s\n", $1)}else{print}' < dmel-trans.fa >> insect-transcripts.fa 
perl -ne 's/Hsal_3.3/Harpegnathos_saltator/; print' < hsal-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Nasonia vitripennis/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Nasonia_vitripennis|%s\n", $1)}else{print}' < nvit-trans.fa >> insect-transcripts.fa

Parsing Fasta files with C: kseq.h

The C programming language is quickly becoming my favorite. For many tasks, an interpreted language like Perl is still my first choice, since I can usually write the script very fast and superfast performance isn’t always a huge priority. However, now that I am becoming increasingly comfortable and quick coding in C, I’m finding more and more tasks where it’s worth spending a little bit more time coding to get the performance benefit I want.

I appreciate the understanding that comes from implementing everything from scratch, but I also appreciate the common programmer warning not to reinvent the wheel. For example, I often implement data structures from scratch, but I am not quite ready to invest the time required to implement a Fasta parser that is capable of handling all the many different cases that might come up. This wasn’t a big deal for me though, since most of the code I’ve been writing lately is designed to analyze gene annotations in GFF3 format, not genome sequences in Fasta format.

Well, this changed recently. I was working on some code as part of a collaboration with another department. The code needed to be fast (read: implemented in C) and it involved sequence analysis (read: has Fasta parser). I was forced to finally get serious about exploring the different options that exist out in the open source space. There are a few, but I settled on the kseq.h library. The library is essentially a self-contained C header file, so it’s extremely portable. Also, it can handle both Fasta and Fastq input, either in plain text or gzip-compressed. I’ve been pleased so far with this library, and although I’ve only used it for one project, I plan on using it in the future and would recommend it to anyone who’s looking.

Here is some sample source code from the kseq.h home page.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
// STEP 1: declare the type of file handler and the read() function
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{
	gzFile fp;
	kseq_t *seq;
	int l;
	if (argc == 1) {
		fprintf(stderr, "Usage: %s <in.seq>\n", argv[0]);
		return 1;
	}
	fp = gzopen(argv[1], "r"); // STEP 2: open the file handler
	seq = kseq_init(fp); // STEP 3: initialize seq
	while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence
		printf("name: %s\n", seq->name.s);
		if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
		printf("seq: %s\n", seq->seq.s);
		if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
	}
	printf("return value: %d\n", l);
	kseq_destroy(seq); // STEP 5: destroy seq
	gzclose(fp); // STEP 6: close the file handler
	return 0;
}

Splitting Fasta files by size

Processing DNA or protein sequence files is one of the most common tasks in bioinformatics. Occasionally, the need comes up to split a Fasta file into smaller pieces. This is a task for which I have written many a 5-line Perl script that I used only once. But after today, I don’t think I’ll be writing any more.

I’m a big fan of the GenomeTools toolkit and development library, and today I discovered their splitfasta tool. If you want to split your sequence data into a given number of files, run it with the -numfiles flag. If you want to split your data into files of a certain size, run it with the -targetsize flag. splitfasta usually gives a pretty even distribution of sequence, although occasionally the last file is significantly smaller than the rest–especially when using the -targetsize flag. I saw similar behavior in the scripts that I wrote previously though, so I’m not too concerned.

Just like everything else in the GenomeTools toolkit, splitfasta is implemented in C, so it’s fast. Enjoy!

standage@ubuntu:~/$ ls -lhp
total 98M
-rw-r--r-- 1 standage standage 98M 2012-01-20 08:50 cotton.fasta
standage@ubuntu:~/$ gt splitfasta -numfiles 13 cotton.fasta 
standage@ubuntu:~/$ ls -lhp
total 195M
-rw-r--r-- 1 standage standage  98M 2012-01-20 08:50 cotton.fasta
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.1
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.10
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.11
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.12
-rw-rw-r-- 1 standage standage 7.4M 2012-01-20 10:10 cotton.fasta.13
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.2
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.3
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.4
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.5
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.6
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.7
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.8
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.9
standage@ubuntu:~/$ rm cotton.fasta.*
standage@ubuntu:~/$ gt splitfasta -targetsize 15 cotton.fasta 
standage@ubuntu:~/$ ls -lhp
total 195M
-rw-r--r-- 1 standage standage  98M 2012-01-20 08:50 cotton.fasta
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.1
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.2
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.3
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.4
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.5
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.6
-rw-rw-r-- 1 standage standage 7.1M 2012-01-20 10:11 cotton.fasta.7

How to retrieve GenBank records with a range of accession numbers

Yesterday I was looking at a publication my research group published a few years ago and I found a reference to some data I was looking for.

The ESTs from GR_Ea and GR_Eb were deposited in GenBank under accession nos. CO069431–CO100583 and CO100584–CO132899.

I went to GenBank and submitted these accession numbers as search terms, but I could only get it to retrieve 1 EST at a time. I did not want to repeat this process thousands of times, so I did some searching and finally got a response on the BioStar forum.

GenBank’s EST database (as well as their other databases) has a URL-based query interface. The answer to my problem was simply knowing how to use this interface (in other words, how to build my URLs). Of course, you don’t have to use URLs–they have a “Search Details” box on the site that allows you to enter your search query directly. Using either interface will retrieve the same results.

To get the first set of records in the publication, I used the URL-based interface and pointed my browser at an Entrez search URL with the query term `CO069431:CO100583[accn]`. The colon operator is interpreted as the range between the two values, and the [accn] operator indicates you want to retrieve entries whose accession numbers match the provided value(s). Taadaaaa!
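If you have many ranges to fetch, the query term itself is trivial to generate. This tiny helper is my own (not part of Entrez); it produces exactly what goes in the term= parameter or the “Search Details” box.

```shell
# range_term FIRST LAST: build an Entrez accession-range query term.
range_term() {
  printf '%s:%s[accn]' "$1" "$2"
}

range_term CO069431 CO100583   # prints CO069431:CO100583[accn]
```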

This is a relatively simple query, but GenBank allows you to build pretty complex queries and filters. If you check out GenBank’s Entrez Help page, they provide a detailed description of how to build complex queries for retrieving your desired data. Make sure to scroll past their long-winded description of all of the Entrez databases!