Can I estimate genome size from the size of the Fasta file?

TL;DR: It’s fine for a very rough estimate.

There are at least two factors that complicate using the size of a Fasta file as a proxy for genome size. First, the file contains extra characters that do not represent nucleotides in the genome. A small number of these are in the Fasta record headers, but most are invisible newline characters at the end of each sequence line. Unless the newlines (and other extra characters) are removed first, the number of bytes in the file will overestimate the genome size. Second, the sizes of large Fasta files are usually discussed in megabytes or gigabytes, not millions or billions of bytes. This matters because file sizes are often reported in binary units, where 1 megabyte = 1024 kilobytes = 1024 * 1024 bytes, whereas 1 megabase = 1000 kilobases = 1000 * 1000 base pairs.
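To make the first point concrete, here is a minimal Python sketch that counts only sequence characters and compares the result with the raw byte count. The function name `count_bases` and the two tiny records are made up for illustration; the sketch assumes plain ASCII text (one byte per character) and Unix line endings.

```python
import io

def count_bases(handle):
    """Count sequence characters in a Fasta stream, skipping headers and line endings."""
    total = 0
    for line in handle:
        if line.startswith(">"):
            continue  # header line: not part of the sequence
        total += len(line.strip())  # drop the newline and any stray whitespace
    return total

# Made-up example records, not real data.
example = ">scaffold_1 hypothetical\nACGTACGTAC\nGGTTAACC\n>scaffold_2\nACGT\n"

n_bases = count_bases(io.StringIO(example))
n_bytes = len(example.encode("ascii"))  # what the file size would be on disk

print(f"bases: {n_bases}, bytes: {n_bytes}, extra: {n_bytes - n_bases}")
```

For a real file you would pass an open file handle instead of the `io.StringIO` object. In a toy example like this the headers dominate the overhead; in a real assembly with long sequences, most of the extra bytes come from newlines.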

All of this assumes that the Fasta file contains an accurate assembly of the genome. No assembly is perfectly accurate, of course, and the bigger the problems with the assembly, the bigger the difference between the actual genome size and the size estimated from the Fasta file.

That being said, if you’re interested in just a rough estimate, looking at the filesize will give you a ballpark idea of the genome size. Consider the following.

[standage@gnomic Polistes_dominulus]$ perl < allpathslg05-final.assembly.fasta | perl -e '$total = 0; while(<>){chomp();($id, $length) = split(/,/); $total += $length;}; printf("length: %d\n", $total)'
length: 202344795
[standage@gnomic Polistes_dominulus]$ ls -l allpathslg05-final.assembly.fasta
-rw-------. 1 standage users 204905495 Apr 21 20:30 allpathslg05-final.assembly.fasta
[standage@gnomic Polistes_dominulus]$ ls -lh allpathslg05-final.assembly.fasta
-rw-------. 1 standage users 196M Apr 21 20:30 allpathslg05-final.assembly.fasta

The first command gives the actual size of the genome assembly: 202,344,795 base pairs. The second command gives the size of the Fasta file containing the assembly: 204,905,495 bytes. The third command gives the human-readable size of the Fasta file: 196 megabytes. In each case, as long as you do some generous rounding, you’ll end up with 200 megabases as your estimate.
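The arithmetic behind those three numbers can be checked with a short Python sketch. The two input values are copied from the output above; the variable names are only illustrative. The remark about rounding is an assumption: as far as I know, GNU `ls -lh` rounds up, which would explain why roughly 195.4 binary megabytes is displayed as 196M.

```python
genome_bp = 202_344_795   # base pairs reported by the first command
file_bytes = 204_905_495  # bytes reported by ls -l

# Bytes in the file that are not nucleotides (headers, newlines, etc.).
overhead = file_bytes - genome_bp
overhead_pct = 100 * overhead / genome_bp

# The same file size expressed in binary and decimal megabytes.
size_binary_mb = file_bytes / (1024 * 1024)
size_decimal_mb = file_bytes / (1000 * 1000)

# The assembly size in megabases (always decimal).
genome_mbp = genome_bp / (1000 * 1000)

print(f"overhead: {overhead} bytes ({overhead_pct:.2f}% of the assembly)")
print(f"file size: {size_binary_mb:.1f} binary MB, {size_decimal_mb:.1f} decimal MB")
print(f"assembly size: {genome_mbp:.1f} Mbp")
```

In this case the two complications partly cancel out: the extra characters push the byte count above the true size, while the binary unit pulls the human-readable number below it.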

