Splitting Fasta files by size

Processing DNA or protein sequence files is one of the most common tasks in bioinformatics. Occasionally, the need comes up to split a Fasta file into smaller pieces. This is a task for which I have written many a 5-line Perl script that I used only once. But after today, I don’t think I’ll be writing any more.

I’m a big fan of the GenomeTools toolkit and development library, and today I discovered their splitfasta tool. If you want to split your sequence data into a given number of files, run it with the -numfiles flag. If you want to split your data into files of a certain size (given in megabytes), run it with the -targetsize flag. splitfasta usually distributes the sequence data fairly evenly, although occasionally the last file is significantly smaller than the rest, especially when using the -targetsize flag. I saw similar behavior in the scripts I wrote previously, though, so I’m not too concerned.
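
To see why the last file tends to come out smaller when splitting by size, here is a minimal Python sketch of a greedy splitter. This is not splitfasta's actual implementation; the function names (read_records, split_by_target_size) and the target_bytes parameter are made up for illustration, and it assumes plain ASCII input so that character counts equal byte counts.

```python
from typing import Iterable, Iterator, List


def read_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def split_by_target_size(lines: Iterable[str], target_bytes: int) -> List[str]:
    """Greedily group whole records into chunks of roughly target_bytes."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for record in read_records(lines):
        current.append(record)
        current_size += len(record)  # assumes ASCII: 1 character == 1 byte
        # Close the chunk once it reaches the target; records are never cut.
        if current_size >= target_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0
    # Whatever is left over becomes the final, usually smaller, chunk.
    if current:
        chunks.append("".join(current))
    return chunks


# Made-up demo data: three short records.
demo = [">seq1\n", "ACGT\n", ">seq2\n", "ACGTACGT\n", ">seq3\n", "AC\n"]
chunks = split_by_target_size(demo, target_bytes=20)
print([len(chunk) for chunk in chunks])
```

Every chunk except the last is filled up to the target, and the final chunk simply holds the remainder, so it can be much smaller than the others.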

Just like everything else in the GenomeTools toolkit, splitfasta is implemented in C, so it’s fast. Enjoy!

standage@ubuntu:~/$ ls -lhp
total 98M
-rw-r--r-- 1 standage standage 98M 2012-01-20 08:50 cotton.fasta
standage@ubuntu:~/$ gt splitfasta -numfiles 13 cotton.fasta 
standage@ubuntu:~/$ ls -lhp
total 195M
-rw-r--r-- 1 standage standage  98M 2012-01-20 08:50 cotton.fasta
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.1
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.10
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.11
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.12
-rw-rw-r-- 1 standage standage 7.4M 2012-01-20 10:10 cotton.fasta.13
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.2
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.3
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.4
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.5
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.6
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.7
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.8
-rw-rw-r-- 1 standage standage 7.5M 2012-01-20 10:10 cotton.fasta.9
standage@ubuntu:~/$ rm cotton.fasta.*
standage@ubuntu:~/$ gt splitfasta -targetsize 15 cotton.fasta 
standage@ubuntu:~/$ ls -lhp
total 195M
-rw-r--r-- 1 standage standage  98M 2012-01-20 08:50 cotton.fasta
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.1
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.2
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.3
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.4
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.5
-rw-rw-r-- 1 standage standage  16M 2012-01-20 10:11 cotton.fasta.6
-rw-rw-r-- 1 standage standage 7.1M 2012-01-20 10:11 cotton.fasta.7
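
After a split like this, a quick sanity check is to confirm that the pieces together hold as many sequences as the original file. The following Python sketch does that by counting header lines; the helper names (count_records, pieces_match) and the demo file names are hypothetical and not part of GenomeTools.

```python
import glob
import os
import tempfile


def count_records(path):
    """Count FASTA records in a file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def pieces_match(original, pattern):
    """Return True if the files matching pattern hold as many records as original."""
    pieces = glob.glob(pattern)
    return count_records(original) == sum(count_records(p) for p in pieces)


# Small self-contained demo with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "demo.fasta")
    with open(original, "w") as out:
        out.write(">a\nACGT\n>b\nGG\n>c\nTTA\n")
    with open(original + ".1", "w") as out:
        out.write(">a\nACGT\n>b\nGG\n")
    with open(original + ".2", "w") as out:
        out.write(">c\nTTA\n")
    # The pattern needs a dot after the name, so the original is not matched.
    demo_ok = pieces_match(original, glob.escape(original) + ".*")

print(demo_ok)
```

For the files above, the equivalent call would be pieces_match("cotton.fasta", "cotton.fasta.*").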