Reference transcript data set for insects

I participate in several science- and programming-related online forums (as I discussed recently), and recently I’ve seen quite a lot of requests for pointers to some set of DNA sequences that can be used for testing this or that (for example, see this thread). A lot of these requests seem to come from programmers with little biological intuition that simply want/need to play around with some biomolecular sequence data. My first instinct is to tell them to just go to GenBank or RefSeq and download some sequences, but I guess the raw amount of data can be pretty intimidating for anyone that does not know precisely what they are looking for, biologist or not.

This morning I decided to take some time to create a small reference data set. I was planning on making it span most of eukaryotic diversity, but after realizing how long that would take, I decided to simply focus on insects (maybe I’ll do plants later, and then mammals, and so on—it’s much less of a drain if it can be broken up into smaller tasks). The database I created contains transcript sequences for 8 model insect genomes: Aedes aegypti, Atta cephalotes, Apis mellifera, Acyrthosiphon pisum, Bombus impatiens, Drosophila melanogaster, Harpegnathos saltator, and Nasonia vitripennis.

Rather than posting the dataset itself, I’ll go ahead and post the Makefile I used to put together the dataset. Enjoy!


# Full database
$(DBNAME):		aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa
				cat  aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa > $(DBNAME)
				rm aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa

# Download and decompress transcripts for each genome

# Aedes aegypti
				curl -o aaeg-trans.fa.gz
				gunzip aaeg-trans.fa.gz

# Atta cephalotes
				curl -o acep-trans.fa.gz
				gunzip acep-trans.fa.gz

# Apis mellifera
				curl -o amel-trans.fa.gz
				gunzip amel-trans.fa.gz

# Acyrthosiphon pisum
				curl -o apis-trans.fa.gz
				gunzip apis-trans.fa.gz

# Bombus impatiens
				curl -o bimp-trans.fa.gz
				gunzip bimp-trans.fa.gz

# Drosophila melanogaster
				curl -o dmel-trans.fa.gz
				gunzip dmel-trans.fa.gz

# Harpegnathos saltator
				curl -o hsal-trans.fa.gz
				gunzip hsal-trans.fa.gz

# Nasonia vitripennis
				curl -o nvit-trans.fa.gz
				gunzip nvit-trans.fa.gz


I later ran the following commands to standardize the Fasta deflines (headers).

perl -ne 'if(m/^>(AAEL\S+)/){printf(">gnl|Aedes_aegypti|%s\n", $1)}else{print}' < aaeg-trans.fa >> insect-transcripts.fa
perl -ne 's/Acep_1\.0/Atta_cephalotes/; print' < acep-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Apis mellifera/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Apis_mellifera|%s\n", $1)}else{print}' < amel-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Acyrthosiphon pisum/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Acyrthosiphon_pisum|%s\n", $1)}else{print}' < apis-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Bombus impatiens/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Bombus_impatiens|%s\n", $1)}else{print}' < bimp-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/^>(\S+)/){printf(">gnl|Drosophila_melanogaster|%s\n", $1)}else{print}' < dmel-trans.fa >> insect-transcripts.fa 
perl -ne 's/Hsal_3.3/Harpegnathos_saltator/; print' < hsal-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Nasonia vitripennis/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Nasonia_vitripennis|%s\n", $1)}else{print}' < nvit-trans.fa >> insect-transcripts.fa

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s