Reference transcript data set for insects

I participate in several science- and programming-related online forums (as I discussed recently), and lately I’ve seen quite a lot of requests for pointers to a set of DNA sequences that can be used for testing this or that (for example, see this thread). Many of these requests seem to come from programmers with little biological intuition who simply want or need to play around with some biomolecular sequence data. My first instinct is to tell them to go to GenBank or RefSeq and download some sequences, but the sheer amount of data can be pretty intimidating for anyone who does not know precisely what they are looking for, biologist or not.

This morning I decided to take some time to create a small reference data set. I was planning on making it span most of eukaryotic diversity, but after realizing how long that would take, I decided to focus on insects (maybe I’ll do plants later, and then mammals, and so on; it’s much less of a drain if it can be broken up into smaller tasks). The database I created contains transcript sequences for 8 model insect genomes: Aedes aegypti, Atta cephalotes, Apis mellifera, Acyrthosiphon pisum, Bombus impatiens, Drosophila melanogaster, Harpegnathos saltator, and Nasonia vitripennis.

Rather than posting the data set itself, I’ll post the Makefile I used to put it together. Enjoy!

DBNAME=insect-db.fa
TRANSCRIPTS=aaeg-trans.fa acep-trans.fa amel-trans.fa apis-trans.fa bimp-trans.fa dmel-trans.fa hsal-trans.fa nvit-trans.fa

# Full database: concatenate the per-species files, then delete them.
# Drop the rm line to keep the per-species files (the commands in the PS read
# them, and make will otherwise re-download everything on the next run).
$(DBNAME):		$(TRANSCRIPTS)
				cat $(TRANSCRIPTS) > $(DBNAME)
				rm $(TRANSCRIPTS)

# Download and decompress transcripts for each genome

# Aedes aegypti
aaeg-trans.fa:	
				curl -o aaeg-trans.fa.gz ftp://ftp.vectorbase.org/public_data/organism_data/aaegypti/Geneset/aaegypti.TRANSCRIPTS-AaegL1.2.fa.gz
				gunzip aaeg-trans.fa.gz

# Atta cephalotes
acep-trans.fa:	
				curl -o acep-trans.fa.gz http://hymenopteragenome.org/drupal/sites/hymenopteragenome.org.atta/files/data/acep_OGSv1.2_transcript.fa.gz
				gunzip acep-trans.fa.gz

# Apis mellifera
amel-trans.fa:	
				curl -o amel-trans.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Apis_mellifera/RNA/rna.fa.gz
				gunzip amel-trans.fa.gz

# Acyrthosiphon pisum
apis-trans.fa:	
				curl -o apis-trans.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Acyrthosiphon_pisum/RNA/rna.fa.gz
				gunzip apis-trans.fa.gz

# Bombus impatiens
bimp-trans.fa:	
				curl -o bimp-trans.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Bombus_impatiens/RNA/rna.fa.gz
				gunzip bimp-trans.fa.gz

# Drosophila melanogaster
dmel-trans.fa:
				curl -o dmel-trans.fa.gz ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-transcript-r5.44.fasta.gz
				gunzip dmel-trans.fa.gz

# Harpegnathos saltator
hsal-trans.fa:
				curl -o hsal-trans.fa.gz http://hymenopteragenome.org/drupal/sites/hymenopteragenome.org.harpegnathos/files/data/hsal_OGSv3.3_transcript.fa.gz
				gunzip hsal-trans.fa.gz

# Nasonia vitripennis
nvit-trans.fa:	
				curl -o nvit-trans.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Nasonia_vitripennis/RNA/rna.fa.gz
				gunzip nvit-trans.fa.gz
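
Once make finishes, a quick sanity check is to count the records in the result. The snippet below is only a minimal sketch and was not part of the original workflow: fasta_count is a helper name made up for illustration, and it assumes every record starts with a line beginning with ">".

```shell
# fasta_count is a hypothetical helper, not part of the Makefile above.
# It prints the number of records (lines starting with ">") in a FASTA file.
fasta_count() {
    grep -c '^>' "$1"
}

# Report the size of the combined database if it has been built
if [ -f insect-db.fa ]; then
    fasta_count insect-db.fa
fi
```

A count of zero, or one far smaller than expected, would suggest that one of the downloads failed.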

PS

I later ran the following commands to standardize the FASTA deflines (headers). They read the per-species files and append the results to a new file, insect-transcripts.fa.

perl -ne 'if(m/^>(AAEL\S+)/){printf(">gnl|Aedes_aegypti|%s\n", $1)}else{print}' < aaeg-trans.fa >> insect-transcripts.fa
perl -ne 's/Acep_1\.0/Atta_cephalotes/; print' < acep-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Apis mellifera/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Apis_mellifera|%s\n", $1)}else{print}' < amel-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Acyrthosiphon pisum/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Acyrthosiphon_pisum|%s\n", $1)}else{print}' < apis-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/Bombus impatiens/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Bombus_impatiens|%s\n", $1)}else{print}' < bimp-trans.fa >> insect-transcripts.fa 
perl -ne 'if(m/^>(\S+)/){printf(">gnl|Drosophila_melanogaster|%s\n", $1)}else{print}' < dmel-trans.fa >> insect-transcripts.fa 
perl -ne 's/Hsal_3\.3/Harpegnathos_saltator/; print' < hsal-trans.fa >> insect-transcripts.fa
perl -ne 'if(m/Nasonia vitripennis/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Nasonia_vitripennis|%s\n", $1)}else{print}' < nvit-trans.fa >> insect-transcripts.fa
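
To see what one of these rewrites does, and to check the combined result, here is a small sketch. The sample defline is invented for illustration (real NCBI records will differ), count_by_species is a hypothetical helper, and the tally assumes every defline ends up in the ">gnl|Species_name|ID" form, which is an assumption for the Atta and Harpegnathos files in particular.

```shell
# Made-up record in the NCBI "gi|...|ref|...|" style (not real data)
sample='>gi|12345|ref|XM_000001.1| PREDICTED: Apis mellifera hypothetical mRNA'

# Same rewrite as the Apis mellifera command above, applied to the sample
printf '%s\nACGT\n' "$sample" |
    perl -ne 'if(m/Apis mellifera/){m/gi\|\d+\|ref\|(\S+)\|/ and printf(">gnl|Apis_mellifera|%s\n", $1)}else{print}'
# prints ">gnl|Apis_mellifera|XM_000001.1" followed by the sequence line

# count_by_species is a hypothetical helper: it tallies records per species
# label, i.e. the second "|"-separated field of each ">gnl|..." defline.
count_by_species() {
    awk -F'|' '$1 == ">gnl" { n[$2]++ } END { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | sort
}

# Tally the standardized file if it exists
if [ -f insect-transcripts.fa ]; then
    count_by_species insect-transcripts.fa
fi
```

If everything worked, the tally should list eight species labels. A missing label points to a defline pattern that did not match.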