Working with lots of files

I’ve worked with plenty of large-scale data sets before, but those have typically been large in terms of file size. Recently, I worked with a large-scale data set that was quite different: nucleotide and protein sequence data for tens of thousands (maybe even hundreds of thousands) of plant species, stored in hundreds of thousands of files in the same directory.

I needed to do some basic management tasks, such as indexing the data for BLAST searches. My typical approach to highly repetitive tasks is to use a bash for loop, something like this:

for db in *.fasta
do
  # Build a nucleotide BLAST database for each FASTA file
  makeblastdb -in "$db" -dbtype nucl -parse_seqids
done

However, this did not work with so many files in the directory. In fact, my entire approach to file management was absolutely useless. I couldn’t even use ls to list the files or see how many there were!

[standage@bgdata PlantDNA]$ ls *.fasta | wc -l
-bash: /bin/ls: Argument list too long
0
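
The error comes from the kernel, not from ls itself: the shell happily expands *.fasta, but the combined length of all those filenames exceeds the system's argument-length limit, so the command can't even be launched. As a quick sanity check (a hedged sketch; the -maxdepth flag assumes GNU or BSD find), you can inspect the limit and count the files by letting find do the pattern matching instead of the shell:

# Roughly how many bytes of arguments a single command may receive
getconf ARG_MAX

# Count the FASTA files without expanding the glob in the shell;
# find prints one matching path per line
find . -maxdepth 1 -type f -name '*.fasta' | wc -l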

I did some searching on Google and found a variety of solutions. The one I like best, though, involves the find command: it runs makeblastdb once per matching file, so the shell never has to expand the glob.

find . -type f -name '*.fasta' -exec makeblastdb -in {} -dbtype nucl -parse_seqids \;

That did the trick!
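
One thing to note: -exec ... \; launches a separate makeblastdb process for every file, which is exactly why it sidesteps the argument-length limit, but it also means the databases are built strictly one at a time. If the machine has cores to spare, a hedged variant (assuming GNU or BSD xargs with -0 and -P support) is to pipe find's output to xargs and fan the jobs out in parallel; the -print0/-0 pair keeps filenames with odd characters intact:

# Run up to four makeblastdb jobs at once, one file per job
find . -type f -name '*.fasta' -print0 \
  | xargs -0 -P 4 -I{} makeblastdb -in {} -dbtype nucl -parse_seqids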
