Task automation with bash and parallel

Nobody gains significant experience in genome informatics without the drudgery of running the same command(s) over and over again on a set of files. Everybody has a different tolerance level for tedious tasks—I’m often surprised at the patience some people have to sit there and manually re-run the same command dozens of times. Regardless of your pain threshold, however, you’ll eventually run into a data set that is simply too large to be analyzed or processed piece by piece. Luckily, there are several approaches to task automation that 1) are very simple to use and 2) are portable across most UNIX and UNIX-like systems such as Linux and Mac OS X. This post will give a brief introduction into some of the techniques I have picked up over time.

The for loop

The easiest way to automate repetitive command-line tasks is with the bash for loop. Assuming you’ve exercised at least a minimal amount of discipline in naming your data files, it should be straightforward to apply the same command to many different data files.

The basic structure of a for loop is as follows.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt; done

If you’re putting the commands in a script (or describing them on a blog 🙂 ) it often increases readability if you split it across multiple lines like so.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt
done

The code in the middle gets executed once for each value specified on the first line. That’s it! That’s a for loop. If you’d like a more long-winded version, here’s the breakdown.

  1. for i in 1 2 3 4 5: This tells the shell that we want to repeat a command multiple times using multiple values. In this case the values are the integers 1 – 5, but these could be anything: names, sample labels, whatever. For this specific example, we could use bash range notation and replace the numbers with {1..5}, which is especially useful if you have a large number of serially numbered files (e.g. file1.txt, file2.txt, file3.txt, and so on); an example using this notation follows the list.
  2. do someCommand data$i.fastq > output$i.txt: This is the command we want to run multiple times. In this particular example, the command will be executed 5 times, and each time the variable $i will be filled in with one of the values we specified earlier (first 1, then 2, etc). This example assumes that we have 5 input files: data1.fastq, data2.fastq, ..., data5.fastq, and will create 5 corresponding output files: output1.txt, output2.txt, ..., output5.txt.
  3. done: This keyword indicates the end of the loop.
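
As a quick illustration of the brace-expansion shortcut mentioned above, the same loop could be written as follows.

for i in {1..5}
do
  someCommand data$i.fastq > output$i.txt
done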

There are some additional things you might want to consider.

  • How you name your files can make a big difference. There is no programming trick on earth that can help you if your input files are named like data_1.fastq, data-2.fastq, data3.fastq, data_4.fq, and data_five.fastq. Although it’s easy for us as humans to see what the pattern is, simplifying this so that a computer can easily process it requires you to be consistent with your file naming schemes.
  • Embedding a variable in a filename or command that has underscores can be problematic, since the underscore is a valid character for bash variables. Let’s take the example from above, but imagine instead the input files are named like data_1_trim.fq, data_2_trim.fq, and so on. We might be tempted to wrap someCommand data_$i_trim.fq > output$i.txt in our for loop, but this wouldn’t work. Bash will interpret $i_trim as a variable, instead of just $i as was intended. If there is ever any possibility of confusion, it’s always better to use explicit syntax and wrap variables in braces like so: someCommand data_${i}_trim.fq > output${i}.txt. The corrected loop is shown in full below.
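
For reference, here is the corrected loop in full, using the hypothetical data_*_trim.fq input files from the example above.

for i in {1..5}
do
  someCommand data_${i}_trim.fq > output${i}.txt
done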

The for loop, revisited

In a regular for loop, each command is executed sequentially. That is, the first iteration is executed, bash waits for it to finish, and only when it is complete does the loop proceed to run the second iteration (and so on for all iterations). If our loop runs 5 commands, and each takes 1 minute, then the regular for loop approach will take 5 minutes to run.

A very simple modification allows you to run all of the iterations simultaneously in parallel, without waiting for the first command to finish. This will allow us to run all 5 commands at once, getting all of our results in 1 minute (instead of 5). This is done by placing the ampersand character after the command in the for loop.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt & done

Here it is again in expanded form.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt &
done

There are a couple of things to consider here as well.

  • The ampersand character tells bash to run the command in the background. If you are typing these commands directly into your shell you should have no problems with this. However, if you have stored the commands in a script you will need to include a wait command after the loop: bash scripts do not normally wait for background processes to finish before exiting, and the wait command forces the script to wait for all of the background jobs to complete before it exits (see the example script after this list).
  • The ampersand provides a very simple and powerful extension of the for loop, but it requires caution. With this approach, each iteration of the loop will spawn a new process, and if the number of spawned processes exceeds the number of available processors/cores, this could lead to performance issues. Only use this approach if you are sure your machine has at least as many processors as your loop has iterations.
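
For example, a script version of the parallelized loop would look something like this (a minimal sketch).

#!/usr/bin/env bash
for i in {1..5}
do
  someCommand data$i.fastq > output$i.txt &
done
wait  # do not let the script exit until all background jobs have finished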

The GNU parallel command

Although the bash for loop is simple and powerful, there are cases where it doesn’t work too well. The most common case is when you have a loop with a large number of iterations and you want to speed it up using multiple processors, but the number of iterations is much larger than the number of processors. For instance, you may have hundreds of samples you want to run some quality control command on, and your desktop has 16 processors. The normal for loop described above is not optimal because it will only use 1 processor at a time. The parallelized for loop described above doesn’t work either, because it will try to run all of the samples at once, quickly overloading the computer. We need something that will run the command on all of our hundreds of samples, but only keep 16 processes running at a time. Enter the GNU parallel command.

Let’s continue with the scenario described above, but imagine we have 512 input data files instead of 5. Assuming the file names are numbered appropriately, we can process these files, 16 at a time, with the following command.

parallel --jobs 16 someCommand data{}.fastq '>' output{}.fastq ::: {1..512}

If you take a moment to look at this, it’s very similar to the for loop.

  • Instead of specifying the iteration values at the beginning (1 2 3 4 5 or just {1..5}), we specify them at the end after the triple colon. Alternatively, parallel can read these from standard input or a file (see the examples after this list).
  • Instead of using a loop variable like $i, we simply use empty curly braces {}.
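
For example, here are two alternative ways to supply the iteration values, assuming a hypothetical file samples.txt that lists one value per line.

# read the values from a file, one per line
parallel --jobs 16 someCommand data{}.fastq '>' output{}.fastq :::: samples.txt

# or pipe the values in on standard input
seq 1 512 | parallel --jobs 16 someCommand data{}.fastq '>' output{}.fastq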

There are a few considerations to note here as well.

  • Note the file redirect symbol > in quotes. If you do not put this in quotes, it will be interpreted by your shell before parallel ever sees it, redirecting parallel’s own output to a single file rather than redirecting the output of each individual job.
  • The GNU parallel command is relatively recent and is not available on many systems by default. If you’re on Linux, it should be fairly easy to find and install using your system’s package manager (apt-get, yum, and so on). On Mac you can install it using Homebrew.
  • Some versions of parallel may require you to add the --gnu flag for correct behavior. I have no idea which versions require this or why. Basically, if the command fails right away with an error message, try adding or removing the flag.
  • The parallel command supports multiple arguments per command. This isn’t really helpful for the example discussed above, but check out the man page for parallel, and specifically the description of the -N flag and numbered arguments (a brief sketch follows this list).
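
As a rough sketch of numbered arguments, suppose each job needs a pair of input files (paired-end reads, say). The -N 2 flag hands the argument list to each job two values at a time, {1} and {2} refer to the first and second value of that pair, and {1.} is the first value with its file extension removed. The file names here are hypothetical.

parallel --jobs 16 -N 2 someCommand {1} {2} '>' {1.}.out ::: data1_R1.fq data1_R2.fq data2_R1.fq data2_R2.fq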

Thanks to oletange for pointing out the parallel file redirection issue.

4 comments

  1. oletange

    This example:

    parallel --jobs 16 someCommand data{}.fastq > output{}.fastq ::: {1..512}
    

    should be:

    parallel --jobs 16 someCommand data{}.fastq '>' output{}.fastq ::: {1..512}
    

    Otherwise the > will not be seen by GNU Parallel.

  2. Pingback: Leading zeros with bash brace expansion | BioWize
    • Daniel Standage

      The {1..512} part is called a brace expansion: it’s essentially a shortcut for a numerical range. If you type echo {5..9} in your terminal, it prints 5 6 7 8 9.

      The ::: part separates the command (to be executed by parallel) from the arguments. The command gets executed once for each argument, which replaces the {} placeholder, as described in the blog post.
