Category: Bash

Leading zeros with bash brace expansion

I use bash brace expansion extensively, both in interactive terminal sessions and in shell scripts. The idea is pretty simple: the syntax {n..m} expands to the full list of integers from n to m, inclusive. I’ve used this syntax several times in examples on this blog, including a recent post about task automation in bash.

Today one of my social media acquaintances pointed out on Twitter that if you want the values in the expansion to have leading zeros, you can simply add the leading zeros to the first number in the range. I checked that this works for arbitrary ranges, so things like {001..200} and {00001..50000} also work as expected.
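
A quick sanity check at the prompt confirms the behavior (the zero-padding requires bash 4 or newer, if I remember correctly):

echo {1..5}                 # 1 2 3 4 5
echo {01..05}               # 01 02 03 04 05
echo {098..102}             # 098 099 100 101 102
echo sample{01..12}.fastq   # sample01.fastq sample02.fastq ... sample12.fastq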

This is a very nice feature I didn’t know about before. Even when you’re lucky enough to have a data set that is nicely organized with well-thought-out filenames, it’s a 50/50 chance whether serially numbered files/samples will have leading zeros or not. This bash feature makes it trivial to handle both cases!

The fastest darn Fastq decoupling procedure I ever done seen

There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…

The Fastq format is one of the most pervasive formats in biology and bioinformatics, as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair at the same position in two paired files—that is, the first record in file 1 and the first record in file 2 form a pair, the second record in file 1 and the second record in file 2 form a pair, and so on.

I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastq files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly each have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are that if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.

The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.

I’ve blogged before about how amazingly useful the paste command can be for working with Fastq data. In fact, combining paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!

# Each Fastq record spans four lines, so 'paste - - - -' joins every record onto
# a single tab-separated line; pasting the two files side by side and converting
# the tabs back to newlines yields the interleaved file.
paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) \
    | tr '\t' '\n' \
    > reads-int.fastq

However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the split-paired-reads.py script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses paste and process substitution, but it also takes advantage of the oft-overlooked tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).

# Here 'paste - - - - - - - -' puts each read pair (eight lines) on one line; tee
# sends a copy through the process substitution, which writes fields 1-4 (mate 1)
# to reads-1.fastq, while the main pipe extracts fields 5-8 (mate 2).
paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
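
Although we set compression aside above, for what it’s worth the same idea extends to gzip-compressed files; a rough sketch, with hypothetical filenames:

paste - - - - - - - - < <(gzip -dc reads-int.fastq.gz) \
    | tee >(cut -f 1-4 | tr '\t' '\n' | gzip > reads-1.fastq.gz) \
    | cut -f 5-8 | tr '\t' '\n' | gzip > reads-2.fastq.gz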

This procedure did the conversion in 2.5 minutes, while I terminated the incomplete split-paired-reads.py procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.


*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.

‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its largest selling point for me. And while shell tools clearly outperformed khmer for this simple use case, khmer also parses other data formats in both compressed and uncompressed form, and probably supports multiline Fastq and broken paired Fastq as well. And of course, khmer integrates very nicely with environments like the IPython notebook, whereas integration of shell commands could potentially be a bit more kludgy.

Task automation with bash and parallel

Nobody gains significant experience in genome informatics without the drudgery of running the same command(s) over and over again on a set of files. Everybody has a different tolerance level for tedious tasks—I’m often surprised at the patience some people have to sit there and manually re-run the same command dozens of times. Regardless of your pain threshold, however, you’ll eventually run into a data set that is simply too large to be analyzed or processed piece by piece. Luckily, there are several approaches to task automation that 1) are very simple to use and 2) are portable across most UNIX and UNIX-like systems such as Linux and Mac OS X. This post will give a brief introduction to some of the techniques I have picked up over time.

The for loop

The easiest way to automate repetitive command-line tasks is with the bash for loop. Assuming you’ve exercised at least a minimal amount of discipline in naming your data files, it should be straightforward to apply the same command to many different data files.

The basic structure of a for loop is as follows.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt; done

If you’re putting the commands in a script (or describing them on a blog 🙂 ) it often increases readability if you split it across multiple lines like so.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt
done

The code in the middle gets executed once for each value specified on the first line. That’s it! That’s a for loop. If you’d like a more long-winded version, here’s the breakdown.

  1. for i in 1 2 3 4 5: This syntax tells the shell that we want to repeat a command multiple times using multiple values. In this case the values are the integers 1 through 5, but they could be anything: names, sample labels, whatever. For this specific example, we could use bash brace expansion and replace the numbers with {1..5}, which is especially useful if you have a large number of serially numbered files (e.g. file1.txt, file2.txt, file3.txt, and so on).
  2. do someCommand data$i.fastq > output$i.txt: This is the command we want to run multiple times. In this particular example, the command will be executed 5 times, and each time the variable $i will be filled in with one of the values we specified earlier (first 1, then 2, etc). This example assumes that we have 5 input files: data1.fastq, data2.fastq, ..., data5.fastq, and will create 5 corresponding output files: output1.txt, output2.txt, ..., output5.txt.
  3. done: This keyword indicates the end of the loop.

There are some additional things you might want to consider.

  • How you name your files can make a big difference. There is no programming trick on earth that can help you if your input files are named like data_1.fastq, data-2.fastq, data3.fastq, data_4.fq, and data_five.fastq. Although it’s easy for us as humans to see what the pattern is, simplifying this so that a computer can easily process it requires you to be consistent with your file naming schemes.
  • Embedding a variable in a filename or command that contains underscores can be problematic, since the underscore is a valid character in bash variable names. Let’s take the example from above, but imagine instead the input files are named like data_1_trim.fq, data_2_trim.fq, and so on. We might be tempted to put someCommand data_$i_trim.fq > output$i.txt in our for loop, but this wouldn’t work: bash will interpret $i_trim as a single variable, instead of just $i as was intended. If there is ever any possibility of confusion, it’s better to be explicit and wrap variables in braces, like so: someCommand data_${i}_trim.fq > output${i}.txt (see the short example after this list).
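
A quick way to see the difference is to echo the filenames instead of running a real command (no files required):

for i in 1 2 3
do
  echo "data_$i_trim.fq"     # bash looks for a variable named 'i_trim' (unset), so this prints 'data_.fq'
  echo "data_${i}_trim.fq"   # the braces delimit the variable name: prints 'data_1_trim.fq', 'data_2_trim.fq', ...
done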

The for loop, revisited

In a regular for loop, each command is executed sequentially. That is, the first iteration is executed, bash waits for it to finish, and only when it is complete does the loop proceed to run the second iteration (and so on for all iterations). If our loop runs 5 commands, and each takes 1 minute, then the regular for loop approach will take 5 minutes to run.

A very simple modification allows you to run all of the iterations simultaneously in parallel, without waiting for the first command to finish. This will allow us to run all 5 commands at once, getting all of our results in 1 minute (instead of 5). This is done by placing the ampersand character after the command in the for loop.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt & done

Here it is again in expanded form.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt &
done

There are a couple of things to consider here as well.

  • The ampersand character tells bash to run the command in the background. If you are typing these commands directly into your shell you should have no problems with this. However, if you have stored the commands in a script you will need to include a wait command after the loop (as shown in the sketch following this list). This is because bash scripts do not normally wait for background processes to finish before exiting; the wait command forces the script to wait for all of the loop’s background jobs to complete before it exits.
  • The ampersand provides a very simple and powerful extension of the for loop, but it requires caution. With this approach, each iteration of the loop will spawn a new process, and if the number of spawned processes exceeds the number of available processors/cores, this can lead to performance issues. Only use this approach if you are sure your machine has at least as many processors as there are iterations in your loop.
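
For example, here is what a backgrounded loop might look like inside a script, with someCommand standing in for whatever program you are running:

#!/usr/bin/env bash
for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt &   # the & sends this iteration to the background
done
wait   # without this, the script could exit before the background jobs have finished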

The GNU parallel command

Although the bash for loop is simple and powerful, there are cases where it doesn’t work too well. This is primarily when you have loops with a large number of iterations and you want to speed up these iterations by using multiple processors, but the number of iterations is much larger than the number of processors. For instance, you may have hundreds of samples you want to run some quality control command on, and your desktop has 16 processors. The normal for loop described above is not optimal because it will only use 1 processor at a time. The parallelized for loop described above doesn’t work because it will try to run all of the samples at once, quickly overloading the computer. We need something that will run the command on all of our hundreds of samples, but only keep 16 processes running at a time. Enter the GNU parallel command.

Let’s continue with the scenario described above, but instead imagine we had 512 input data files instead of 5. Assuming the file names are numbered appropriately, we can process these files, 16 at a time, with the following command.

parallel --jobs 16 someCommand data{}.fastq '>' output{}.txt ::: {1..512}

If you take a moment to look at this, it’s very similar to the for loop.

  • Instead of specifying the iteration values at the beginning (1 2 3 4 5 or just {1..5}), we specify them at the end, after the triple colon. Alternatively, parallel can read these values from standard input or from a file (see the example after this list).
  • Instead of using a loop variable like $i, we simply use empty curly braces {}.
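
For instance, the values can be piped in on standard input instead of being listed after the triple colon; a quick sketch using the same someCommand placeholder as above:

seq 1 512 | parallel --jobs 16 someCommand data{}.fastq '>' output{}.txt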

There are a few considerations to note here as well.

  • Note the file redirect symbol > in quotes. If you do not put this in quotes, it will be interpreted by the shell instead of the parallel command.
  • The GNU parallel command is relatively recent and is not available on many systems by default. If you’re on Linux, it should be fairly easy to find and install using your system’s package manager (apt-get, yum, and so on). On Mac you can install it using Homebrew.
  • Some versions of parallel may require you to add the --gnu flag for correct behavior. I have no idea which versions require this or why. Basically, if the command fails right away with an error message try adding or removing the flag.
  • The parallel command supports multiple arguments per command. This isn’t really helpful for the example discussed above, but check out the man page for parallel, and specifically the description of the -N flag and numbered arguments; there is a rough sketch below.
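
As a rough sketch (filenames hypothetical): with -N 2, parallel hands each job two arguments, referenced as {1} and {2}, and {#} expands to the job’s sequence number.

parallel --jobs 16 -N 2 someCommand {1} {2} '>' output{#}.txt \
    ::: sampleA-1.fq sampleA-2.fq sampleB-1.fq sampleB-2.fq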

Thanks to oletange for pointing out the parallel file redirection issue.

Writing clean Bash code

The shell is a paradox. It is mysterious in its complexity and marvelous in its simplicity. You can learn the bare basics with just a couple of hours of tinkering, but becoming a shell master takes years of consistent use, and even then it feels like you learn something new every day. On one hand, it has an ugly syntax, and trying to work with strings and arrays is beyond frustrating. On the other hand, there’s nothing quite like the zen achieved when you stitch together a half dozen shell commands to extract and format your data of interest, creating a pipeline that outperforms any Python or C++ code you could ever write.

Whatever your feelings about the shell in general, or about (the most popular shell) Bash in particular, you cannot deny that it is a rustic language with very poor debugging facilities that makes it very easy to shoot yourself in the foot. If you can avoid writing code in Bash, you should. In fact, as blogger Jeff Atwood says, if you can avoid writing code at all, you should. But if you’re a modern biologist then chances are that sooner or later the shell will be the best tool for a particular job. What do you do then?

Over the last few months I’ve really tried to look for ways to improve my shell code. Here are some of the resources I’ve found.

  • Defensive Bash Programming: The author of this blog post provides some suggestions for keeping Bash code clean and readable—often a real challenge when stitching together shell commands in a pipeline, for example. He also provides suggestions for organizing the code and using variables carefully to avoid common pitfalls. If I’m writing a simple Bash script with just a few lines, I’d probably ignore most of this advice, but the longer the script gets, the more important it is to take extra steps to be careful and make your intent clear.
  • Shellcheck: So I was telling the truth when I said there’s no debugger for Bash, but this tool does its best to remedy that. If you paste a shell script into Shellcheck, it will analyze the code and point out potential issues or pitfalls.
  • Unofficial Bash strict mode: This is probably my favorite. The Perl language has a strict mode that makes writing and debugging Perl code a whole lot less frustrating. Bash has no such strict mode, but there are some things you can do to enforce safer behavior in your bash scripts. This blog post suggests placing two lines at the beginning of every Bash script (see the sketch after this list), and then describes why this is a good idea in a variety of scenarios. I haven’t taken the suggestion fully to heart quite yet, but I have begun placing set -eo pipefail at the beginning of most of my Bash scripts, and doing so has already saved me tons of heartache and wasted time.
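
For reference, here is (if memory serves) the two-line header that post recommends, along with the lighter variant mentioned above:

#!/usr/bin/env bash

# the "unofficial strict mode" header suggested in the post linked above
set -euo pipefail
IFS=$'\n\t'

# the lighter variant I have settled on for now:
# set -eo pipefail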

So the next time you absolutely, positively have to write some shell code, consider referring to these resources to improve the experience!

Bash tricks: getopts and die signal handling

For better and for worse, Perl has been my go-to scripting language for years. I’ve since learned Python, and can appreciate why it has won so many crazed evangelists (er, enthusiasts) in the programming community (in general) and the scientific computing community (in particular). However, I’m all about using the best (well, most convenient) tool for the job, and sometimes the best glue for Your Little Bioinformatics Tool is a makefile, or even just a simple little shell script.

Recently I was writing a bash script to implement a very simple procedure, stringing together the results of several calls to small scripts and programs I had written. As is typical for bash scripts I have written in the past, I used positional command-line arguments for any values I needed to adjust on a run-by-run basis, and then accessed these in the script using the variables $1, $2, and so on.

As I started running the script to do my analyses, I began thinking, “I wish there were a better way to do this, to make some arguments optional but others required—something like getopts.” Well, a simple Google search solved that one for me. A few minutes later, I had put a nice command-line interface on my bash script. The syntax is really pretty simple.

# Usage statement
print_usage()
{
  cat <<EOF
Usage: $0 [options] genomeseq.fasta annotation.gff3
  Options:
    -c    some important cutoff value; default is 0.2
    -d    debug mode
    -h    print this help message and exit
    -o    file to which output will be written; default is 'ylt.txt'
    -t    home directory for YourLittleTool; default is '/usr/local/src/YourLittleTool'
EOF
}

# Command-line option parsing
CUTOFF=0.2
DEBUG=0
YLTHOME="/usr/local/src/YourLittleTool"
OUTFILE="ylt.txt"
while getopts "c:dho:t:" OPTION
do
  case $OPTION in
    c)
      CUTOFF=$OPTARG
      ;;
    d)
      DEBUG=1
      ;;
    h)
      print_usage
      exit 0
      ;;
    o)
      OUTFILE=$OPTARG
      ;;
    t)
      YLTHOME=$OPTARG
      ;;
  esac
done

# Remove arguments associated with options
shift $((OPTIND-1))

# Verify the two required positional arguments are there
if [[ $# != 2 ]]; then
  echo -e "error: please provide 2 input files (genome sequence file (Fasta format) and annotation file (GFF3 format))\n"
  print_usage
  exit 1
fi
FASTA=$1
GFF3=$2

# Now implement the procedure of your little tool

So on one hand, this does add quite a bit to a bash script that originally had only 4-8 lines of logic. But on the other hand, with not too much work on my part, it now has a convenient, self-documenting interface that makes it much easier for someone else in my lab (or, if I’m so lucky, someone “out there”) to use it in the future.

As I was sprucing up the bash script, I also decided to investigate another feature I was interested in. This particular procedure creates a new directory, into which several data files, graphics, and HTML reports are written. If the procedure failed and terminated prematurely, I wanted the default behavior to be that the output directory gets deleted so as not to interfere with subsequent run attempts (and of course I provided an option not to delete the output on failure, which is essential for troubleshooting bugs in the pipeline). I had already added set -e to the script, which kills execution of the script if any command returns an unsuccessful status. While this is very convenient, it could have made it pretty complicated to delete incomplete output at whichever stage of the pipeline the failure occurred.

Enter trap. This builtin associates a handler function with various signals and conditions, one of which is ERR, fired when a command in the script fails (and, with set -e in effect, terminates the script).

die_handler()
{
  # Remove the (partial) output directory unless the -d (debug) option was given.
  # $OUTDIR is the directory created by the main procedure (not shown here).
  if [[ $DEBUG -eq 0 ]]; then
    rm -r "$OUTDIR"
  fi
}
trap die_handler ERR

The trap statement above essentially says: if an error causes premature script termination, run the die_handler function.
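
Putting the pieces together, a minimal skeleton of the error-handling setup might look something like this (the directory name and pipeline steps are hypothetical):

#!/usr/bin/env bash
set -e                        # abort as soon as any command fails

DEBUG=0                       # set to 1 (e.g. via the -d option) to keep output on failure
OUTDIR="ylt-output"           # hypothetical name for the output directory

die_handler()
{
  if [[ $DEBUG -eq 0 ]]; then
    rm -r "$OUTDIR"           # discard partial output unless we are debugging
  fi
}
trap die_handler ERR

mkdir "$OUTDIR"
# ... run the pipeline steps, writing results into "$OUTDIR" ...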

I’ve always considered bash scripts to be pretty hackish, and I’m not sure this experience has completely changed that opinion (lipstick on a pig?). However, for this particular case I was very happy I was able to combine the convenience of a bash script with the flexibility and power provided by getopts and event-based error handling.