Category: Programming

October 2, 2015

Assert statements: software crumple zones

Like many bioinformaticians, I have very little formal training in software engineering. As an undergrad, I got a computer science minor and completed coursework in data structures, algorithms, logic, and programming, but never in the art and craft of software engineering. I feel confident in saying that I have come a long way in grad school—despite the lack of incentives for good software engineering in academic research—but what I know now I’ve learned from reading blogs and StackOverflow posts, from using and contributing to open source software projects, and from good old-fashioned trial-and-error.

In the early days of my software development experience, I learned the value of assert statements. Assertions allow you to explicitly declare assumptions about a program’s state at a particular point in time, and will terminate the program immediately if those assumptions are violated. Liberal use of assertions, coupled with automated unit tests and functional tests, help you to quickly identify where things are wrong when your software isn’t working.

Also in my early days, I picked up on the (supposed) axiom that

Assert Statements Should Never Be Used in Production Code™

While the value of assertions during development is undisputed, crashing a website or a mobile app in production is proclaimed as a serious software engineering cardinal sin. Thus, the idea is that you use assertions while writing the software, hope and pray that your automated tests cover a sufficiently broad set of use cases (hint: they don’t), and then when you’re ready to publish your app or launch your website you disable the assertions so that your users are spared the (cyber-)carnage of an all-out crash when your software fails.

A recent post on Twitter (by a professional software engineer working for Apple, I think) got me thinking about this again.

Thinking about whether leaving assertions enabled in production to avoid things worse than crashes is analogous to crumple zones in cars

— David Smith (@Catfish_Man) September 10, 2015

I like the analogy with crumple zones in cars, which are designed to collapse in case of an accident and absorb impact to protect more vital parts of the vehicle. If you follow the tweet and read the replies, it’s clear that many software developers are beginning to recognize the damage mitigation potential of asserts (even in production settings) and that there are worse things than crashing, especially when it comes to database integrity or (even more importantly) sensitive user data.

Furthermore, scientific research software is not the same as the software powering dating websites or mobile phone apps (shocking, I know). The target user of most phone and web apps is often your average shopper, single adult, or business person. The target user of most scientific software, however, is one of a relatively small number of scientists with related research interests and the necessary background to explore a particular set of research questions. Informative assert statements go a long way toward helping you and your colleagues understand whether the software failed because of corrupt/erroneous data, because of a bug in the code, or because of faulty assumptions made by the programmer. And while it’s frustrating to have software crash in any situation, assertions give you the opportunity to crash the program gracefully rather than continuing in an undefined state and running the risk of crashing later with cryptic messages like “Segmentation Fault”, or worse, not crashing at all. There is nothing more terrifying as a scientist than to base your conclusions on incomplete or flawed data that the software should have warned you about.

So to those just getting started in writing scientific research software, I make the following recommendations.

Use assert statements liberally

Use them at the beginning of functions to test your assumptions about the function arguments/parameters. Use them in the middle of functions to test your assumptions about the results of intermediate computations. Use them at the end of functions to test your assumptions about return values. Use them anywhere and everywhere you can to make your assumptions about the data and code explicit.

Provide informative messages

Having an assertion is usually better than not having an assertion, but a good message can make a huge difference in whether anybody can make sense of the error–including you, after not having looked at the code for 3 months, and trying to re-run it while addressing reviewer comments on your manuscript.

This is what I would call “good”.

Assertion failed: (ni == ne - 1), function do_infer, file main.c, line 193

This is what I would call “better”.

Assertion failed: (num_introns == num_exons - 1), function infer_gene_structure,
    file main.c, line 193

And this is what I would call “best”.

Error: while inferring gene structure, found 7 exons and 5 introns (expected 6).
Please check your annotations for directly abutting exon features.
Assertion failed: (num_introns == num_exons - 1), function infer_gene_structure,
    file main.c, line 193

Provide contact info and/or links to a bug tracker

Your software’s source code is already hosted on a service like GitHub or BitBucket, right? If so, your software already has a bug tracker. Include a link to the bug tracker so that users can easily contact you regarding issues they have. I generally prefer using GitHub’s issue tracker to email, since the bug report is public and I can refer to it with patches and pull requests. But in the very least, the users should know how to contact the author of the software in case they run into trouble.

August 5, 2015

Reproducible software behavior with random seeds

Selecting a random subset of data or generating random numbers is a fairly common bioinformatics programming task. However, verifying correct behavior of software with a random component can be challenging for the uninitiated. Presumably the scientist writing the code would run the software on a handful of small examples and manually check to ensure the output is correct. But it’s not realistic to do this every time the software runs. How can one verify the behavior of such a program in an automated fashion? Alternatively, what if you find a case in which the program produces incorrect output? How do you reproduce that specific case for troubleshooting and testing?

The behavior of a programming language’s random number generator and related features can be predictable with the use of a random seed. Initializing a random number generator with a seed ensures that the same “random” numbers are produced each time. For example, if you have a program that samples 500 paired-end reads from random locations of a 100 kb sequence of genomic DNA, running the program multiple times with the same random seed should produce the exact same 500 reads. You don’t want this kind of behavior in a production end-user environment, but for development and testing this turns out to be very useful.

When it comes to writing research software that has a random component, I would make the following recommendations.

Always include an option to set the random seed. Whether your software is a simple R function, a command-line Python script, or a robust C++ library, the interface should always include an option for the user to set the random seed. This allows them (and you) the ability to reproduce specific cases and troubleshoot or verify the software’s behavior.
Always report the random seed. Whether or not the user provides a specific random seed, reporting the seed used is crucial for reproducing the software’s behavior. When the code does not explicitly set the seed, programming languages will typically use the current system time to set the seed internally, and it’s not always possible to determine the exact value used. Therefore, when the end user does not specify a specific random seed to use, a good approach is to generate a random number, report that random number to the user, and then re-initialize the random number generator using that value as a seed. Subsequent invocations of that program could then reproduce the behavior by using the same seed.

Here is an example: a Python script that does an admittedly trivial task involving random numbers, but demonstrates how to get predictable behavior with random seeds.

#!/usr/bin/env python

import argparse
import random
import sys

# Define command-line interface
parser = argparse.ArgumentParser(description='Generate random integers')
parser.add_argument('-m', '--max', type=int, default=100,
                    help='maximum value; default is 100')
parser.add_argument('-s', '--seed', type=int, help='random seed')
parser.add_argument('n', type=int, help='# of random integers to generate')
args = parser.parse_args()

# Pick a random seed if the user did not provide one
if not args.seed:
    args.seed = random.randint(0, sys.maxint)
# Set and report the random seed
random.seed(args.seed)
print >> sys.stderr, 'Random seed:', args.seed

# Do the trivial task
print [random.randint(1, args.max) for i in range(args.n)]

And here is a demonstration of how the script works.

June 18, 2015

Pandas: querying a data frame and the `SettingWithCopyWarning` message

I’m doing more and more of my data analysis with Python Pandas these days, but I’m far from an expert and still much more comfortable with R. The reason I persist, though, is that Python provides a complete data processing and analysis stack. In the past, I’ve typically used Perl or Python scripts to pre-process data, R to do statistics and visualization, and shell scripts to glue it all together. Python’s rapidly maturing libraries for numerics, statistics, and data analytics, together with an active community and the awesome IPython notebook, make the prospect of doing end-to-end analysis completely in Python quite compelling.

That said, I still have a lot to learn regarding Pandas. Earlier today I was doing some simple operations—taking a subset of a data frame and creating a new column by applying a transformation to another column—and I kept getting a cryptic SettingWithCopyWarning message. Here’s a dummy example that reproduces the message.

>>> # Load data into memory
>>> import pandas
>>> exons = pandas.io.parsers.read_table('test.dat')
>>> exons
                 Species  ID  Length   Context
0        Equus monoceros  e1     162      stop
1          Homo diutinus  e2     111     start
2  Draconis occidentalis  e3      51       cds
3    Basiliscus vulgaris  e4     114  complete
4          Equus pegasus  e5      95       utr
>>>
>>> # Subset the data
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')]
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> import math
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
__main__:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
>>>

I went on to plot the data and there didn’t seem to be any problems. I also followed the link provided in the warning message and read the documentation. If I’m reading things correctly, the warning is issued to inform users when they might be operating on a copy of the original data, rather than the original data itself. In this example, if I make changes to the ungulates variable and want those changes reflected in the original exons variable, then I do not want to be operating on a copy of the data but the original data itself.

However, in my case it wasn’t really a huge concern, so I was able to get rid of the warning message by making my intent to work on a copy of the data explicit.

>>> # Subset the data, but make an explicit copy
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')].copy()
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
>>>
>>> # No errors this time, woohoo!
>>>

If you’re interested in seeing the interactive Python interpreter in action, check out this asciicast.

May 28, 2015

Leading zeros with bash brace expansion

I use bash brace expansion extensively, both in interactive terminal sessions and in shell scripts. The idea is pretty simple: the simple syntax {n..m} is a shortcut to an array containing all of the integers from n to m, inclusive. I’ve used this syntax several times in examples on this blog, including a recent post about task automation in bash.

Today one of my social media acquaintances posted the following on Twitter.

I've only just learned that the shell range syntax {a..b} works with leading zeros! ❯❯❯ echo {8..10} 8 9 10 ❯❯❯ echo {08..10} 08 09 10

— Shaun Jackman (@sjackman) May 28, 2015

The implication is pretty clear: if you want the values in the array to have leading zeros, just add the leading zeros to the first number in the range. I checked that this works for arbitrary ranges, so things like {001..200} and {00001..50000} also work as expected.

This is a very nice feature I didn’t know about before. Even when you’re lucky enough to have a data set that is nicely organized with well-thought-out filenames, it’s a 50/50 chance whether serially numbered files/samples will have leading zeroes or not. This bash feature makes it dead simple to handle both cases trivially!

May 21, 2015

GitHub now renders IPython/Jupyter Notebooks!

I’ve written before about literate programming and how I think this could be a big game changer for transparency and reproducibility in science, especially when it comes to data analysis (vs more traditional software engineering). Well, GitHub announced recently that IPython/Jupyter notebooks stored in GitHub repositories will be rendered, rather than presented as raw JSON text as before. This is a very nice development, making it even easier to share data analysis results with others!

April 21, 2015

Debugging in C

When I break out the GNU debugger (gdb command), it is unwillingly. I am dragged, kicking and screaming inside, to the realization that I have a substantial memory management issue in my C program that no amount of print statements will help me wrap my mind around. And only rarely does gdb actually help me isolate my issue. More often than not, I sink lots of time into that dark abyss, only to return to my print statements and banging my head against the wall until I recognize the problem and necessary solution.

Call me old fashioned, but the vast majority of my debugging is done with print statements. I’d like to think they’re thoughtfully placed to maximize diagnostic power, but honestly it’s usually just a messy process that sometimes generates more confusion before it yields enlightenment. Notwithstanding, this is still the quickest and most effective approach I’ve found to troubleshooting.

I recently came across a small utility that assists with debugging in C. It’s a single header file that will drop in easily to any C project, with a non-restrictive MIT license that permits any use (including commercial) as long as attribution is made to the original author.

What do I like about this little utility? What does it give me over my trusty fprintf statements? Not a whole lot honestly, but what it does provide is nice.

As mentioned before, it’s dead simple to integrate. Just drop it into your includes directory and use it immediately.
It uses some C11 magic to grab and print the variable name for you. More than half of the time I spend writing debugging print statements is spent formatting strings like ID=%s length=%d, so having this handled automagically is a huge potential time saver.
All debug statements can be disabled with a simple #define statement.

Most of the programming I’ve been doing recently has been in Python, so I haven’t integrated this into any active projects yet. But I think I’ll give this a shot for the reasons mentioned above.

March 26, 2015

The fastest darn Fastq decoupling procedure I ever done seen

There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…

The Fastq format is one of the most pervasive formats in biology and bioinformatics as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair in the same position in two paired files—that is, the first record in file 1 and the first record in file 2 is a pair, the second record in file 1 and the second record in file 2 is a pair, and so on.

I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastqc files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.

The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.

I’ve blogged before about how amazingly useful the paste command can be for working with Fastq data. In fact, combining paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) \
    | tr '\t' '\n' \
    > reads-int.fastq

However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the split-paired-reads.py script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses paste and process substitution, but it also takes advantage of the oft-overlooked tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).

paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq

This procedure did the conversion in 2.5 minutes, while I terminated the incomplete split-paired-reads.py procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.

*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.

‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its largest selling point for me. And while shell tools clearly outperformed khmer for a simple use cases, khmer also parses other data formats in both compressed and uncompressed, and probably supports multiline Fastq and broken paired Fastq. And then of course, khmer will integrate very nicely with environments like IPython notebook, whereas integration of shell commands could potentially be a bit more kludgy.

March 23, 2015

sequniq: tools for identifying and managing duplicated sequences

In next-generation sequence data, duplicates of non-biological origin (such as PCR duplicates) are fairly common and can complicate tasks such as assembly and expression quantification, and jeopardize the accuracy of the results. If reads are sampled at random from a genome, it’s very unlikely that two reads will be sampled from exactly the same position. For paired-end data especially, there is essentially no other explanation for two read pairs being identical other than some kind of amplification or other technical artifact. Obviously, it’s important to scan our NGS data for duplicates and remove them as part of our preliminary quality control. This is a pretty simple task, so software solutions should abound, right? Right!?!?

Well, it’s very possible that I’m missing something, but I’ve only been able to find a single tool for removal of duplicate reads (FastUniq, described in this PLoS ONE paper). I did see tools for removal of duplicate read alignments, but that’s not going to help much if you’re working on a de novo assembly. I went ahead and downloaded FastUniq, ran it on my data, and verified the results. The user interface was a bit kludgy, but boy have I seen worse!! At least it was well documented and worked as advertised.

Later, I was working with two different annotated genome assemblies for the same organism, which I had parsed into individual loci based on gene content. I was interested in determining how many loci were strictly identical between the two assemblies. As I started writing code to do this task, I realized how similar it was to the simple task of removing duplicate reads. In duplicate read removal, we store read sequences in memory, and then whenever we encounter a sequence we’ve already seen before we discard it. Now, for identifying identical loci, the task was similar, but instead I wanted to report “duplicates” instead of discarding them¹.

After realizing this, I took a few minutes to extend and generalize my code so that it could handle both use cases. I then tested it on a data set I had just processed with FastUniq. My code ran in 29 minutes, whereas FastUniq had required 69 minutes. This does not include the time required to convert between paired and interleaved data (FastUniq requires paired files, my code requires interleaved), nor does it include the time required to shuffle the FastUniq output². So some simple of-the-cuff Python code was able to outperform the only tool I could find for removing duplicates de novo.

I wrapped up my code as a Python package and posted it on Github. It still needs a bit of polishing, which I may or may not ever get around to doing.

¹Because the latter approach involves storing longer genomic sequences, I wrote the code to store SHA1 hashes of each sequence, rather than the sequence itself. This drastically reduced memory consumption without an appreciable increase in runtime.

²For some reason that I did not understand from my very cursory skimming of the FastUniq paper, the FastUniq tool includes a sorting step. Therefore, the first read of the output is the read with the most consecutive adenines (As) at the beginning of the read. Looking at the output set off all kinds of red flags and alarms in my brain. Anyone who has taken a course on data structures or algorithms knows that weird things can happen performance-wise when unsorted data is expected but sorted data is provided (and all NGS analysis tools expect unsorted data). I would not be comfortable using FastUniq’s output directly without first shuffling the data, which only increases the margin of improvement of my Python code over FastUniq!

March 23, 2015

Task automation with bash and parallel

Nobody gains significant experience in genome informatics without the drudgery of running the same command(s) over and over again on a set of files. Everybody has a different tolerance level for tedious tasks—I’m often surprised at the patience some people have to sit there and manually re-run the same command dozens of times. Regardless of your pain threshold, however, you’ll eventually run into a data set that is simply too large to be analyzed or processed piece by piece. Luckily, there are several approaches to task automation that 1) are very simple to use and 2) are portable across most UNIX and UNIX-like systems such as Linux and Mac OS X. This post will give a brief introduction into some of the techniques I have picked up over time.

The for loop

The easiest way to automate repetitive command-line tasks is with the bash for loop. Assuming you’ve exercised at least a minimal amount of discipline in naming your data files, it should be straightforward to apply the same command to many different data files.

The basic structure of a for loop is as follows.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt; done

If you’re putting the commands in a script (or describing them on a blog 🙂 ) it often increases readability if you split it across multiple lines like so.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt
done

The code in the middle gets executed once for each value specified on the first line. That’s it! That’s a for loop. If you’d like a more long-winded version, here’s the break down.

for i in 1 2 3 4 5: This syntax is used to tell the shell that we want to repeat a command multiple times using multiple values. In this case the values are the integers 1 – 5, but these could be anything: names, sample labels, whatever. For this specific example, we could use bash range notation and replace the numbers with {1..5}, which is especially useful if you have a large number of serially numbered files (i.e. file1.txt, file2.txt, file3.txt, and so on).
do someCommand data$i.fastq > output$i.txt: This is the command we want to run multiple times. In this particular example, the command will be executed 5 times, and each time the variable $i will be filled in with one of the values we specified earlier (first 1, then 2, etc). This example assumes that we have 5 input files: data1.fastq, data2.fastq, ..., data5.fastq, and will create 5 corresponding output files: output1.txt, output2.txt, ..., output5.txt.
done: This keyword indicates the end of the loop.

There are some additional things you might want to consider.

How you name your files can make a big difference. There is no programming trick on earth that can help you if your input files are named like data_1.fastq, data-2.fastq, data3.fastq, data_4.fq, and data_five.fastq. Although it’s easy for us as humans to see what the pattern is, simplifying this so that a computer can easily process it requires you to be consistent with your file naming schemes.
Embedding a variable in a filename or command that has underscores can be problematic, since the underscore is a valid character for bash variables. Let’s take the example from above, but imagine instead the input files are named like data_1_trim.fq, data_2_trim.fq, and so on. We might be tempted to wrap someCommand data_$i_trim.fq > output$i.txt in our for loop, but this wouldn’t work. Bash will interpret $i_trim as a variable, instead of just $i as was intended. If there is ever any possibility of confusion, it’s always better to use explicit syntax and wrap variables in braces like so: someCommand data_${i}_trim.fq > output${i}.txt.

The for loop, revisited

In a regular for loop, each command is executed sequentially. That is, the first iteration is executed, bash waits for it to finish, and then only when it is complete does the loop proceed to run the second iteration (and so on for all iterations). If our loop runs 5 command, and each takes 1 minute, then the regular for loop approach will take 5 minutes to run.

A very simple modification allows you to run all of the iterations simultaneously in parallel, without waiting for the first command to finish. This will allow us to run all 5 commands at once, getting all of our results in 1 minute (instead of 5). This is done by placing the ampersand character after the command in the for loop.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt & done

Here it is again in expanded form.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt &
done

There are a couple of things to consider here as well.

The ampersand character tells bash to run the command in the background. If you are typing these commands directly into your shell you should have no problems with this. However, if you have stored the commands in a script you will need to include a wait command after the loop. This is because bash scripts do not normally wait for background processes to finish before exiting. The wait command forces the bash script to wait for the loop to finish before exiting.
The ampersand provides a very simple and powerful extension of the for loop, but it requires caution. With this approach, each iteration of the loop will spawn a new process, and if the number of spawned processes exceeds the number of available processors/cores, this could lead to performance issues. Only use this approach if you are sure there are more processors on your machine than iterations in your loop.

The GNU parallel command

Although the bash for loop is simple and powerful, there are cases where it doesn’t work too well. This is primarily when you have loops with a large number of iterations and you want to speed up these iterations by using multiple processors, but the number of iterations is much more than the number of processors. For instance, you may have hundreds of samples you want to run some quality control command on, and your desktop has 16 processors. The normal for loop described above is not optimal because it will only use 1 processor at a time. The parallelized for loop described above doesn’t work because it will try to run all of the samples at once, quickly overloading the computer. We need a something that will run the command on all of our hundreds of samples, but only keep 16 processes running at a time. Enter the GNU parallel command.

Let’s continue with the scenario described above, but instead imagine we had 512 input data files instead of 5. Assuming they file names are numbered appropriately, we can process these files, 16 files at a time, with the following command.

parallel --jobs 16 someCommand data{}.fastq '>' output{}.fastq ::: {1..512}

If you take a moment to look at this, it’s very similar to the for loop.

Instead of specifying the iteration values at the beginning (1 2 3 4 5 or just {1..5}), we specify them at the end after the triple colon. Alternatively, parallel can read these from standard input or a file.
Instead of using a loop variable like $i, we simply use empty curly braces {}.

There are a few considerations to note here as well.

Note the file redirect symbol > in quotes. If you do not put this in quotes, it will be interpreted by the shell instead of the parallel command.
The GNU parallel command is relatively recent and is not available on many systems by default. If you’re on Linux, it should be fairly easy to find and install using your system’s package manager (apt-get, yum, and so on). On Mac you can install it using Homebrew.
Some versions of parallel may require you to add the --gnu flag for correct behavior. I have no idea which versions require this or why. Basically, if the command fails right away with an error message try adding or removing the flag.
The parallel command supports multiple arguments per command. This isn’t really helpful for the example discussed above, but check out the man page for parallel, and specifically the description of the -N flag and numbered arguments.

Thanks to oletange for pointing out the parallel file redirection issue.

March 13, 2015

Writing clean Bash code

The shell is a paradox. It is mysterious in its complexity and marvelous in its simplicity. You can learn the bare basics with just of couple hours of tinkering, but becoming a shell master takes years of consistent use, and even then it feels like you learn something new every day. On one hand, it has an ugly syntax, and trying to work with strings and arrays is beyond frustrating. On the other hand, there’s nothing quite like the zen achieved when you stitch together a half dozen shell commands to extract and format your data of interest, creating a pipeline that outperforms any Python or C++ code you could ever write.

Whatever your feelings about the shell in general, or about (the most popular shell) Bash in particular, you cannot deny that it is a rustic language with very poor debugging facilities that makes it very easy to shoot yourself in the foot. If you can avoid writing code in Bash, you should. In fact, as blogger Jeff Atwood says, if you can avoid writing code at all, you should. But if you’re a modern biologist then chances are that sooner or later the shell will be the best tool for a particular job. What do you do then?

Over the last few months I’ve really tried to look for ways to improve my shell code. Here are some of the resources I’ve found.

Defensive Bash Programming: The author of this blog post provides some suggestions for keeping Bash code clean and readable—often a real challenge when stitching together shell commands in a pipeline, for example. He also provides suggestions for organizing the code and using variables carefully to avoid common pitfalls. If I’m writing a simple Bash script with just a few lines, I’d probably ignore most of this advice, but the longer the script gets, the more important it is to take extra steps to be careful and make your intent clear.
Shellcheck: So I was telling the truth when I said there’s no debugger for Bash, but this tool does its best to remedy that. If you paste a shell script into Shellcheck, it will analyze the code and point out potential issues or pitfalls.
Unofficial Bash strict mode: This is probably my favorite. The Perl language has a strict mode that makes writing and debugging Perl code a whole lot less frustrating. Bash has no such strict mode, but there are some things you can do to enforce safer behavior in your bash scripts. This blog suggests placing two lines at the beginning of every Bash script, and then describes why this is a good idea in a variety of scenarios. I haven’t taken his suggestion fully to heart quite yet, but I have begun placing set -eo pipefail at the beginning of most of my Bash scripts, and I have already saved myself tons of heartache and wasted time by doing so.

So the next time you absolutely, positively have to write some shell code, consider referring to these resources to improve the experience!