Bioinformatics pipelines and Make

Several weeks ago I saw the following tweet on my Twitter feed.

Any sufficiently complicated bioinformatics pipeline contains an ad hoc, informally-specified reimplementation of Make.

The first thing that came to my mind (besides all the Perl wrappers and Bash scripts I’ve written) was the AllPaths-LG genome assembler (website, paper). I’ve been working for several months on assembling and annotating the genome of a non-model species (a social insect), and AllPaths-LG provided the best results out of all the assemblers we tried (this seems to have been confirmed by some recent work by Steven Salzberg). When I was familiarizing myself with AllPaths-LG, I noted the following in its user manual.

RunAllPathsLG uses the Unix make utility to control the assembly pipeline. It does not call each module itself, but instead creates a special makefile that does. Within RunAllPathsLG each module is defined in terms of its source and target files, and the command line used to call it. A module is only run if its target files don’t exist, or are out of date compared to its source files, or if the command used to call the module has changed. In this way RunAllPathsLG can be run again and again, with different parameters, and only those modules that need to be called will be. This is efficient and ensures that all intermediate files are always correct, regardless of how many times RunAllPathsLG has been called on a particular set of source data and how many times a module fails or aborts partway through.
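To make the rule in that passage concrete, here is a minimal Python sketch of the timestamp check it describes: a step is re-run if any of its target files is missing or is older than any of its source files. This is only an illustration of the idea, not how Make or RunAllPathsLG is actually implemented. The function name `needs_rebuild` and the file names `reads.fastq` and `contigs.fasta` are hypothetical, and the sketch leaves out the third condition the manual mentions (a changed command line).

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(targets, sources):
    """Return True if any target is missing or older than any source.

    Hypothetical helper that mimics the timestamp comparison only;
    it does not track changes to the command itself.
    """
    targets = [Path(t) for t in targets]
    sources = [Path(s) for s in sources]
    # A missing target always triggers a rebuild.
    if any(not t.exists() for t in targets):
        return True
    # With no sources, an existing target is considered up to date.
    if not sources:
        return False
    oldest_target = min(t.stat().st_mtime for t in targets)
    newest_source = max(s.stat().st_mtime for s in sources)
    # Stale only if some source is strictly newer than some target.
    return newest_source > oldest_target


def demo():
    """Walk through the three situations using made-up file names."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        reads = Path(tmp) / "reads.fastq"      # hypothetical input
        contigs = Path(tmp) / "contigs.fasta"  # hypothetical output
        reads.write_text("@read1\nACGT\n+\nIIII\n")
        os.utime(reads, (1000, 1000))

        # 1. Target does not exist yet.
        results.append(needs_rebuild([contigs], [reads]))

        # 2. Target exists and is newer than its source.
        contigs.write_text(">contig1\nACGT\n")
        os.utime(contigs, (2000, 2000))
        results.append(needs_rebuild([contigs], [reads]))

        # 3. Source is modified after the target was built.
        os.utime(reads, (3000, 3000))
        results.append(needs_rebuild([contigs], [reads]))
    return results


if __name__ == "__main__":
    print(demo())
```

The demo sets modification times explicitly with `os.utime` so the comparison does not depend on the filesystem's timestamp resolution.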

The case for implementing bioinformatics pipelines with Make is hard to dispute. Although Make was originally designed to manage complex source code compilation workflows, nothing stops us from using it to manage other kinds of workflows. A good bioinformatics pipeline should have well-defined inputs and outputs for each module anyway, and with a bit of experience these are easy to express as Make rules (targets, prerequisites, and recipes). Plus, as the AllPaths-LG manual suggests, if the pipeline needs to be re-run for any reason (whether it aborted prematurely or some of the input data or parameters were modified), Make will run only the commands it needs to. Implementing such features in a scripted pipeline can require significant effort, whereas they are built right into Make.

Which I think was Luis’ point.
