Several weeks ago I saw the following tweet on my Twitter feed.
Any sufficiently complicated bioinformatics pipeline contains an ad hoc, informally-specified reimplementation of Make.
The first thing that came to my mind (besides all the Perl wrappers and Bash scripts I’ve written) was the AllPaths-LG genome assembler (website, paper). I’ve been working for several months on assembling and annotating the genome of a non-model species (a social insect), and AllPaths-LG provided the best results out of all the assemblers we tried (this seems to have been confirmed by some recent work by Steven Salzberg). When I was familiarizing myself with AllPaths-LG, I noted the following in its user manual.
RunAllPathsLG uses the Unix make utility to control the assembly pipeline. It does not call each module itself, but instead creates a special makefile that does. Within RunAllPathsLG each module is defined in terms of its source and target files, and the command line used to call it. A module is only run if its target files don’t exist, or are out of date compared to its source files, or if the command used to call the module has changed. In this way RunAllPathsLG can be run again and again, with different parameters, and only those modules that need to be called will be. This is efficient and ensures that all intermediate files are always correct, regardless of how many times RunAllPathsLG has been called on a particular set of source data and how many times a module fails or aborts partway through.
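To make the rule in that excerpt concrete, here is a minimal Python sketch of the three-part decision it describes: rerun a module if a target is missing, if a target is older than a source, or if the command has changed. This is only an illustration of the logic, not code from AllPaths-LG or from Make. The function names and the idea of recording the previous command in a side file (`stamp_file`) are my own assumptions.

```python
from pathlib import Path


def needs_rerun(targets, sources, command, stamp_file):
    """Return True if a pipeline module should be run again.

    Follows the three conditions in the manual excerpt. `stamp_file` is a
    hypothetical side file holding the command used for the previous run.
    """
    targets = [Path(t) for t in targets]
    sources = [Path(s) for s in sources]
    stamp = Path(stamp_file)

    # Condition 1: at least one target file does not exist.
    if any(not t.exists() for t in targets):
        return True

    # Condition 2: the oldest target is older than the newest source.
    # Missing sources are skipped here; a real tool would report an error.
    source_times = [s.stat().st_mtime for s in sources if s.exists()]
    if targets and source_times:
        oldest_target = min(t.stat().st_mtime for t in targets)
        if oldest_target < max(source_times):
            return True

    # Condition 3: no command was recorded, or it differs from this one.
    if not stamp.exists() or stamp.read_text() != command:
        return True

    return False


def record_run(command, stamp_file):
    """Remember the command that produced the current targets."""
    Path(stamp_file).write_text(command)
```

A driver script would call `needs_rerun(...)` before each module, run the module only when it returns `True`, and call `record_run(...)` after the module succeeds. Plain Make handles the first two conditions on its own using file timestamps. The third one (a changed command) is extra bookkeeping on top of that.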
The case for implementing bioinformatics pipelines with Make is hard to dispute. Although Make was originally designed for managing complex source code compilation workflows, nothing stops us from using it to manage any workflow of any complexity. A good bioinformatics pipeline should have well-defined inputs and outputs for each module anyway, and with a bit of experience these should be easy to represent as Make rules (targets, prerequisites, and recipes). Plus, as suggested by the AllPaths-LG manual, if the pipeline needs to be re-run for any reason (whether it prematurely aborted or some of the input data or parameters were modified), Make will only run the commands it needs to. Implementing such features in a scripted pipeline can require significant effort, whereas this is built right into Make.
Which I think was Luis’ point.