# Bioinformatics pipelines and Make

Several weeks ago I saw the following tweet on my Twitter feed.

Any sufficiently complicated bioinformatics pipeline contains an ad hoc, informally-specified reimplementation of Make.

The first thing that came to my mind (besides all the Perl wrappers and Bash scripts I’ve written) was the AllPaths-LG genome assembler (website, paper). I’ve been working for several months on assembling and annotating the genome of a non-model species (a social insect), and AllPaths-LG provided the best results out of all the assemblers we tried (this seems to have been confirmed by some recent work by Steven Salzberg). When I was familiarizing myself with AllPaths-LG, I noted the following in its user manual.

RunAllPathsLG uses the Unix make utility to control the assembly pipeline. It does not call each module itself, but instead creates a special makefile that does. Within RunAllPathsLG each module is defined in terms of its source and target files, and the command line used to call it. A module is only run if its target files don’t exist, or are out of date compared to its source files, or if the command used to call the module has changed. In this way RunAllPathsLG can be run again and again, with different parameters, and only those modules that need to be called will be. This is efficient and ensures that all intermediate files are always correct, regardless of how many times RunAllPathsLG has been called on a particular set of source data and how many times a module fails or aborts partway through.

The case for implementing bioinformatics pipelines with Make is hard to dispute. Just because Make was originally designed for managing complex source code compilation workflows, there is nothing stopping us from using it to manage any workflow of any complexity. A good bioinformatics pipeline should have the inputs and outputs for each module well-defined anyway, and with a bit of experience this should be easy to represent as Make rules (targets, prerequisites, and recipes). Plus, as suggested by the AllPaths-LG manual, if the pipeline needs to be re-run for any reason (whether it prematurely aborted or some of the input data or parameters were modified), Make will only run the commands it needs to. Implementing such features in a scripted pipeline can require significant effort, whereas this is built right in to Make.

Which I think was Luis’ point.

# Virtualization and distributing scientific code

I’ve followed several discussions (spats?) on Twitter recently regarding what quality standards should apply to bioinformatics code (this post and this post provide a decent summary). Relevant concerns include why scientists do not more frequently open-source their code, whether they should, whether there is funding to do so, whether there should be funding to do so, whether doing so is necessary to replicate a computational method, and whether code necessarily needs to be of distribution quality for release.

This is a complicated issue and I definitely don’t have all the answers. However, I want to debunk the claim that complex system requirements (imported modules / libraries, system tweaks, etc) make installing and using scientific software (for review) more difficult. Ostensibly, yes, for some research code I can see how it could be onerous for a scientist to meticulously describe how to set up the system to be compatible with the software, and equally onerous for a user or reviewer to troubleshoot that procedure. However, virtualization technology provides an excellent solution to this problem. Why not use Virtual Box to set up a virtual machine with all of your pipeline’s prerequisites and distribute that as a supplement to the publication? (by the way, I had this idea before ENCODE released a virtual machine preloaded with code and data described in the publications…) I can’t imagine this taking more than a few hours, which is nothing compared to total amount of time it takes to draft, edit, and revise the corresponding manuscript. If a scientist is incapable of going through the steps to install their pipeline’s prerequisites on another machine and re-run the pipeline to verify the expected results, then either the pipeline is way too complex and kludgy to trust, or the scientist lacks the necessary experience. In either case, I don’t feel comfortable drawing conclusions from any results obtained from that pipeline.

# Sharing terminal sessions with ASCII.IO

Last week I stumbled upon a really cool application called ascii.io. It provides a nice mechanism for capturing a terminal session and then subsequent video playback. It is implemented as a Python script which seems to be little more than a glorified keylogger. It records what you type (as well as the timing of each keystroke), as well as what is printed out to the terminal. When you finish recording your session, all of this logged information is pushed to the ASCII.IO server and a unique ID and URL is created for the session just recorded.

Here’s a cute little sample I just created.

So how does this relate to biology? It’s no secret that high-throughput sequencing and other “big data” technologies are changing the way biology is done. Many of the best tools for managing and analyzing biological data are available only through the command line. Familiarity with the command line and the ability to process and analyze data using command-line tools is a skill set that is increasing in demand every day, as evidenced by the number and frequency of introductory bioinformatics seminars/tutorials/workshops coming to a campus near you.

The command line is still a mysterious place for many a biologist. Static text or graphics on a web page can only go so far in explaining what the command line is and how it is used in biology and bioinformatics. A tool like ascii.io can make a boring tutorial come alive, making it much more clear what the user types versus what a command prints to the terminal, etc. The benefit of ascii.io over, say, a more traditional screen video capture program, is that storing the text associated with a terminal session is much less bulky than storing video data–especially at the resolution required to make a terminal session clear in a Flash player.

The ascii.io app is free, open source, light weight, and a balanced part of your next bioinformatics seminar/tutorial/workshop!

# Minted package for LaTeX: syntax highlighting for source code typesetting

In a previous entry, I briefly discussed the benefits of syntax highlighting for text editors. These benefits also apply to source code that is embedded in a web page or a document for distribution. If you use LaTeX to typeset school assignments and/or scientific publications (if not, you should be), then the minted package for LaTeX is excellent for typesetting source code with syntax highlighting in your documents.

Here is an example of a document containing a block of source code that uses minted for syntax highlighting.

\documentclass[letterpaper, 10pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{minted}
\usepackage{hyperref}
\hypersetup
{
colorlinks=false,
pdfborder={0,0,0},
}

\begin{document}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras vitae egestas nunc. Nullam non elit lacinia ante auctor gravida. Quisque id vestibulum dolor. Phasellus fermentum ante id nisi euismod nec ornare lectus fringilla. Praesent metus orci, porttitor vel lobortis eget, scelerisque non magna. In gravida felis vitae urna interdum dignissim. Maecenas tempus dui ut ipsum interdum molestie. Ut et dapibus risus. Etiam tempus, sapien quis feugiat ornare, nibh massa aliquet magna, quis interdum leo lacus eget urna. Proin aliquam lacinia metus, vitae posuere urna luctus nec. Vestibulum vitae lacus non orci placerat adipiscing quis id eros. Ut in enim orci, id pellentesque diam. Nulla dapibus sodales commodo.

\begin{minted}[mathescape,
linenos,
numbersep=5pt,
gobble=2,
frame=lines,
framesep=2mm]{c}
#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
// STEP 1: declare the type of file handler and the read() function
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{
gzFile fp;
kseq_t *seq;
int l;
if (argc == 1) {
fprintf(stderr, "Usage: %s <in.seq>\n", argv[0]);
return 1;
}
fp = gzopen(argv[1], "r"); // STEP 2: open the file handler
seq = kseq_init(fp); // STEP 3: initialize seq
while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence
printf("name: %s\n", seq->name.s);
if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
printf("seq: %s\n", seq->seq.s);
if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
}
printf("return value: %d\n", l);
kseq_destroy(seq); // STEP 5: destroy seq
gzclose(fp); // STEP 6: close the file handler
return 0;
}
\end{minted}

\end{document}


Here is the final typeset product.

For more information about the minted package, see the minted project homepage. Note that minted relies on the Python Pygments syntax highlighter, which can be installed with the following command.

sudo easy_install Pygments


# Easily creating static HTML documents with DokuWiki’s ‘export’ action

I use wikis pretty extensively in my research. I have set one up on my personal machine to serve as an electronic lab notebook, and our research group has another one set up to share notes for group meetings, progress updates, etc. Wikis make it very easy to record and share information without having to write a web page from scratch. On one hand, the basics of using a wiki are simple enough for most anybody to grasp, but on the other hand there are also enough nice features (file uploads, syntax highlighting for embedded code, revision history, etc) to make wikis useful for many data management applications.

This week I discovered that DokuWiki has a built-in export function. This is particularly useful for those that may want a quick and easy way to create a decent-looking static web page. Simply go to any page in the wiki, and then add &do=export_xhtml to the URL (or ?do=export_xhtml if the URL does not already contain a ? symbol).

This generates a web page without any navigational controls that can be downloaded, distributed, and rendered locally without the need for network connectivity. Replace the xhtml with xhtmlbody for just the body of the page (to copy and paste into an existing HTML document), or replace it with raw to get the raw DokuWiki markup.

For more info on the export action and other available actions for DokuWiki, take a look at http://www.dokuwiki.org/devel:action_modes.

# Perl documentation with POD and Pdoc

I just finished writing a Perl module yesterday–it’s an implementation of the interval tree data structure. For code like this (anything that I may ever want to reuse and/or share), I try to include some documentation. In Perl, that usually means a few comment lines here and there. This is sufficient small scripts, but for larger programs or modules there are better alternatives (i.e. Perl’s Plain Old Documentation or POD). I figured this new module be a great chance for me to try out some new documentation approaches.

Perhaps the biggest reason I wanted to try out POD is that you can process POD-documented modules and produce nice HTML files. I use the BioPerl docs quite often (here is an example), and I wanted to create something like that for my module. I knew little about POD, so I figured I would look at the source code of several familiar BioPerl modules on my system. Surely enough, the files included the POD, so I used this as an example of how to document my module. All in all, it took less than an hour to add some comprehensive documentation. Here is a sample of the code.


=head2 find

Title   : find
Usage   : my @overlapping_intervals = $itree->find( 27000, 310000 ); Function: Finds all of the intervals in the tree that overlap with the given interval Returns : An array of interval objects Args : Two integers, representing the start and end positions of an interval =cut sub find { my($self, $start,$end) = @_;
my @overlapping = ();

# Grab overlapping intervals from this node
foreach my $interval(@{$self->{intervals}})
{
push(@overlapping, $interval) if($interval->end() >= $start and$interval->start() <= $end); } # Grab overlapping intervals from the left subtree if( exists($self->{left}) and $start <=$self->{center} )
{
@overlapping = (@overlapping, $self->{left}->find($start, $end)); } # Grab overlapping intervals from the right subtree if( exists($self->{right}) and $end >=$self->{center} )
{
@overlapping = (@overlapping, $self->{right}->find($start, \$end));
}

return @overlapping;
}

1;



Once I was finished, I ran pod2html on the module source code and got this (source here). Definitely not what I was expecting. I mean, the BioPerl docs aren’t exactly the face of beauty themselves, but this just seemed…elementary to me.

I asked the BioPerl mailing list what tool they used for their documentation, and they referred me to the Pdoc package. Downloading the package from their site and installing it on my MacBook was pretty simple.

cd /usr/local/
sudo mv ~/Downloads/pdoc-1.1.tar.gz .
sudo tar xzf pdoc-1.1.tar.gz
cd pdoc-1.1
sudo perl Makefile.PL
sudo make
sudo make install


Running the tool was pretty simple as well. It has many command-line options for customization, but I used the bare minimum: the -source option for indicating the directory where my module was located and the -target option for indicating where to write generate output files.

cd
mkdir html_docs_temp
perl /usr/local/pdoc-1.1/scripts/perlmod2www.pl -source ~/interval_tree_temp -target ~/html_docs_temp


This created an entire documentation browser at ~/html_docs_temp/index.html. It definitely looks much better than the previous example, and while the browser idea is nice (I will be using it in the very near future), all I really needed for this simple example was the IntervalTree.html and the perl.css file. Here are the results (source here).