Virtualization and distributing scientific code

I’ve followed several discussions (spats?) on Twitter recently regarding what quality standards should apply to bioinformatics code (this post and this post provide a decent summary). Relevant concerns include why scientists do not more frequently open-source their code, whether they should, whether there is funding to do so, whether there should be funding to do so, whether doing so is necessary to replicate a computational method, and whether code necessarily needs to be of distribution quality for release.

This is a complicated issue and I definitely don’t have all the answers. However, I want to debunk the claim that complex system requirements (imported modules and libraries, system tweaks, etc.) make installing and using scientific software (for review) more difficult. Ostensibly, yes: for some research code I can see how it could be onerous for a scientist to meticulously describe how to set up a system to be compatible with the software, and equally onerous for a user or reviewer to troubleshoot that procedure.

However, virtualization technology provides an excellent solution to this problem. Why not use VirtualBox to set up a virtual machine with all of your pipeline’s prerequisites and distribute that as a supplement to the publication? (By the way, I had this idea before ENCODE released a virtual machine preloaded with code and data described in the publications…) I can’t imagine this taking more than a few hours, which is nothing compared to the total amount of time it takes to draft, edit, and revise the corresponding manuscript. If a scientist is incapable of going through the steps to install their pipeline’s prerequisites on another machine and re-run the pipeline to verify the expected results, then either the pipeline is way too complex and kludgy to trust, or the scientist lacks the necessary experience. In either case, I don’t feel comfortable drawing conclusions from any results obtained from that pipeline.
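To make the "re-run the pipeline to verify the expected results" step a bit more concrete, here is a minimal sketch of how that check might be automated inside the virtual machine. It assumes the pipeline writes its results as files and that the author ships a table of expected SHA-256 checksums alongside them; the function names and the idea of a checksum table are my own illustration, not part of any particular pipeline.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(output_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the names of outputs that are missing or differ from the expected digest.

    `expected` maps a file name (relative to `output_dir`) to the hex digest
    the author recorded when the pipeline was originally run. This layout is
    a hypothetical convention chosen for the example.
    """
    mismatches = []
    for name, expected_digest in sorted(expected.items()):
        candidate = output_dir / name
        if not candidate.is_file() or sha256_of(candidate) != expected_digest:
            mismatches.append(name)
    return mismatches
```

After re-running the pipeline on the fresh machine, calling `find_mismatches(Path("results"), expected)` with the author’s recorded digests returns an empty list if every output was reproduced byte for byte. That is a strict standard: it only works if the pipeline is deterministic, so outputs that embed timestamps or depend on random seeds would need a looser comparison.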


