Reproducibility on a technical level vs a conceptual level

I am and have been a huge proponent of reproducibility in scientific computation for most of my training. I think that reproducibility is important both on a technical level and on a conceptual level. Reproducibility at a technical level—what many call replicability—is the ability successfully generate a similar/identical output given the source code and data. Reproducibility at the conceptual level—what we often refer to as reproducibility in a more general sense—is being able to get a consistent result using different methods (e.g. re-implement the logic in your own software), different data (e.g. do your own data collection on the subject of interest), or both. The latter—reproducibility at the conceptual level—is the ultimate test, and gives much stronger support of a theory/model/result than replicability.

In some ways, reproducing a computational result in biology is much easier than reproducing an experimental result. Computers are just machines* that execute sets of instructions (*gross oversimplification noted), and assuming I provide the same instructions (source code) and input (data) to another scientist, it should be trivial for them to exactly and precisely reproduce a result I previously computed.

However, I speak of an ideal world. In the real world, there are myriad technical issues: different operating systems; different versions of programming languages and software libraries; different versions of scientific software packages; poor workflow documentation; limited computing literacy; and a variety of other issues that make it difficult to reproduce a computation. In this way, computational biology is a lot more like experimental biology: getting the “right” result depends highly on your environment or setup, your experience, and in some cases having some secret sauce (running commands/programs in the right order, using certain settings or parameter values, etc). In this way, computational biology is a lot more like experimental biology than we let on sometimes.

It doesn’t have to be this way. But it is.

There is a great discussion going on in the social media right now about what should be expected of academic scientists producing software in support of their research. I hope to contribute a dedicated blog post to this discussion soon, but it is still very nascent and I would like to let my thoughts marinate for a bit before I dive in. Several points are clear, however, that are directly relevant to the discussion of replicability and reproducibility.

  • Replicability in computation supports reproducibility. I’m not sure of anyone that disagrees with this. My impression is that most disagreements are focused on what can reasonably be expected of scientists given the current incentive structure.
  • Being unable to replicate a computational study isn’t the end of the world for academic science: important models and theories shouldn’t rely on a single result anyway. But the lack of computational replicability does make the social enterprise of academic science, already an expensive and inefficient ordeal under constant critical scrutiny, even more expensive and inefficient. Facilitating replicability in computation would substantially lower the activation energy required to achieve the kind of real reproducibility that matters in the long run.
  • There are many academics in the life sciences at all levels (undergrad to grad student to postdoc to PI) that are dealing with more data and computation right now than their training has ever prepared them to deal with.
  • A little bit of training can go a long way to facilitating more computational replicability in academic science. Check out what Software Carpentry is doing. Training like this may not benefit cases in which scientists deliberately obfuscate their methods to cover their tracks or to maintain competitive advantage, but it will definitely benefit cases in which a result is difficult to replicate simply because a well-intentioned scientist had little exposure to computing best practices. (Full disclosure: I am a certified instructor for Software Carpentry, although I receive no compensation for my affiliation or participation. I’m just a huge proponent of what they do!)

Debugging in C

When I break out the GNU debugger (gdb command), it is unwillingly. I am dragged, kicking and screaming inside, to the realization that I have a substantial memory management issue in my C program that no amount of print statements will help me wrap my mind around. And only rarely does gdb actually help me isolate my issue. More often than not, I sink lots of time into that dark abyss, only to return to my print statements and banging my head against the wall until I recognize the problem and necessary solution.

Call me old fashioned, but the vast majority of my debugging is done with print statements. I’d like to think they’re thoughtfully placed to maximize diagnostic power, but honestly it’s usually just a messy process that sometimes generates more confusion before it yields enlightenment. Notwithstanding, this is still the quickest and most effective approach I’ve found to troubleshooting.

I recently came across a small utility that assists with debugging in C. It’s a single header file that will drop in easily to any C project, with a non-restrictive MIT license that permits any use (including commercial) as long as attribution is made to the original author.

What do I like about this little utility? What does it give me over my trusty fprintf statements? Not a whole lot honestly, but what it does provide is nice.

  • As mentioned before, it’s dead simple to integrate. Just drop it into your includes directory and use it immediately.
  • It uses some C11 magic to grab and print the variable name for you. More than half of the time I spend writing debugging print statements is spent formatting strings like ID=%s length=%d, so having this handled automagically is a huge potential time saver.
  • All debug statements can be disabled with a simple #define statement.

Most of the programming I’ve been doing recently has been in Python, so I haven’t integrated this into any active projects yet. But I think I’ll give this a shot for the reasons mentioned above.

Small data sets for bioinformatics instruction

This semester I am teaching a computational genomics class for the third time. I’ve learned a lot from the previous iterations of the course, but one thing we’ve struggled with each year is the time/focus balance between technical tasks (such as What are the mechanics of running this genome assembler? or How can I execute this command on all of my sequence data files?) and science tasks (such as How do I evaluate the quality of this gene annotation?). Obviously in a class like this, answering any science question requires some technical work, and unfortunately most of the students have very little training in basic data and computing literacy.

One way I have attempted to improve the situation this semester is to provide small exercises for each unit that involve small data sets that run very quickly. I have done this to some extent in previous years, but I have been much more consistent and intentional this semester, encoding it in the formal class structure. The idea is that each unit has an exercise and an assignment. The exercise is intended to be executed by rote, and is there to give each student immediate feedback on whether they are doing things correctly. Only when they are comfortable completing the exercise (and understanding the purpose of each step) can the student proceed to complete the full assignment, which involves larger data sets and more critical thinking and intellectual input on their part.

What I wanted to highlight in this post, however, are the tools I use for creating small data sets for preliminary exercises.

  • wgsim: This package is the little read simulator that could. It does one thing, and it does it extremely well. The code hasn’t been updated since 2011, but it’s written in C and compiles nicely on pretty much any UNIX system. I’ve found simulating reads to be very helpful for mapping exercises, although they could also be used for assembly as well. I essentially selected a 150 kb region of a genome of interest and used wgsim to simulate something like 10k-100k paired-end reads. This amount of data is enough to create some read pileups so that tview analysis isn’t completely boring, but it’s a small enough data set that bwa or bowtie will run in seconds.
  • This script from the khmer package performs a “digital normalization” procedure to discard redundant data from regions of high-coverage. The 10-second elevator pitch for diginorm is that it removes data without removing information. It is not uncommon for this procedure to discard 75-90% of the data in a Fastq file, which will allow assembly jobs to run much more quickly and in much less memory.
  • This script, also part of khmer, uses reservoir sampling to pull out a random subset of reads from a Fasta/Fastq file. If you have done digital normalization and the data set is still not small enough, I would recommend using this script to discard even more data. Of course, this type of data reduction is (intentionally) lossy, so any downstream interpretations (if any) will have to be adjusted.

Running bioinformatics can often be challenging, especially for the uninitiated. Critically evaluating bioinformatics software is even harder, but I have found these tools to be invaluable resources for both teaching and research purposes.

In silico empirical estimation of paired-end insert size

I’m no expert in library preparation for DNA sequencing, but as I understand it, protocols for paired-end and mate-pair sequencing involve a size selection step that selects DNA fragments within a tight target size range. By controlling the size of sequenced DNA fragments (or “inserts”, borrowing terminology from the days when DNA fragments were actually inserted into vectors and clonally amplified), assembly and alignment software are better able to place reads in their correct locations.

Size selection is not perfect though, so even if you select for, say, 500 bp inserts, you’ll end up with a range of insert sizes. To accommodate this, assemblers and read aligners typically ask users to provide a mean and standard deviation for the insert size. What values do you provide? How wide is the length distribution?

I’ll be frankly honest and admit that for most of my career, I’ve simply provided the target insert size as the mean and a “reasonable” value for the standard deviation (please don’t ask me to define “reasonable” 🙂 ). There are, however, better ways to do this. Hopefully your sequencing center used gel electrophoresis to confirm that the size selection step didn’t fail miserably, and the position of the band in that gel, relative to a ladder, should make the mean insert size pretty clear.

But what if your sequencing center did not provide you with this information, or what if you want a second opinion? Are there any bioinformatics approaches that can help?

Sure! We can empirically determine the distribution of insert sizes by aligning the reads to the genome sequence and observing the outer distance of the paired read alignments. The protocol might look something like this.

# Align paired-end reads with BWA
bwa index genome.fa
bwa mem genome.fa reads1.fq reads2.fq > align.sam

# Convert the alignments to BAM format and sort with SAMtools
samtools view -bS align.sam | samtools sort -o align-sorted -

# Collect insert size metrics with Picard
java -jar /path/to/picard.jar CollectInsertSizeMetrics \
    HISTOGRAM_FILE=inserts-hist.pdf \
    INPUT=align-sorted.bam \

# Graphic of the read length distribution will be in `inserts-hist.pdf`.

But wait, you object, I’m doing a de novo genome assembly. You’re telling me that I have to have my genome sequence to determine what insert size I should use for reconstructing my genome sequence?!?!

Yeah, I know. Assembling the genome and mapping the reads to that genome assembly both require you to specify insert sizes, so it’s a bit of a chicken-and-egg problem. However, genome assembly is hard, and it’s unlikely that you’ll get it right on your first try. Assuming that your first assembly attempt isn’t a complete and utter failure, you should be able to use that as a reference to align your reads and determine the insert size distribution, which you can then use for subsequent assemblies.

In fact, I would recommend doing this as soon as possible after receiving your data to ensure there weren’t big issues with the sequencing. If your 10 kb mate pair library turns out to be mostly 300 bp inserts, and you report this to your sequencing center a month after they give you your data, chances are they can do something about it. However, if it takes you 2 years to assemble and annotate the genome, and only then you check this, there’s probably nothing that can be done.

The fastest darn Fastq decoupling procedure I ever done seen

There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…

The Fastq format is one of the most pervasive formats in biology and bioinformatics as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair in the same position in two paired files—that is, the first record in file 1 and the first record in file 2 is a pair, the second record in file 1 and the second record in file 2 is a pair, and so on.

I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastqc files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.

The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.

I’ve blogged before about how amazingly useful the paste command can be for working with Fastq data. In fact, combining paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) \
    | tr '\t' '\n' \
    > reads-int.fastq

However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses paste and process substitution, but it also takes advantage of the oft-overlooked tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).

paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq

This procedure did the conversion in 2.5 minutes, while I terminated the incomplete procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.

*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.

‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its largest selling point for me. And while shell tools clearly outperformed khmer for a simple use cases, khmer also parses other data formats in both compressed and uncompressed, and probably supports multiline Fastq and broken paired Fastq. And then of course, khmer will integrate very nicely with environments like IPython notebook, whereas integration of shell commands could potentially be a bit more kludgy.

sequniq: tools for identifying and managing duplicated sequences

In next-generation sequence data, duplicates of non-biological origin (such as PCR duplicates) are fairly common and can complicate tasks such as assembly and expression quantification, and jeopardize the accuracy of the results. If reads are sampled at random from a genome, it’s very unlikely that two reads will be sampled from exactly the same position. For paired-end data especially, there is essentially no other explanation for two read pairs being identical other than some kind of amplification or other technical artifact. Obviously, it’s important to scan our NGS data for duplicates and remove them as part of our preliminary quality control. This is a pretty simple task, so software solutions should abound, right? Right!?!?

Well, it’s very possible that I’m missing something, but I’ve only been able to find a single tool for removal of duplicate reads (FastUniq, described in this PLoS ONE paper). I did see tools for removal of duplicate read alignments, but that’s not going to help much if you’re working on a de novo assembly. I went ahead and downloaded FastUniq, ran it on my data, and verified the results. The user interface was a bit kludgy, but boy have I seen worse!! At least it was well documented and worked as advertised.

Later, I was working with two different annotated genome assemblies for the same organism, which I had parsed into individual loci based on gene content. I was interested in determining how many loci were strictly identical between the two assemblies. As I started writing code to do this task, I realized how similar it was to the simple task of removing duplicate reads. In duplicate read removal, we store read sequences in memory, and then whenever we encounter a sequence we’ve already seen before we discard it. Now, for identifying identical loci, the task was similar, but instead I wanted to report “duplicates” instead of discarding them1.

After realizing this, I took a few minutes to extend and generalize my code so that it could handle both use cases. I then tested it on a data set I had just processed with FastUniq. My code ran in 29 minutes, whereas FastUniq had required 69 minutes. This does not include the time required to convert between paired and interleaved data (FastUniq requires paired files, my code requires interleaved), nor does it include the time required to shuffle the FastUniq output2. So some simple of-the-cuff Python code was able to outperform the only tool I could find for removing duplicates de novo.

I wrapped up my code as a Python package and posted it on Github. It still needs a bit of polishing, which I may or may not ever get around to doing.

1Because the latter approach involves storing longer genomic sequences, I wrote the code to store SHA1 hashes of each sequence, rather than the sequence itself. This drastically reduced memory consumption without an appreciable increase in runtime.

2For some reason that I did not understand from my very cursory skimming of the FastUniq paper, the FastUniq tool includes a sorting step. Therefore, the first read of the output is the read with the most consecutive adenines (As) at the beginning of the read. Looking at the output set off all kinds of red flags and alarms in my brain. Anyone who has taken a course on data structures or algorithms knows that weird things can happen performance-wise when unsorted data is expected but sorted data is provided (and all NGS analysis tools expect unsorted data). I would not be comfortable using FastUniq’s output directly without first shuffling the data, which only increases the margin of improvement of my Python code over FastUniq!

Task automation with bash and parallel

Nobody gains significant experience in genome informatics without the drudgery of running the same command(s) over and over again on a set of files. Everybody has a different tolerance level for tedious tasks—I’m often surprised at the patience some people have to sit there and manually re-run the same command dozens of times. Regardless of your pain threshold, however, you’ll eventually run into a data set that is simply too large to be analyzed or processed piece by piece. Luckily, there are several approaches to task automation that 1) are very simple to use and 2) are portable across most UNIX and UNIX-like systems such as Linux and Mac OS X. This post will give a brief introduction into some of the techniques I have picked up over time.

The for loop

The easiest way to automate repetitive command-line tasks is with the bash for loop. Assuming you’ve exercised at least a minimal amount of discipline in naming your data files, it should be straightforward to apply the same command to many different data files.

The basic structure of a for loop is as follows.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt; done

If you’re putting the commands in a script (or describing them on a blog 🙂 ) it often increases readability if you split it across multiple lines like so.

for i in 1 2 3 4 5
  someCommand data$i.fastq > output$i.txt

The code in the middle gets executed once for each value specified on the first line. That’s it! That’s a for loop. If you’d like a more long-winded version, here’s the break down.

  1. for i in 1 2 3 4 5: This syntax is used to tell the shell that we want to repeat a command multiple times using multiple values. In this case the values are the integers 1 – 5, but these could be anything: names, sample labels, whatever. For this specific example, we could use bash range notation and replace the numbers with {1..5}, which is especially useful if you have a large number of serially numbered files (i.e. file1.txt, file2.txt, file3.txt, and so on).
  2. do someCommand data$i.fastq > output$i.txt: This is the command we want to run multiple times. In this particular example, the command will be executed 5 times, and each time the variable $i will be filled in with one of the values we specified earlier (first 1, then 2, etc). This example assumes that we have 5 input files: data1.fastq, data2.fastq, ..., data5.fastq, and will create 5 corresponding output files: output1.txt, output2.txt, ..., output5.txt.
  3. done: This keyword indicates the end of the loop.

There are some additional things you might want to consider.

  • How you name your files can make a big difference. There is no programming trick on earth that can help you if your input files are named like data_1.fastq, data-2.fastq, data3.fastq, data_4.fq, and data_five.fastq. Although it’s easy for us as humans to see what the pattern is, simplifying this so that a computer can easily process it requires you to be consistent with your file naming schemes.
  • Embedding a variable in a filename or command that has underscores can be problematic, since the underscore is a valid character for bash variables. Let’s take the example from above, but imagine instead the input files are named like data_1_trim.fq, data_2_trim.fq, and so on. We might be tempted to wrap someCommand data_$i_trim.fq > output$i.txt in our for loop, but this wouldn’t work. Bash will interpret $i_trim as a variable, instead of just $i as was intended. If there is ever any possibility of confusion, it’s always better to use explicit syntax and wrap variables in braces like so: someCommand data_${i}_trim.fq > output${i}.txt.

The for loop, revisited

In a regular for loop, each command is executed sequentially. That is, the first iteration is executed, bash waits for it to finish, and then only when it is complete does the loop proceed to run the second iteration (and so on for all iterations). If our loop runs 5 command, and each takes 1 minute, then the regular for loop approach will take 5 minutes to run.

A very simple modification allows you to run all of the iterations simultaneously in parallel, without waiting for the first command to finish. This will allow us to run all 5 commands at once, getting all of our results in 1 minute (instead of 5). This is done by placing the ampersand character after the command in the for loop.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt & done

Here it is again in expanded form.

for i in 1 2 3 4 5
  someCommand data$i.fastq > output$i.txt &

There are a couple of things to consider here as well.

  • The ampersand character tells bash to run the command in the background. If you are typing these commands directly into your shell you should have no problems with this. However, if you have stored the commands in a script you will need to include a wait command after the loop. This is because bash scripts do not normally wait for background processes to finish before exiting. The wait command forces the bash script to wait for the loop to finish before exiting.
  • The ampersand provides a very simple and powerful extension of the for loop, but it requires caution. With this approach, each iteration of the loop will spawn a new process, and if the number of spawned processes exceeds the number of available processors/cores, this could lead to performance issues. Only use this approach if you are sure there are more processors on your machine than iterations in your loop.

The GNU parallel command

Although the bash for loop is simple and powerful, there are cases where it doesn’t work too well. This is primarily when you have loops with a large number of iterations and you want to speed up these iterations by using multiple processors, but the number of iterations is much more than the number of processors. For instance, you may have hundreds of samples you want to run some quality control command on, and your desktop has 16 processors. The normal for loop described above is not optimal because it will only use 1 processor at a time. The parallelized for loop described above doesn’t work because it will try to run all of the samples at once, quickly overloading the computer. We need a something that will run the command on all of our hundreds of samples, but only keep 16 processes running at a time. Enter the GNU parallel command.

Let’s continue with the scenario described above, but instead imagine we had 512 input data files instead of 5. Assuming they file names are numbered appropriately, we can process these files, 16 files at a time, with the following command.

parallel --jobs 16 someCommand data{}.fastq '>' output{}.fastq ::: {1..512}

If you take a moment to look at this, it’s very similar to the for loop.

  • Instead of specifying the iteration values at the beginning (1 2 3 4 5 or just {1..5}), we specify them at the end after the triple colon. Alternatively, parallel can read these from standard input or a file.
  • Instead of using a loop variable like $i, we simply use empty curly braces {}.

There are a few considerations to note here as well.

  • Note the file redirect symbol > in quotes. If you do not put this in quotes, it will be interpreted by the shell instead of the parallel command.
  • The GNU parallel command is relatively recent and is not available on many systems by default. If you’re on Linux, it should be fairly easy to find and install using your system’s package manager (apt-get, yum, and so on). On Mac you can install it using Homebrew.
  • Some versions of parallel may require you to add the --gnu flag for correct behavior. I have no idea which versions require this or why. Basically, if the command fails right away with an error message try adding or removing the flag.
  • The parallel command supports multiple arguments per command. This isn’t really helpful for the example discussed above, but check out the man page for parallel, and specifically the description of the -N flag and numbered arguments.

Thanks to oletange for pointing out the parallel file redirection issue.

Validating genome annotations revisited: gt speck

I have written previously (here and here) about my frustrations with the lack of validation tools for GFF3 data. I had even begun prototyping my own system for writing schemas that could be used to validate GFF3 files (much like XML schemas used in validating XML files). I hadn’t gotten very far, though, before the talented folks behind GenomeTools released the gt speck tool. The speck tool uses a domain-specific language where individuals, users, groups, or communities of practice can explicitly enforce a set of rules regarding how annotations should be formatted, how features should be related, and which attributes are required. Most scientists and software have requirements beyond what is detailed in the GFF3 specification, but in most cases these are implicit and inconsistently enforced. I look forward to integrating the speck tool into my annotation quality control workflow, and I have great faith that the genomics community has a lot of benefit to gain from wide adoption of tools like this.

Writing clean Bash code

The shell is a paradox. It is mysterious in its complexity and marvelous in its simplicity. You can learn the bare basics with just of couple hours of tinkering, but becoming a shell master takes years of consistent use, and even then it feels like you learn something new every day. On one hand, it has an ugly syntax, and trying to work with strings and arrays is beyond frustrating. On the other hand, there’s nothing quite like the zen achieved when you stitch together a half dozen shell commands to extract and format your data of interest, creating a pipeline that outperforms any Python or C++ code you could ever write.

Whatever your feelings about the shell in general, or about (the most popular shell) Bash in particular, you cannot deny that it is a rustic language with very poor debugging facilities that makes it very easy to shoot yourself in the foot. If you can avoid writing code in Bash, you should. In fact, as blogger Jeff Atwood says, if you can avoid writing code at all, you should. But if you’re a modern biologist then chances are that sooner or later the shell will be the best tool for a particular job. What do you do then?

Over the last few months I’ve really tried to look for ways to improve my shell code. Here are some of the resources I’ve found.

  • Defensive Bash Programming: The author of this blog post provides some suggestions for keeping Bash code clean and readable—often a real challenge when stitching together shell commands in a pipeline, for example. He also provides suggestions for organizing the code and using variables carefully to avoid common pitfalls. If I’m writing a simple Bash script with just a few lines, I’d probably ignore most of this advice, but the longer the script gets, the more important it is to take extra steps to be careful and make your intent clear.
  • Shellcheck: So I was telling the truth when I said there’s no debugger for Bash, but this tool does its best to remedy that. If you paste a shell script into Shellcheck, it will analyze the code and point out potential issues or pitfalls.
  • Unofficial Bash strict mode: This is probably my favorite. The Perl language has a strict mode that makes writing and debugging Perl code a whole lot less frustrating. Bash has no such strict mode, but there are some things you can do to enforce safer behavior in your bash scripts. This blog suggests placing two lines at the beginning of every Bash script, and then describes why this is a good idea in a variety of scenarios. I haven’t taken his suggestion fully to heart quite yet, but I have begun placing set -eo pipefail at the beginning of most of my Bash scripts, and I have already saved myself tons of heartache and wasted time by doing so.

So the next time you absolutely, positively have to write some shell code, consider referring to these resources to improve the experience!

Literate programming, RStudio, and IPython Notebook

A substantial portion of scientific software is written by scientists with little or no formal training in software engineering. As a result, many scripts and programs are untested, unreliable, a huge pain to install or configure, and unlikely to be accessible for more than a few months or years. One does not need to be a software engineer to write good, useful software. For example, simply adding a reasonably descriptive command-line interface to a script or program does not take much work, and can go a long way in making your program easy to use. Placing the software under version control and adding automated tests also provide a huge benefit without a significant amount of additional effort. An aside: I was recently certified as an instructor for Software Carpentry, which provides excellent 2-day workshops introducing these topics at a beginner level to scientists.

My point is that if you’re writing software to be distributed amongst collaborators and colleagues, there are well established software engineering practices to make this a painless experience for everyone involved, and learning the basics doesn’t require much work.

However, a lot of scientific computing does not involve writing scripts or programs to be used by a wide audience. Often, you need to explore the data. You want to compute a variety of statistics and visualize data distributions. You need to download other data sets for comparison, and you feverishly take notes to make sure you can repeat the analysis later if needed—and let’s just be honest here, of course you’re going to have to redo it later! Exploratory data analysis is a tedious and frustratingly messy process. We often have to throw everything we know about sound software engineering out the window in the name of simply getting crap done. And even if you are able to organize your data and scripts and notes just so, sharing this with colleagues and disclosing details for publication can also be very messy.

For over a year I had been hearing a lot about RStudio and IPython Notebook, but it wasn’t until I finally tried them a couple of months ago that I realized how big of a game changer these tools are for data analysis. Both facilitate a kind of literate programming where natural language (like English prose) can be interspersed among source code and the output produced by executing that code. This facilitates placing an entire analysis in a single easy-to-edit document, which renders beautifully and seamlessly with formatted text, blocks of syntax-highlighted source code, and code output, including both text and graphics. These individual documents are easily placed under version control and shared with colleagues via email, GitHub, or online hosting services like RPubs and IPyNb Viewer.

So how do RStudio and IPython Notebook work? RStudio provides an integrated development environment (IDE) for R programming. In that environment, you can create an Rmd (R Markdown) file for your analysis. You can add text to describe your analysis, and Markdown allows you to format the text with headers, bold, italics, hyperlinks, and so on. Any code that you want to execute is placed in code blocks, and then when you want to execute that code you click the “knit HTML” button. This will create an HTML document with your formatted text, your syntax-highlighted code, and any output produced by the code (whether it be text or a graphic).

The IPython Notebook is pretty similar, except that the notebook runs in your web browser instead of an IDE. Perhaps the biggest functional difference between the two is that in IPython Notebook, you can run and re-run individual code blocks independently and see the results. This is helpful if you want to rapidly test small changes without re-running time intensive steps. In RStudio you must re-knit the entire document if you want to see any changes.

So if it’s not clear already how this solves the problem of messy exploratory data analysis, here’s my perspective. Using IPython notebook, on a whim I can launch into a computing task, and my notebook will keep a faithful record of all that I did. At any point, I can save the notebook and close it, and then come back to it later. I can go back through later and add notes, or clean up any existing notes. I can easily publish it online for my collaborators to see, and if they’re so inclined, they can easily download and edit the notebook themselves!

So whether you prefer R or Python for your data analysis needs, we now have excellent free tools at our disposal to take a lot of the pain out of your scientific computing.