A substantial portion of scientific software is written by scientists with little or no formal training in software engineering. As a result, many scripts and programs are untested, unreliable, a huge pain to install or configure, and unlikely to be accessible for more than a few months or years. One does not need to be a software engineer to write good, useful software. For example, simply adding a reasonably descriptive command-line interface to a script or program does not take much work, and can go a long way in making your program easy to use. Placing the software under version control and adding automated tests also provide a huge benefit without a significant amount of additional effort. An aside: I was recently certified as an instructor for Software Carpentry, which provides excellent 2-day workshops introducing these topics at a beginner level to scientists.
My point is that if you’re writing software to be distributed amongst collaborators and colleagues, there are well established software engineering practices to make this a painless experience for everyone involved, and learning the basics doesn’t require much work.
However, a lot of scientific computing does not involve writing scripts or programs to be used by a wide audience. Often, you need to explore the data. You want to compute a variety of statistics and visualize data distributions. You need to download other data sets for comparison, and you feverishly take notes to make sure you can repeat the analysis later if needed—and let’s just be honest here, of course you’re going to have to redo it later! Exploratory data analysis is a tedious and frustratingly messy process. We often have to throw everything we know about sound software engineering out the window in the name of simply getting crap done. And even if you are able to organize your data and scripts and notes just so, sharing this with colleagues and disclosing details for publication can also be very messy.
For over a year I had been hearing a lot about RStudio and IPython Notebook, but it wasn’t until I finally tried them a couple of months ago that I realized how big of a game changer these tools are for data analysis. Both facilitate a kind of literate programming where natural language (like English prose) can be interspersed among source code and the output produced by executing that code. This facilitates placing an entire analysis in a single easy-to-edit document, which renders beautifully and seamlessly with formatted text, blocks of syntax-highlighted source code, and code output, including both text and graphics. These individual documents are easily placed under version control and shared with colleagues via email, GitHub, or online hosting services like RPubs and IPyNb Viewer.
So how do RStudio and IPython Notebook work? RStudio provides an integrated development environment (IDE) for R programming. In that environment, you can create an Rmd (R Markdown) file for your analysis. You can add text to describe your analysis, and Markdown allows you to format the text with headers, bold, italics, hyperlinks, and so on. Any code that you want to execute is placed in code blocks, and then when you want to execute that code you click the “knit HTML” button. This will create an HTML document with your formatted text, your syntax-highlighted code, and any output produced by the code (whether it be text or a graphic).
The IPython Notebook is pretty similar, except that the notebook runs in your web browser instead of an IDE. Perhaps the biggest functional difference between the two is that in IPython Notebook, you can run and re-run individual code blocks independently and see the results. This is helpful if you want to rapidly test small changes without re-running time intensive steps. In RStudio you must re-knit the entire document if you want to see any changes.
So if it’s not clear already how this solves the problem of messy exploratory data analysis, here’s my perspective. Using IPython notebook, on a whim I can launch into a computing task, and my notebook will keep a faithful record of all that I did. At any point, I can save the notebook and close it, and then come back to it later. I can go back through later and add notes, or clean up any existing notes. I can easily publish it online for my collaborators to see, and if they’re so inclined, they can easily download and edit the notebook themselves!
So whether you prefer R or Python for your data analysis needs, we now have excellent free tools at our disposal to take a lot of the pain out of your scientific computing.