I’m doing more and more of my data analysis with Python Pandas these days, but I’m far from an expert and still much more comfortable with R. The reason I persist, though, is that Python provides a complete data processing and analysis stack. In the past, I’ve typically used Perl or Python scripts to pre-process data, R to do statistics and visualization, and shell scripts to glue it all together. Python’s rapidly maturing libraries for numerics, statistics, and data analytics, together with an active community and the awesome IPython notebook, make the prospect of doing end-to-end analysis completely in Python quite compelling.
That said, I still have a lot to learn regarding Pandas. Earlier today I was doing some simple operations—taking a subset of a data frame and creating a new column by applying a transformation to another column—and I kept getting a cryptic
SettingWithCopyWarning message. Here’s a dummy example that reproduces the message.
>>> # Load data into memory >>> import pandas >>> exons = pandas.io.parsers.read_table('test.dat') >>> exons Species ID Length Context 0 Equus monoceros e1 162 stop 1 Homo diutinus e2 111 start 2 Draconis occidentalis e3 51 cds 3 Basiliscus vulgaris e4 114 complete 4 Equus pegasus e5 95 utr >>> >>> # Subset the data >>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') | (exons.Species == 'Equus monoceros')] >>> ungulates Species ID Length Context 0 Equus monoceros e1 162 stop 4 Equus pegasus e5 95 utr >>> >>> # Create a new `LogLength` column >>> import math >>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10)) __main__:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy >>>
I went on to plot the data and there didn’t seem to be any problems. I also followed the link provided in the warning message and read the documentation. If I’m reading things correctly, the warning is issued to inform users when they might be operating on a copy of the original data, rather than the original data itself. In this example, if I make changes to the
ungulates variable and want those changes reflected in the original
exons variable, then I do not want to be operating on a copy of the data but the original data itself.
However, in my case it wasn’t really a huge concern, so I was able to get rid of the warning message by making my intent to work on a copy of the data explicit.
>>> # Subset the data, but make an explicit copy >>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') | (exons.Species == 'Equus monoceros')].copy() >>> ungulates Species ID Length Context 0 Equus monoceros e1 162 stop 4 Equus pegasus e5 95 utr >>> >>> # Create a new `LogLength` column >>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10)) >>> >>> # No errors this time, woohoo! >>>
I’ve written before about literate programming and how I think this could be a big game changer for transparency and reproducibility in science, especially when it comes to data analysis (vs more traditional software engineering). Well, GitHub announced recently that IPython/Jupyter notebooks stored in GitHub repositories will be rendered, rather than presented as raw JSON text as before. This is a very nice development, making it even easier to share data analysis results with others!
In next-generation sequence data, duplicates of non-biological origin (such as PCR duplicates) are fairly common and can complicate tasks such as assembly and expression quantification, and jeopardize the accuracy of the results. If reads are sampled at random from a genome, it’s very unlikely that two reads will be sampled from exactly the same position. For paired-end data especially, there is essentially no other explanation for two read pairs being identical other than some kind of amplification or other technical artifact. Obviously, it’s important to scan our NGS data for duplicates and remove them as part of our preliminary quality control. This is a pretty simple task, so software solutions should abound, right? Right!?!?
Well, it’s very possible that I’m missing something, but I’ve only been able to find a single tool for removal of duplicate reads (FastUniq, described in this PLoS ONE paper). I did see tools for removal of duplicate read alignments, but that’s not going to help much if you’re working on a de novo assembly. I went ahead and downloaded FastUniq, ran it on my data, and verified the results. The user interface was a bit kludgy, but boy have I seen worse!! At least it was well documented and worked as advertised.
Later, I was working with two different annotated genome assemblies for the same organism, which I had parsed into individual loci based on gene content. I was interested in determining how many loci were strictly identical between the two assemblies. As I started writing code to do this task, I realized how similar it was to the simple task of removing duplicate reads. In duplicate read removal, we store read sequences in memory, and then whenever we encounter a sequence we’ve already seen before we discard it. Now, for identifying identical loci, the task was similar, but instead I wanted to report “duplicates” instead of discarding them1.
After realizing this, I took a few minutes to extend and generalize my code so that it could handle both use cases. I then tested it on a data set I had just processed with FastUniq. My code ran in 29 minutes, whereas FastUniq had required 69 minutes. This does not include the time required to convert between paired and interleaved data (FastUniq requires paired files, my code requires interleaved), nor does it include the time required to shuffle the FastUniq output2. So some simple of-the-cuff Python code was able to outperform the only tool I could find for removing duplicates de novo.
I wrapped up my code as a Python package and posted it on Github. It still needs a bit of polishing, which I may or may not ever get around to doing.
1Because the latter approach involves storing longer genomic sequences, I wrote the code to store SHA1 hashes of each sequence, rather than the sequence itself. This drastically reduced memory consumption without an appreciable increase in runtime.
2For some reason that I did not understand from my very cursory skimming of the FastUniq paper, the FastUniq tool includes a sorting step. Therefore, the first read of the output is the read with the most consecutive adenines (As) at the beginning of the read. Looking at the output set off all kinds of red flags and alarms in my brain. Anyone who has taken a course on data structures or algorithms knows that weird things can happen performance-wise when unsorted data is expected but sorted data is provided (and all NGS analysis tools expect unsorted data). I would not be comfortable using FastUniq’s output directly without first shuffling the data, which only increases the margin of improvement of my Python code over FastUniq!
A substantial portion of scientific software is written by scientists with little or no formal training in software engineering. As a result, many scripts and programs are untested, unreliable, a huge pain to install or configure, and unlikely to be accessible for more than a few months or years. One does not need to be a software engineer to write good, useful software. For example, simply adding a reasonably descriptive command-line interface to a script or program does not take much work, and can go a long way in making your program easy to use. Placing the software under version control and adding automated tests also provide a huge benefit without a significant amount of additional effort. An aside: I was recently certified as an instructor for Software Carpentry, which provides excellent 2-day workshops introducing these topics at a beginner level to scientists.
My point is that if you’re writing software to be distributed amongst collaborators and colleagues, there are well established software engineering practices to make this a painless experience for everyone involved, and learning the basics doesn’t require much work.
However, a lot of scientific computing does not involve writing scripts or programs to be used by a wide audience. Often, you need to explore the data. You want to compute a variety of statistics and visualize data distributions. You need to download other data sets for comparison, and you feverishly take notes to make sure you can repeat the analysis later if needed—and let’s just be honest here, of course you’re going to have to redo it later! Exploratory data analysis is a tedious and frustratingly messy process. We often have to throw everything we know about sound software engineering out the window in the name of simply getting crap done. And even if you are able to organize your data and scripts and notes just so, sharing this with colleagues and disclosing details for publication can also be very messy.
For over a year I had been hearing a lot about RStudio and IPython Notebook, but it wasn’t until I finally tried them a couple of months ago that I realized how big of a game changer these tools are for data analysis. Both facilitate a kind of literate programming where natural language (like English prose) can be interspersed among source code and the output produced by executing that code. This facilitates placing an entire analysis in a single easy-to-edit document, which renders beautifully and seamlessly with formatted text, blocks of syntax-highlighted source code, and code output, including both text and graphics. These individual documents are easily placed under version control and shared with colleagues via email, GitHub, or online hosting services like RPubs and IPyNb Viewer.
So how do RStudio and IPython Notebook work? RStudio provides an integrated development environment (IDE) for R programming. In that environment, you can create an Rmd (R Markdown) file for your analysis. You can add text to describe your analysis, and Markdown allows you to format the text with headers, bold, italics, hyperlinks, and so on. Any code that you want to execute is placed in code blocks, and then when you want to execute that code you click the “knit HTML” button. This will create an HTML document with your formatted text, your syntax-highlighted code, and any output produced by the code (whether it be text or a graphic).
The IPython Notebook is pretty similar, except that the notebook runs in your web browser instead of an IDE. Perhaps the biggest functional difference between the two is that in IPython Notebook, you can run and re-run individual code blocks independently and see the results. This is helpful if you want to rapidly test small changes without re-running time intensive steps. In RStudio you must re-knit the entire document if you want to see any changes.
So if it’s not clear already how this solves the problem of messy exploratory data analysis, here’s my perspective. Using IPython notebook, on a whim I can launch into a computing task, and my notebook will keep a faithful record of all that I did. At any point, I can save the notebook and close it, and then come back to it later. I can go back through later and add notes, or clean up any existing notes. I can easily publish it online for my collaborators to see, and if they’re so inclined, they can easily download and edit the notebook themselves!
So whether you prefer R or Python for your data analysis needs, we now have excellent free tools at our disposal to take a lot of the pain out of your scientific computing.