Assert statements: software crumple zones

Like many bioinformaticians, I have very little formal training in software engineering. As an undergrad, I got a computer science minor and completed coursework in data structures, algorithms, logic, and programming, but never in the art and craft of software engineering. I feel confident in saying that I have come a long way in grad school—despite the lack of incentives for good software engineering in academic research—but what I know now I’ve learned from reading blogs and StackOverflow posts, from using and contributing to open source software projects, and from good old-fashioned trial-and-error.

In the early days of my software development experience, I learned the value of assert statements. Assertions let you explicitly declare assumptions about a program’s state at a particular point in its execution, and they terminate the program immediately if those assumptions are violated. Liberal use of assertions, coupled with automated unit tests and functional tests, helps you quickly identify what’s wrong when your software isn’t working.

Also in my early days, I picked up on the (supposed) axiom that

Assert Statements Should Never Be Used in Production Code™

While the value of assertions during development is undisputed, crashing a website or a mobile app in production is considered a cardinal sin of software engineering. The idea, then, is that you use assertions while writing the software, hope and pray that your automated tests cover a sufficiently broad set of use cases (hint: they don’t), and then disable the assertions when you’re ready to publish your app or launch your website, so that your users are spared the (cyber-)carnage of an all-out crash when your software fails.

A recent post on Twitter (by a professional software engineer working for Apple, I think) got me thinking about this again.

I like the analogy with crumple zones in cars, which are designed to collapse in case of an accident and absorb impact to protect more vital parts of the vehicle. If you follow the tweet and read the replies, it’s clear that many software developers are beginning to recognize the damage mitigation potential of asserts (even in production settings) and that there are worse things than crashing, especially when it comes to database integrity or (even more importantly) sensitive user data.

Furthermore, scientific research software is not the same as the software powering dating websites or mobile phone apps (shocking, I know). The target user of most phone and web apps is often your average shopper, single adult, or business person. The target user of most scientific software, however, is one of a relatively small number of scientists with related research interests and the necessary background to explore a particular set of research questions. Informative assert statements go a long way toward helping you and your colleagues understand whether the software failed because of corrupt/erroneous data, because of a bug in the code, or because of faulty assumptions made by the programmer. And while it’s frustrating to have software crash in any situation, assertions give you the opportunity to crash the program gracefully rather than continuing in an undefined state and running the risk of crashing later with cryptic messages like “Segmentation Fault”, or worse, not crashing at all. There is nothing more terrifying as a scientist than to base your conclusions on incomplete or flawed data that the software should have warned you about.

So to those just getting started in writing scientific research software, I make the following recommendations.

Use assert statements liberally

Use them at the beginning of functions to test your assumptions about the function arguments/parameters. Use them in the middle of functions to test your assumptions about the results of intermediate computations. Use them at the end of functions to test your assumptions about return values. Use them anywhere and everywhere you can to make your assumptions about the data and code explicit.
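
Here is a minimal sketch in Python of what this looks like in practice (the function and its checks are hypothetical, purely to illustrate placement):

import math

def mean_phred_quality(scores):
    # Assumptions about the arguments
    assert len(scores) > 0, 'expected at least one quality score'
    assert min(scores) >= 0, 'Phred quality scores cannot be negative'

    # Assumption about an intermediate computation
    error_probs = [10 ** (-q / 10.0) for q in scores]
    mean_prob = sum(error_probs) / len(error_probs)
    assert 0.0 < mean_prob <= 1.0, 'mean error probability out of range'

    # Assumption about the return value
    quality = -10.0 * math.log10(mean_prob)
    assert quality >= 0.0, 'mean quality cannot be negative'
    return quality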

Provide informative messages

Having an assertion is usually better than not having an assertion, but a good message can make a huge difference in whether anybody can make sense of the error, including you when, three months after last looking at the code, you try to re-run it while addressing reviewer comments on your manuscript.

This is what I would call “good”.

Assertion failed: (ni == ne - 1), function do_infer, file main.c, line 193

This is what I would call “better”.

Assertion failed: (num_introns == num_exons - 1), function infer_gene_structure,
    file main.c, line 193

And this is what I would call “best”.

Error: while inferring gene structure, found 7 exons and 5 introns (expected 6).
Please check your annotations for directly abutting exon features.
Assertion failed: (num_introns == num_exons - 1), function infer_gene_structure,
    file main.c, line 193
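
In Python, one way to get the “best” style is to build the context and the hint into the assertion message itself. A sketch mirroring the hypothetical example above:

def infer_gene_structure(exons, introns):
    assert len(introns) == len(exons) - 1, (
        'while inferring gene structure, found %d exons and %d introns '
        '(expected %d); please check your annotations for directly '
        'abutting exon features' % (len(exons), len(introns), len(exons) - 1)
    )
    # ...proceed with inference...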

Provide contact info and/or links to a bug tracker

Your software’s source code is already hosted on a service like GitHub or BitBucket, right? If so, your software already has a bug tracker. Include a link to the bug tracker so that users can easily contact you regarding any issues they encounter. I generally prefer GitHub’s issue tracker to email, since the bug report is public and I can refer to it in patches and pull requests. But at the very least, users should know how to contact the author of the software in case they run into trouble.

Reproducible software behavior with random seeds

Selecting a random subset of data or generating random numbers is a fairly common bioinformatics programming task. However, verifying correct behavior of software with a random component can be challenging for the uninitiated. Presumably the scientist writing the code would run the software on a handful of small examples and manually check to ensure the output is correct. But it’s not realistic to do this every time the software runs. How can one verify the behavior of such a program in an automated fashion? Alternatively, what if you find a case in which the program produces incorrect output? How do you reproduce that specific case for troubleshooting and testing?

The behavior of a programming language’s random number generator and related features can be made predictable with the use of a random seed. Initializing a random number generator with a seed ensures that the same “random” numbers are produced each time. For example, if you have a program that samples 500 paired-end reads from random locations in a 100 kb sequence of genomic DNA, running the program multiple times with the same random seed should produce the exact same 500 reads. You don’t want this kind of behavior in a production end-user environment, but for development and testing it turns out to be very useful.

When it comes to writing research software that has a random component, I would make the following recommendations.

  • Always include an option to set the random seed. Whether your software is a simple R function, a command-line Python script, or a robust C++ library, the interface should always include an option for the user to set the random seed. This gives them (and you) the ability to reproduce specific cases and to troubleshoot or verify the software’s behavior.
  • Always report the random seed. Whether or not the user provides a specific random seed, reporting the seed used is crucial for reproducing the software’s behavior. When the code does not explicitly set the seed, programming languages will typically seed the generator internally from the current system time, and it’s not always possible to determine the exact value used. Therefore, when the end user does not specify a seed, a good approach is to generate a random number, report that number to the user, and then re-initialize the random number generator using it as the seed. Subsequent invocations of the program can then reproduce the behavior by supplying the same seed.

Here is an example: a Python script that does an admittedly trivial task involving random numbers, but demonstrates how to get predictable behavior with random seeds.

#!/usr/bin/env python

import argparse
import random
import sys

# Define command-line interface
parser = argparse.ArgumentParser(description='Generate random integers')
parser.add_argument('-m', '--max', type=int, default=100,
                    help='maximum value; default is 100')
parser.add_argument('-s', '--seed', type=int, help='random seed')
parser.add_argument('n', type=int, help='# of random integers to generate')
args = parser.parse_args()

# Pick a random seed if the user did not provide one; compare with None
# so that 0 is still accepted as a valid seed
if args.seed is None:
    args.seed = random.randint(0, 2**32 - 1)
# Set and report the random seed
random.seed(args.seed)
print('Random seed:', args.seed, file=sys.stderr)

# Do the trivial task
print([random.randint(1, args.max) for _ in range(args.n)])

And here is a demonstration of how the script works.
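
Assuming the script is saved as randint.py (a filename I’m making up here), a session looks something like the following; I’ve elided the integer lists themselves, since the exact values depend on the RNG implementation, but the two seeded runs print identical lists.

$ ./randint.py --seed 42 5
Random seed: 42
[...]
$ ./randint.py --seed 42 5
Random seed: 42
[...]

Omit --seed and the script picks a fresh seed, reports it, and uses it, so any interesting run can still be reproduced later.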

My comments on “software solutions for big biology” paper

A paper by Philip Bourne (NIH Associate Director for Data Science) and colleagues came out a couple of weeks ago highlighting some of the issues with software in biology research. The focus of the paper is the scalability of “big data” type solutions, but most of the paper is relevant to all bioinformatics software. Speaking of big data, this Twitter post gave me a good chortle recently.

But I digress…

I really just wanted to highlight and comment on a couple of points from the paper.

  • Biologists are not trained software developers: The paper makes the point that the majority of biologists have zero training in software engineering best practices, and as a result there is a pervasive culture of poorly-designed short-lived “disposable” research software out in the wild. I agree completely with their assessment (in fact I blew a gasket over this as a new Ph.D. student) and think all biologists could benefit from some minimal training in software engineering best practices. However, I think it’s important to emphasize that biologists do not need to become software engineers to write good reproducible research software. In fact, I speak from experience when I say you can spend a lot of time worrying about how software is engineered at the expense of actually using the software to do science. We need to make it clear that nobody expects biologists to become pro programmers, but that investing in some software training early on can yield huge dividends throughout a scientific career.
  • Attribution for bioinformatics software is problematic: The paper emphasizes that papers in “high-impact” journals, even those with a strong bioinformatics component, rarely feature a bioinformatician as first or last author. I get the impression that things are improving ever-so-slowly, but some fairly recent comments from E. O. Wilson make it pretty clear that as a community we still have a long way to go (granted Wilson was talking about stats/math, but his sentiment applies to informatics and software as well).
  • Bioinformatics is a scientific discipline in its own right: and bioinformaticians need career development. ‘Nuff said.
  • Assessment of contribution: One of the final points they make in the paper is that with distributed version control tools and social coding platforms like GitHub, every (substantial) software version can be assigned a citable DOI, and relative author contributions can be assessed by looking at the software’s revision history. I am a huge proponent of version control, but this last point about looking at git logs for author contributions doesn’t strike me as very helpful. It may be better than the obligatory vague “Author Contributions” section of the 5+ papers I read this week (A.B., C.D., and E.F. designed the research. A.B. and G.H. performed the research. A.B., E.F., and G.H. wrote the paper.), but only marginally better. Number of revisions committed, number of lines added/removed, and most other metrics easily tracked on GitHub are pretty poor indicators of technical AND intellectual contribution to software development. I think we would be much better off enforcing clearer guidelines for explicitly stating the contributions made by each author.

Overall, I think it was a good piece, and I hope it represents a long-awaited change in the academic community with respect to bioinformatics software.

Basic summary stats for all genomes at NCBI

I’m wrapping up a manuscript this week (woo hoo!), and at one point was looking for a reference for the claim that a genome-wide GC content of ≈30% is one of the lowest observed in any animal genome. I looked at a couple of recently published animal genomes, and couldn’t even find a mention of the genome’s overall nucleotide composition. I know all of the data is accessible from NCBI, but there are hundreds of published animal genomes, and I was not keen on downloading all those genome sequences and computing GC content. Plus, I mean come on, NCBI should have this type of information easily accessible somewhere, right?

It turns out my intuition was correct. There is a “Browse by organism” link on NCBI’s genomes page that brings up a tabular readout of basic summary stats for all genome assemblies submitted to NCBI. Finding the information I needed was as simple as clicking the “Eukaryotes” tab, filtering by group (“Animals”), and sorting by %GC content (they also provide info on genome size and gene content). After skipping the entries with missing values for %GC content, I was able to confirm the claim, so I referenced this resource in the paper.

Pandas: querying a data frame and the `SettingWithCopyWarning` message

I’m doing more and more of my data analysis with Python Pandas these days, but I’m far from an expert and still much more comfortable with R. The reason I persist, though, is that Python provides a complete data processing and analysis stack. In the past, I’ve typically used Perl or Python scripts to pre-process data, R to do statistics and visualization, and shell scripts to glue it all together. Python’s rapidly maturing libraries for numerics, statistics, and data analytics, together with an active community and the awesome IPython notebook, make the prospect of doing end-to-end analysis completely in Python quite compelling.

That said, I still have a lot to learn regarding Pandas. Earlier today I was doing some simple operations—taking a subset of a data frame and creating a new column by applying a transformation to another column—and I kept getting a cryptic SettingWithCopyWarning message. Here’s a dummy example that reproduces the message.

>>> # Load data into memory
>>> import pandas
>>> exons = pandas.io.parsers.read_table('test.dat')
>>> exons
                 Species  ID  Length   Context
0        Equus monoceros  e1     162      stop
1          Homo diutinus  e2     111     start
2  Draconis occidentalis  e3      51       cds
3    Basiliscus vulgaris  e4     114  complete
4          Equus pegasus  e5      95       utr
>>>
>>> # Subset the data
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')]
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> import math
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
__main__:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
>>>

I went on to plot the data and there didn’t seem to be any problems. I also followed the link provided in the warning message and read the documentation. If I’m reading things correctly, the warning is issued to inform users when they might be operating on a copy of the original data rather than on the original data itself. In this example, if I make changes to the ungulates variable and want those changes reflected in the original exons variable, then I do not want to be operating on a copy of the data but on the original data itself.
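
For the record, if I did want the changes to land in exons itself, the .loc assignment the warning suggests would look something like this (my sketch, in the style of the session above, not part of that actual session):

>>> # Assign through .loc on the original frame so the new column is
>>> # set on `exons` itself for the selected rows
>>> mask = ((exons.Species == 'Equus pegasus') |
            (exons.Species == 'Equus monoceros'))
>>> exons.loc[mask, 'LogLength'] = exons.loc[mask, 'Length'].apply(
        lambda x: math.log(x, 10))
>>>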

However, in my case it wasn’t really a huge concern, so I was able to get rid of the warning message by making my intent to work on a copy of the data explicit.

>>> # Subset the data, but make an explicit copy
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')].copy()
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
>>>
>>> # No errors this time, woohoo!
>>>

If you’re interested in seeing the interactive Python interpreter in action, check out this asciicast.

Great discussion on research software: linkfest

I’ve been following a weeks-long (months-long?) social media discussion on research software that has been very thought-provoking. The questions being discussed include the following.

  • What should we expect/demand of software to be “published” (in the academic sense)?
  • What should community standards of replicability/reproducibility be?
  • Both quick n’ dirty prototypes and robust, well-tested platforms are beneficial to scientific research. How do we balance the need for both? What should our expectations be for research software that falls at various points along that continuum?

I’ve been hoping to weigh in on the discussion with my own two cents, but I keep finding more and more great reading on the topic, both on Twitter and on blogs. So rather than writing up (and finishing formulating!) my opinions on the topic(s), I think I’ll punt and just share some of the highlights from my reading. Linkfest below.

Leading zeros with bash brace expansion

I use bash brace expansion extensively, both in interactive terminal sessions and in shell scripts. The idea is pretty simple: the syntax {n..m} is a shortcut for the list of all integers from n to m, inclusive. I’ve used this syntax several times in examples on this blog, including a recent post about task automation in bash.

Today one of my social media acquaintances posted the following on Twitter.

The implication is pretty clear: if you want the values in the array to have leading zeros, just add the leading zeros to the first number in the range. I checked that this works for arbitrary ranges, so things like {001..200} and {00001..50000} also work as expected.
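
For example (zero-padded ranges require bash 4.0 or newer, if I remember correctly):

$ echo {1..5}
1 2 3 4 5
$ echo {01..5}
01 02 03 04 05
$ echo {098..102}
098 099 100 101 102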

This is a very nice feature I didn’t know about before. Even when you’re lucky enough to have a data set that is nicely organized with well-thought-out filenames, it’s a 50/50 chance whether serially numbered files/samples will have leading zeros or not. This bash feature makes it dead simple to handle both cases!

GitHub now renders IPython/Jupyter Notebooks!

I’ve written before about literate programming and how I think this could be a big game changer for transparency and reproducibility in science, especially when it comes to data analysis (vs more traditional software engineering). Well, GitHub announced recently that IPython/Jupyter notebooks stored in GitHub repositories will be rendered, rather than presented as raw JSON text as before. This is a very nice development, making it even easier to share data analysis results with others!

Embeddable image links for asciicasts

A couple years ago I stumbled across ascii.io (which is now asciinema.org), a phenomenal tool for recording terminal sessions. Command-line mechanics is one of the biggest barriers to entry for scientists learning informatics, and I’ve found this tool to be much more useful than screenshots or copied-and-pasted text for communicating terminal behavior to students and colleagues.

The ability to embed asciicasts has been around for a while, but because this option relies on JavaScript it has its limits. For example, any script tags I paste into a blog post on WordPress will be stripped out, and the same is true for Markdown files hosted on GitHub. So up until now I’ve simply been using hyperlinks to refer to asciicasts on pages where I cannot embed JavaScript.

A new option became available recently: you can now embed image links for your asciicasts. Because these are simply images (probably dynamically generated and cached on the asciinema server), they can be placed pretty much anywhere. And although it’s still not as nice as a truly embedded player, the image link looks and behaves almost like a player.
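
In a Markdown file, for instance, the image link can itself be wrapped in a regular hyperlink so that clicking the “player” image opens the real player. The .png endpoint below reflects my understanding of asciinema’s URL scheme, so double-check it against their docs and substitute your own cast’s ID:

[![asciicast](https://asciinema.org/a/<id>.png)](https://asciinema.org/a/<id>)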

Now that this feature is available, I have gone and added an asciicast link to one of my GitHub repos. I’m quite pleased with this feature and look forward to putting it to good use in various other ways!

Simulating BS-seq experiments with Sherman

The bioinformatics arm of the Babraham Institute produces quite a few software packages, including some like FastQC that have achieved nearly ubiquitous adoption in the academic community. Other offerings from Babraham include Bismark for methylation profiling and Sherman for simulating NGS reads from bisulfite-treated DNA.

In my computational genome science class, there was a student who wanted to measure the accuracy of methylation profiling workflows, and he identified Bismark and Sherman as his tools of choice. He would identify a sequence of interest, use Sherman to simulate reads from that sequence, run the methylation calling procedure, and then assess the accuracy of the methylation calls.

There was a slight problem, though: Sherman converts Cs to Ts at random, and the number of conversions can only be controlled indirectly by parameters like error rate and conversion rate. Sherman provides no way to “protect” a C from conversion the way a methyl group does on actual genomic DNA. So there was no way to assess the accuracy of the methylation profiling workflow, since we had no way of indicating/knowing which Cs should be called methylated!

After a bit of chatting, however, we came up with a workaround. In our genomic DNA fasta file, any Cs we want to protect from conversion (i.e. methylate them in silico) we simply convert to X. Then we run Sherman, which will convert Cs to Ts at the specified conversion rate but will leave Xs alone. Then, after simulating the reads but before the methylation analysis procedure, we simply change Xs back to Cs. This seemed to pass some sanity tests for us, and we contacted the author of Sherman, Felix Krueger (@FelixNmnKrueger), who confirmed that he saw no potential pitfalls with the approach.
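
A minimal sketch of the two conversions (the function names are mine, and they assume you already have the 0-based positions of the Cs you want to protect):

def protect_methylated_sites(sequence, methylated_positions):
    """Mask the given Cs as Xs so Sherman will not convert them."""
    chars = list(sequence)
    for pos in methylated_positions:
        assert chars[pos] in 'Cc', 'position %d is not a C' % pos
        chars[pos] = 'X'
    return ''.join(chars)

def restore_methylated_sites(sequence):
    """Convert Xs in the simulated reads back to Cs before methylation calling."""
    return sequence.replace('X', 'C').replace('x', 'c')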

I like this little hack, and assuming I ever use it myself in the future, I will probably create a script that does the C -> X conversion from a GFF3 file or a list of “methylation sites” in some other format (the conversion from X back to C is trivial).