Like many bioinformaticians, I have very little formal training in software engineering. As an undergrad, I got a computer science minor and completed coursework in data structures, algorithms, logic, and programming, but never in the art and craft of software engineering. I feel confident in saying that I have come a long way in grad school—despite the lack of incentives for good software engineering in academic research—but what I know now I’ve learned from reading blogs and StackOverflow posts, from using and contributing to open source software projects, and from good old-fashioned trial-and-error.
In the early days of my software development experience, I learned the value of assert statements. Assertions allow you to explicitly declare assumptions about a program’s state at a particular point in time, and will terminate the program immediately if those assumptions are violated. Liberal use of assertions, coupled with automated unit tests and functional tests, help you to quickly identify where things are wrong when your software isn’t working.
Also in my early days, I picked up on the (supposed) axiom that
Assert Statements Should Never Be Used in Production Code™
While the value of assertions during development is undisputed, crashing a website or a mobile app in production is proclaimed as a serious software engineering cardinal sin. Thus, the idea is that you use assertions while writing the software, hope and pray that your automated tests cover a sufficiently broad set of use cases (hint: they don’t), and then when you’re ready to publish your app or launch your website you disable the assertions so that your users are spared the (cyber-)carnage of an all-out crash when your software fails.
A recent post on Twitter (by a professional software engineer working for Apple, I think) got me thinking about this again.
I like the analogy with crumple zones in cars, which are designed to collapse in case of an accident and absorb impact to protect more vital parts of the vehicle. If you follow the tweet and read the replies, it’s clear that many software developers are beginning to recognize the damage mitigation potential of asserts (even in production settings) and that there are worse things than crashing, especially when it comes to database integrity or (even more importantly) sensitive user data.
Furthermore, scientific research software is not the same as the software powering dating websites or mobile phone apps (shocking, I know). The target user of most phone and web apps is often your average shopper, single adult, or business person. The target user of most scientific software, however, is one of a relatively small number of scientists with related research interests and the necessary background to explore a particular set of research questions. Informative assert statements go a long way toward helping you and your colleagues understand whether the software failed because of corrupt/erroneous data, because of a bug in the code, or because of faulty assumptions made by the programmer. And while it’s frustrating to have software crash in any situation, assertions give you the opportunity to crash the program gracefully rather than continuing in an undefined state and running the risk of crashing later with cryptic messages like “Segmentation Fault”, or worse, not crashing at all. There is nothing more terrifying as a scientist than to base your conclusions on incomplete or flawed data that the software should have warned you about.
So to those just getting started in writing scientific research software, I make the following recommendations.
Use assert statements liberally
Use them at the beginning of functions to test your assumptions about the function arguments/parameters. Use them in the middle of functions to test your assumptions about the results of intermediate computations. Use them at the end of functions to test your assumptions about return values. Use them anywhere and everywhere you can to make your assumptions about the data and code explicit.
Provide informative messages
Having an assertion is usually better than not having an assertion, but a good message can make a huge difference in whether anybody can make sense of the error–including you, after not having looked at the code for 3 months, and trying to re-run it while addressing reviewer comments on your manuscript.
This is what I would call “good”.
Assertion failed: (ni == ne - 1), function do_infer, file main.c, line 193
This is what I would call “better”.
Assertion failed: (num_introns == num_exons - 1), function infer_gene_structure, file main.c, line 193
And this is what I would call “best”.
Error: while inferring gene structure, found 7 exons and 5 introns (expected 6). Please check your annotations for directly abutting exon features. Assertion failed: (num_introns == num_exons - 1), function infer_gene_structure, file main.c, line 193
Provide contact info and/or links to a bug tracker
Your software’s source code is already hosted on a service like GitHub or BitBucket, right? If so, your software already has a bug tracker. Include a link to the bug tracker so that users can easily contact you regarding issues they have. I generally prefer using GitHub’s issue tracker to email, since the bug report is public and I can refer to it with patches and pull requests. But in the very least, the users should know how to contact the author of the software in case they run into trouble.