A paper by Philip Bourne (NIH Associate Director for Data Science) and colleagues came out a couple of weeks ago highlighting some of the issues with software in biology research. The focus of the paper is the scalability of “big data” type solutions, but most of the paper is relevant to all bioinformatics software. Speaking of big data, this Twitter post gave me a good chortle recently.
But I digress…
I really just wanted to highlight and comment on a couple of points from the paper.
- Biologists are not trained software developers: The paper makes the point that the majority of biologists have zero training in software engineering best practices, and as a result there is a pervasive culture of poorly-designed short-lived “disposable” research software out in the wild. I agree completely with their assessment (in fact I blew a gasket over this as a new Ph.D. student) and think all biologists could benefit from some minimal training in software engineering best practices. However, I think it’s important to emphasize that biologists do not need to become software engineers to write good reproducible research software. In fact, I speak from experience when I say you can spend a lot of time worrying about how software is engineered at the expense of actually using the software to do science. We need to make it clear that nobody expects biologists to become pro programmers, but that investing in some software training early on can yield huge dividends throughout a scientific career.
- Attribution for bioinformatics software is problematic: The paper emphasizes that papers in “high-impact” journals, even those with a strong bioinformatics component, rarely feature a bioinformatician as first or last author. I get the impression that things are improving ever-so-slowly, but some fairly recent comments from E. O. Wilson make it pretty clear that as a community we still have a long way to go (granted Wilson was talking about stats/math, but his sentiment applies to informatics and software as well).
- Bioinformatics is a scientific discipline in its own right: and bioinformaticians need career development. ‘Nuff said.
- Assessment of contribution: One of the final points they make in the paper is that with distributed version control tools and social coding platforms like GitHub, every (substantial) software version can be assigned a citable DOI and relative author contributions can be assessed by looking at the software’s revision history. I am a huge proponent of version control, but this last point about looking at git logs for author contributions doesn’t strike me as very helpful. It may be better than the obligatory vague “Author Contributions” section of the 5+ papers I read this week (A.B., C.D., and E.F. designed the research. A.B. and G.H. performed the research. A.B., E.F., and G.H. wrote the paper.), but only marginally better. Number of revisions committed, number of lines added/removed, and most other metrics easily tracked on GitHub are pretty poor indicators of technical AND intellectual contribution to software development. I think we would be much better off enforcing clearer guidelines for explicitly stating the contributions made by each author.
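To illustrate just how crude these metrics are, here is a minimal Python sketch that tallies lines added and removed per author from `git log --numstat` output. The log text below is fabricated for illustration; the point is that the tally cannot tell a whitespace cleanup from a key algorithmic insight.

```python
import re
from collections import defaultdict

def tally_numstat(log_text):
    """Tally lines added/removed per author from `git log --numstat` output."""
    totals = defaultdict(lambda: [0, 0])  # author -> [lines added, lines removed]
    author = None
    for line in log_text.splitlines():
        if line.startswith("Author:"):
            author = line.split("Author:", 1)[1].strip()
        else:
            # numstat lines look like: "<added>\t<removed>\t<path>"
            m = re.match(r"^(\d+)\t(\d+)\t", line)
            if m and author:
                totals[author][0] += int(m.group(1))
                totals[author][1] += int(m.group(2))
    return dict(totals)

# Fabricated log excerpt, for illustration only
log = """\
Author: A. B. <ab@example.org>

100\t2\tanalysis.py
Author: C. D. <cd@example.org>

3\t1\tREADME.md
"""
print(tally_numstat(log))
```

A.B. “wins” 100 lines to 3 here, but nothing in the numbers says whose contribution actually mattered more.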
Overall, I think it was a good piece, and I hope it represents a long-awaited change in the academic community with respect to bioinformatics software.
I’ve been following a weeks-long (months-long?) social media discussion on research software that has been very thought-provoking. The questions being discussed include the following.
- What should we expect/demand of software to be “published” (in the academic sense)?
- What should community standards of replicability/reproducibility be?
- Both quick n’ dirty prototypes and robust, well-tested platforms are beneficial to scientific research. How do we balance the need for both? What should our expectations be for research software that falls into various slots along that continuum?
I’ve been hoping to weigh in on the discussion with my own two cents, but I keep on finding more and more great reading on the topic, both from Twitter and from blogs. So rather than writing (and finishing formulating!) my opinions on the topic(s), I think I’ll punt and just share some of the highlights from my readings. Linkfest below.
I am and have been a huge proponent of reproducibility in scientific computation for most of my training. I think that reproducibility is important both on a technical level and on a conceptual level. Reproducibility at a technical level—what many call replicability—is the ability to successfully generate a similar/identical output given the source code and data. Reproducibility at the conceptual level—what we often refer to as reproducibility in a more general sense—is being able to get a consistent result using different methods (e.g. re-implement the logic in your own software), different data (e.g. do your own data collection on the subject of interest), or both. The latter—reproducibility at the conceptual level—is the ultimate test, and gives much stronger support for a theory/model/result than replicability.
In some ways, reproducing a computational result in biology is much easier than reproducing an experimental result. Computers are just machines* that execute sets of instructions (*gross oversimplification noted), and assuming I provide the same instructions (source code) and input (data) to another scientist, it should be trivial for them to exactly and precisely reproduce a result I previously computed.
However, I speak of an ideal world. In the real world, there are myriad technical issues: different operating systems; different versions of programming languages and software libraries; different versions of scientific software packages; poor workflow documentation; limited computing literacy; and a variety of other issues that make it difficult to reproduce a computation. Getting the “right” result depends highly on your environment or setup, your experience, and in some cases having some secret sauce (running commands/programs in the right order, using certain settings or parameter values, etc.). In this way, computational biology is a lot more like experimental biology than we sometimes let on.
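One small habit that chips away at the version problem is recording the computing environment alongside every result. A minimal sketch in Python (a fuller solution would also capture package versions, e.g. the output of `pip freeze`, and checksums of input data):

```python
import json
import platform
import sys

def environment_snapshot():
    """Capture basic provenance about the computing environment."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
    }

# Save the snapshot next to the analysis outputs
with open("environment.json", "w") as fh:
    json.dump(environment_snapshot(), fh, indent=2)
```

A file like this won’t make a computation replicable by itself, but it at least documents the versions in play when someone else tries.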
It doesn’t have to be this way. But it is.
There is a great discussion going on in the social media right now about what should be expected of academic scientists producing software in support of their research. I hope to contribute a dedicated blog post to this discussion soon, but it is still very nascent and I would like to let my thoughts marinate for a bit before I dive in. Several points are clear, however, that are directly relevant to the discussion of replicability and reproducibility.
- Replicability in computation supports reproducibility. I’m not aware of anyone who disagrees with this. My impression is that most disagreements are focused on what can reasonably be expected of scientists given the current incentive structure.
- Being unable to replicate a computational study isn’t the end of the world for academic science: important models and theories shouldn’t rely on a single result anyway. But the lack of computational replicability does make the social enterprise of academic science, already an expensive and inefficient ordeal under constant critical scrutiny, even more expensive and inefficient. Facilitating replicability in computation would substantially lower the activation energy required to achieve the kind of real reproducibility that matters in the long run.
- There are many academics in the life sciences at all levels (undergrad to grad student to postdoc to PI) that are dealing with more data and computation right now than their training has ever prepared them to deal with.
- A little bit of training can go a long way to facilitating more computational replicability in academic science. Check out what Software Carpentry is doing. Training like this may not benefit cases in which scientists deliberately obfuscate their methods to cover their tracks or to maintain competitive advantage, but it will definitely benefit cases in which a result is difficult to replicate simply because a well-intentioned scientist had little exposure to computing best practices. (Full disclosure: I am a certified instructor for Software Carpentry, although I receive no compensation for my affiliation or participation. I’m just a huge proponent of what they do!)
About a year ago, someone asked a question about terminology on the Biology Stack Exchange site: are the terms “bioinformatics” and “computational biology” synonyms, or do they refer to different things? My initial response was this.
I think this question is on topic here, although yes you would definitely get a lot of answers at BioStars. But consider this from the bioinformatics tag wiki on this site.
Bioinformatics is a broad field that interfaces a variety of life science disciplines (biology, genetics, biochemistry, biophysics, etc) with a variety of quantitative sciences (mathematics, statistics, computer science, engineering, etc). Bioinformatics techniques typically involve developing and applying software and algorithms to computationally intensive biological questions, such as those common in structural biology, genomics, sequence analysis, and systems biology.
Some scientists draw a distinction between the terms bioinformatics and computational biology. While these areas are indeed broad and diverse, the distinctions between the terms are not consistent or well-defined.
Case in point: @GWW’s answer cites two different definitions, while another has already been suggested in response to his answer (as a comment). More definitions are sure to come from additional answers, comments, and edits. None of these definitions are necessarily wrong, but in the same way none are “right,” as there is no objective way to determine which of the definitions is “better” than the others. If you were to ask 5 experts in the field, you would likely get 5 different definitions.
I still stand by this answer, but when the same topic came up on Twitter recently, Luis Pedro Coelho made an excellent point. He pointed out that while there is no unanimous consensus on the issue, the focus of the top-ranked journals in those fields is telling: Oxford’s Bioinformatics is definitely focused on the informatics side, while PLoS Computational Biology is definitely focused on biology. While this doesn’t change the fact that some will disagree on definitions, the types of articles each journal publishes do have a significant influence on how scientists view these fields.
I’ve written previously about ascii.io, a really cool application for recording terminal sessions. About a year ago, I asked the developer if it was possible to embed asciicasts.
In June (two days before my birthday, in fact), he announced that the feature was now implemented.
Somehow I missed this. This is a very exciting and useful feature. Now if only it could be embedded on a WordPress.com blog…
A recent post by Stephen Turner about the woes of posting code on your lab website really resonated with me. As a scientist I have occasionally clicked on a link or copy&pasted a URL from a paper, only to find that the web address I’m looking for no longer exists. Sure it’s frustrating in the short term, but in the long term it’s troubling to think that so much of the collective scientific output has such a short digital shelf life.
This happened to me again just yesterday. I was looking over this paper on algorithms for composition-based segmentation of DNA sequences, and I was interested in running one of the algorithms. The code, implemented in Matlab (good grief), is available (you guessed it!) from their lab website: http://nsm.uh.edu/~dgraur/eran/simulation/main.htm. Following that link takes you to a page with a warning that the lab website has moved, and if you follow that link you end up on the splash page for some department or institute that has no idea how to handle the redirect request. This paper is from 2010, and yet I can’t access supplements from their lab website! I did a bit of Google searching and found the first author’s current website, which even included links to the software mentioned in the paper, but unfortunately the links to the code point to the (now defunct) server published in the paper. I finally found the code buried in a Google Code project, and now I’m sitting here wondering whether it was really worth all the hassle in the first place, and whether I even want to check if our institution has a site license for Matlab…
With regards to my own research, I’ve been using hosting services like SourceForge, GitHub, and BitBucket to host my source code for years. However, I’ve continued using our lab server to host this blog, along with all the supplementary graphics and data that go along with it. I guess I initially enjoyed the amount of control I had. But after reading Stephen’s post, realizing how big of a problem this is in general, and of course thinking of all of the fricking SELinux crap I’ve had to put up with (our lab servers run Fedora and Red Hat), the idea of using a blog hosting service all of a sudden seemed much more reasonable.
So as of this post, the BioWi[sz]e blog is officially migrated to WordPress.com. Unfortunately, someone got the http://biowise.wordpress.com subdomain less than a year ago—they even spent $25 to reserve a .me domain, and yet they’re doing nothing with it. Grrr… So anyway, the BioWise you know and love is now BioWize, for better and for worse.
As for the supplementary graphics and data files, I have followed Stephen Turner’s example and posted everything on FigShare. While uploading data files and providing relevant metadata was very straightforward, there is a bit of a learning curve when it comes to organizing and grouping related files. And once data is published publicly on FigShare, deleting it is not an option, even if you’re just trying to clean things up and fix mistakes. So if I could have done one thing differently, I would have been more careful about how I uploaded and grouped the files. Otherwise, I have no complaints. I love the idea that the content of my blog will be accessible long after I’ve moved on from my current institution (without any additional work on my part), and that all of the supporting data sets are permanently accessible, each with its own DOI.
I recently came across a really cool platform for learning and teaching bioinformatics. It is called Rosalind (named after Rosalind Franklin) and lives at http://rosalind.info/. Throughout my learning career, I’ve learned and/or taught myself many things using bioinformatics-themed tutorials, blog posts, and Q&A forums. I have to say, this new platform looks very exciting in terms of its potential impact.
The following is from Rosalind’s “About” page.
Learning bioinformatics usually requires solving computational problems of varying difficulty that are extracted from real challenges of molecular biology.
To make learning bioinformatics fun and easy, we have founded Rosalind, a platform for learning bioinformatics through problem solving.
Rosalind offers an array of intellectually stimulating problems that grow in biological and computational complexity; each problem is checked automatically, so that the only resource required to learn bioinformatics is an internet connection.
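To give a flavor of the exercises (assuming the problem set still resembles what I’ve seen), an early Rosalind problem asks you to count the occurrences of each nucleotide in a DNA string. A solution can be just a few lines:

```python
def count_nucleotides(dna):
    """Return the counts of A, C, G, and T in a DNA string."""
    return {base: dna.count(base) for base in "ACGT"}

# Example DNA string (made up for illustration)
print(count_nucleotides("AGCTTTTCATTCTGACTGCA"))
# → {'A': 4, 'C': 5, 'G': 3, 'T': 8}
```

Rosalind’s automated checker verifies answers like this one, so a learner gets immediate feedback without an instructor in the loop.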
The creators of Rosalind are to be applauded for investing the time and resources required to design and produce learning modules, as well as implement the backend system required to provide automated assessment and feedback. But where Rosalind really shines is the integration of a variety of achievements, as well as tools for instructors.
Not too long ago, you would have expected to hear terms such as “levels”, “experience points”, “achievements”, and “badges” in reference to video games and not a high school- or university-level course. However, in the last few years various types of flair have become very popular on many social platforms. For example, when I first started using the StackExchange Q&A network for questions related to my research, I saw reputation and badges primarily as incentives for continued participation, and as such I considered them somewhat childish and a distraction from the real purpose of the site. But the longer I’ve used StackExchange, the more I’ve come to appreciate the value of these “achievements” as indicators of real experience.
Rosalind allows anybody anywhere to create an account and submit answers for feedback and achievements. However, Rosalind also has a feature that allows any user to set up a “class”. Users can select modules to include or exclude, optionally set start and end dates for the class, and provide enrollment links to students/participants, and Rosalind provides a gradebook to monitor student/participant progress. These tools should eliminate most barriers an instructor would encounter in creating a course (or courses) based on Rosalind resources or integrating Rosalind resources into existing courses.
Achievements awarded by Rosalind (and indeed by other similar social platforms) have a huge potential to carry real, formal academic value in the not-too-distant future. I won’t pretend that a university can or should offer credit without taking measures to ensure students aren’t abusing the system, but institutions that can find innovative ways to leverage open learning tools will have a crucial impact on the future of education in the next few years.
Last week I stumbled upon a really cool application called ascii.io. It provides a nice mechanism for capturing a terminal session and then playing it back later. It is implemented as a Python script which seems to be little more than a glorified keylogger: it records what you type (as well as the timing of each keystroke) and what is printed out to the terminal. When you finish recording your session, all of this logged information is pushed to the ascii.io server, and a unique ID and URL are created for the session just recorded.
Here’s a cute little sample I just created.
So how does this relate to biology? It’s no secret that high-throughput sequencing and other “big data” technologies are changing the way biology is done. Many of the best tools for managing and analyzing biological data are available only through the command line. Familiarity with the command line and the ability to process and analyze data using command-line tools is a skill set that is increasing in demand every day, as evidenced by the number and frequency of introductory bioinformatics seminars/tutorials/workshops coming to a campus near you.
The command line is still a mysterious place for many a biologist. Static text or graphics on a web page can only go so far in explaining what the command line is and how it is used in biology and bioinformatics. A tool like ascii.io can make a boring tutorial come alive, making it much clearer what the user types versus what a command prints to the terminal. The benefit of ascii.io over, say, a more traditional screen video capture program, is that storing the text associated with a terminal session is much less bulky than storing video data, especially at the resolution required to make a terminal session legible in a Flash player.
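The core idea is simple enough to sketch in a few lines of Python (a toy illustration, not ascii.io’s actual format): a recorded session is just a list of (delay, text) events, which is tiny compared to frames of video.

```python
import sys
import time

# A recorded session is just timestamped text: (seconds to wait, text to emit).
# These events are fabricated for illustration.
session = [
    (0.0, "$ ls data/\n"),
    (0.5, "reads.fastq  genome.fa\n"),
    (1.0, "$ grep -c '>' data/genome.fa\n"),
    (0.5, "12\n"),
]

def replay(events, out=sys.stdout, speed=1.0):
    """Play back a recorded session by sleeping between events."""
    for delay, text in events:
        time.sleep(delay / speed)
        out.write(text)

replay(session, speed=100.0)  # sped up 100x for demonstration
```

A few dozen bytes per event versus megabytes of screen capture: that’s the whole argument for text-based recording in a nutshell.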
The ascii.io app is free, open source, lightweight, and a balanced part of your next bioinformatics seminar/tutorial/workshop!
Advances in technology are constantly and rapidly changing the way we do science. It has never been easier to analyze huge amounts of data, to distribute those data to anywhere in the world, or to maintain long-distance collaborations. Services like Google, Wikipedia, and PubMed literally put the world’s collective knowledge at our fingertips. The ability to leverage these resources effectively will be an increasingly important skill in the rising generation of scientists. The skill of locating, filtering, and synthesizing information has always been more important than the ability to memorize information and repeat procedures. But as we see further advances in technology, the success of a scientist is going to depend more and more on his or her technical aptitude for finding, filtering, and integrating relevant information.
Throughout my graduate studies, I have found certain online communities to be extremely helpful in my research. Before I was a graduate student, I found that many of my Google searches for bioinformatics problems led me to a Q&A site called BioStar. The quality of the answers on this site seemed consistently superior to other Google results that I found (blogs, wikis, etc). Eventually, I got brave enough to ask a question of my own on the site and was very impressed with the quick response. Since then, I have used the site extensively in my research, asking questions when I am stuck and contributing answers as my time and expertise allows.
BioStar was originally part of the StackExchange network, which in the last few years has grown into a large network of integrated Q&A sites. Not too long after joining BioStar, I joined StackExchange’s flagship site StackOverflow, which focuses on more general programming questions and has the same benefits of high-quality answers and quick responses. Since that time, I have used over a dozen StackExchange sites to ask questions relevant to my research, taking advantage of sites dedicated to everything from biology to computer science to statistics to LaTeX. I don’t have time to actively and consistently participate on all of these sites, but they are very useful when questions do come up, and I do occasionally find time to contribute answers to other users’ questions.
Getting the most out of an online scientific community does require a bit of tact. Adapting one’s communication skills for online forums can take a bit of work, but the payoff is well worth it. A well-formulated question posed to the right group can be an excellent research resource, not only for you but for many others who may later have similar questions.
A recent PLoS paper discusses the benefits of online scientific communities in depth, and provides a short list of “rules” for getting the most out of these communities. I would definitely recommend that anyone in biology or bioinformatics take a few minutes to read it, think about it, and explore some of the communities I just discussed.
Bioinformatics and computational biology are very appealing to a certain type of scientist. Unanswered questions in the life sciences are some of the most interesting and important in the world, and bioinformatics offers a novel approach to answering these questions. This novel approach requires a firm grounding in multiple disciplines, ranging from molecular biology to computer science to statistics to genetics to engineering and everything in between.
Because bioinformatics is such an interdisciplinary field, I often find myself wishing I had more time to learn more things so that I could be more of an “expert” in more areas related to bioinformatics. I do a lot of programming, but I wish I had the time to truly develop the skills of a professional software engineer. I use statistics frequently in my research, but I often lack intuition when it comes to solving even basic problems of probability. Most of my research is related to genetic and genomic sequence data, but my bench skills are so rusty that I wouldn’t trust myself to handle expensive sequencing reagents until I’ve had a few months’ refresher in the lab. The problem is that I don’t have enough time to do a PhD in 3 (or 7!) different traditional scientific disciplines.
I came across this article in PLoS the other day that I thought really addressed this issue well. The author, like me, has a skill set distributed across several “traditional” scientific disciplines. He is an “expert” in his area of research, but his area of research does not fall nicely into any one category. He suggests that just because someone is not an “expert” in any one traditional discipline does not mean they cannot make useful individual contributions to science. In fact, they can make unique individual contributions that otherwise would require the collaboration of “experts.” It’s definitely a good read and gave me some confidence as I press forward in my interdisciplinary (or should I say antedisciplinary) graduate studies.