My comments on “software solutions for big biology” paper

A paper by Philip Bourne (NIH Associate Director for Data Science) and colleagues came out a couple of weeks ago highlighting some of the issues with software in biology research. The focus of the paper is the scalability of “big data” type solutions, but most of the paper is relevant to all bioinformatics software. Speaking of big data, this Twitter post gave me a good chortle recently.

But I digress…

I really just wanted to highlight and comment on a couple of points from the paper.

  • Biologists are not trained software developers: The paper makes the point that the majority of biologists have zero training in software engineering best practices, and as a result there is a pervasive culture of poorly-designed short-lived “disposable” research software out in the wild. I agree completely with their assessment (in fact I blew a gasket over this as a new Ph.D. student) and think all biologists could benefit from some minimal training in software engineering best practices. However, I think it’s important to emphasize that biologists do not need to become software engineers to write good reproducible research software. In fact, I speak from experience when I say you can spend a lot of time worrying about how software is engineered at the expense of actually using the software to do science. We need to make it clear that nobody expects biologists to become pro programmers, but that investing in some software training early on can yield huge dividends throughout a scientific career.
  • Attribution for bioinformatics software is problematic: The paper emphasizes that papers in “high-impact” journals, even those with a strong bioinformatics component, rarely feature a bioinformatician as first or last author. I get the impression that things are improving ever-so-slowly, but some fairly recent comments from E. O. Wilson make it pretty clear that as a community we still have a long way to go (granted Wilson was talking about stats/math, but his sentiment applies to informatics and software as well).
  • Bioinformatics is a scientific discipline in its own right: and bioinformaticians need career development. ‘Nuff said.
  • Assessment of contribution: One of the final points they make in the paper is that with distributed version control tools and social coding platforms like GitHub, every (substantial) software version can be assigned a citable DOI and relative author contributions can be assessed by looking at the software’s revision history. I am a huge proponent of version control, but this last point about looking at git logs for author contributions doesn’t strike me as very helpful. It may be better than the obligatory vague “Author Contributions” section of the 5+ papers I read this week (A.B., C.D., and E.F. designed the research. A.B. and G.H. performed the research. A.B., E.F., and G.H. wrote the paper.), but only marginally better. Number of revisions committed, number of lines added/removed, and most other metrics easily tracked on GitHub are pretty poor indicators of technical AND intellectual contribution to software development. I think we would be much better off enforcing clearer guidelines for explicitly stating the contributions made by each author.
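To make that last point a bit more concrete, here is a minimal sketch of the kind of tally a revision history allows. It assumes log text in the shape produced by `git log --numstat --pretty=format:@%an`; the function name, the sample log, and the author initials are all made up for illustration.

```python
from collections import defaultdict


def tally_contributions(log_text):
    """Count commits and lines added/removed per author.

    Assumes input shaped like the output of
    `git log --numstat --pretty=format:@%an`:
    an "@Author" line per commit, followed by
    tab-separated "added<TAB>removed<TAB>path" lines.
    """
    stats = defaultdict(lambda: {"commits": 0, "added": 0, "removed": 0})
    author = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Start of a new commit by this author.
            author = line[1:]
            stats[author]["commits"] += 1
            continue
        parts = line.split("\t")
        if author is None or len(parts) != 3:
            continue
        added, removed, _path = parts
        if added == "-" or removed == "-":
            # git reports "-" for binary files; skip them.
            continue
        stats[author]["added"] += int(added)
        stats[author]["removed"] += int(removed)
    return dict(stats)


# Hypothetical history: A.B. re-indents a file, G.H. fixes a subtle bug.
SAMPLE_LOG = (
    "@A.B.\n"
    "500\t500\taligner.py\n"
    "\n"
    "@G.H.\n"
    "3\t1\taligner.py\n"
)

result = tally_contributions(SAMPLE_LOG)
for name, counts in sorted(result.items()):
    print(name, counts)
```

In this invented history both authors have one commit, and by line count A.B. looks like the dominant contributor, even though the change that actually mattered scientifically could well be G.H.'s three-line fix. The numbers are easy to compute, which is exactly why they are tempting and why they say so little.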

Overall, I think it was a good piece, and I hope it represents a long-awaited change in the academic community with respect to bioinformatics software.

