# My comments on “software solutions for big biology” paper

A paper by Philip Bourne (NIH Associate Director for Data Science) and colleagues came out a couple of weeks ago highlighting some of the issues with software in biology research. The focus of the paper is the scalability of “big data” type solutions, but most of the paper is relevant to all bioinformatics software. Speaking of big data, this Twitter post gave me a good chortle recently.

But I digress…

I really just wanted to highlight and comment on a couple of points from the paper.

• Biologists are not trained software developers: The paper makes the point that the majority of biologists have zero training in software engineering best practices, and as a result there is a pervasive culture of poorly-designed short-lived “disposable” research software out in the wild. I agree completely with their assessment (in fact I blew a gasket over this as a new Ph.D. student) and think all biologists could benefit from some minimal training in software engineering best practices. However, I think it’s important to emphasize that biologists do not need to become software engineers to write good reproducible research software. In fact, I speak from experience when I say you can spend a lot of time worrying about how software is engineered at the expense of actually using the software to do science. We need to make it clear that nobody expects biologists to become pro programmers, but that investing in some software training early on can yield huge dividends throughout a scientific career.
• Attribution for bioinformatics software is problematic: The paper emphasizes that papers in “high-impact” journals, even those with a strong bioinformatics component, rarely feature a bioinformatician as first or last author. I get the impression that things are improving ever-so-slowly, but some fairly recent comments from E. O. Wilson make it pretty clear that as a community we still have a long way to go (granted Wilson was talking about stats/math, but his sentiment applies to informatics and software as well).
• Bioinformatics is a scientific discipline in its own right: and bioinformaticians need career development. ‘Nuff said.
• Assessment of contribution: One of the final points they make in the paper is that with distributed version control tools and social coding platforms like GitHub, every (substantial) software version can be assigned a citable DOI and relative author contributions can be assessed by looking at the software’s revision history. I am a huge proponent of version control, but this last point about looking at git logs for author contributions doesn’t strike me as very helpful. It may be better than the obligatory vague “Author Contributions” section of the 5+ papers I read this week (A.B., C.D., and E.F. designed the research. A.B. and G.H. performed the research. A.B., E.F., and G.H. wrote the paper.), but only marginally better. Number of revisions committed, number of lines added/removed, and most other metrics easily tracked on GitHub are pretty poor indicators of technical AND intellectual contribution to software development. I think we would be much better off enforcing clearer guidelines for explicitly stating the contributions made by each author.

Overall, I think it was a good piece, and I hope it represents a long-awaited change in the academic community with respect to bioinformatics software.

# Great discussion on research software: linkfest

I’ve been following a weeks-long (months-long?) social media discussion on research software that has been very thought-provoking. The questions being discussed include the following.

• What should we expect/demand of software to be “published” (in the academic sense)?
• What should community standards of replicability/reproducibility be?
• Both quick n’ dirty prototypes and robust, well-tested platforms are beneficial to scientific research. How do we balance the need for both? What should our expectations be for research software that falls into various slots along that continuum?

I’ve been hoping to weigh in on the discussion with my own two cents, but I keep on finding more and more great reading on the topic, both from Twitter and from blogs. So rather than writing (and finish formulating!) my opinions on the topic(s), I think I’ll punt and just share some of the highlights from my readings. Linkfest below.

# GitHub now renders IPython/Jupyter Notebooks!

I’ve written before about literate programming and how I think this could be a big game changer for transparency and reproducibility in science, especially when it comes to data analysis (vs more traditional software engineering). Well, GitHub announced recently that IPython/Jupyter notebooks stored in GitHub repositories will be rendered, rather than presented as raw JSON text as before. This is a very nice development, making it even easier to share data analysis results with others!

# Simulating BS-seq experiments with Sherman

The bioinformatics arm of the Babraham Institute produces quite a few software packages, including some like FastQC that have achieved nearly ubiquitous adoption in the academic community. Other offerings from Babraham include Bismark for methylation profiling and Sherman for simulating NGS reads from bisulfite-treated DNA.

In my computational genome science class, there was a student that wanted to measure the accuracy of methylation profiling workflows, and he identified Bismark and Sherman as his tools of choice. He would identify a sequence of interest, use Sherman to simulate reads from that sequence, run the methylation call procedure, and then assess the accuracy of the methylation calls.

There was a slight problem, though: Sherman is random in its conversion of Cs to Ts, and the number of conversions can only be controlled by things like error rate and conversion rate. By default, Sherman provides no way to “protect” a C from conversion the way a methyl group does on actual genomic DNA. So there was no way to assess the accuracy of the methylation profiling workflow since we had no way of indicating/knowing which Cs should be called methylated!

After a bit of chatting, however, we came up with a workaround. In our genomic DNA fasta file, any Cs we want to protect from conversion (i.e. methylate them in silico) we simply convert to X. Then we run Sherman, which will convert Cs to Ts at the specified conversion rate but will leave Xs alone. Then, after simulating the reads but before the methylation analysis procedure, we simply change Xs back to Cs. This seemed to pass some sanity tests for us, and we contacted the author of Sherman, Felix Krueger (@FelixNmnKrueger), who confirmed that he saw no potential pitfalls with the approach.

I like this little hack, and assuming I ever use it myself in the future, I will probably create a script that does the C -> X conversion from a GFF3 file or a list of “methylation sites” in some other format (the conversion from X back to C is trivial).
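Here is a minimal sketch of what such a script might look like. It is not part of Sherman or Bismark. The function names `mask_sites` and `restore_reads` are made up for illustration, the methylation sites are assumed to arrive as a plain list of 0-based positions (parsing GFF3 is left out), and the simulated reads are assumed to be in FASTQ format with exactly four lines per record.

```python
def mask_sites(sequence, sites):
    """Return `sequence` with the C at each 0-based position in `sites` replaced by X.

    Hypothetical helper: raises an error if a listed position does not hold a C,
    since only cytosines can be "methylated" in silico this way.
    """
    bases = list(sequence)
    for pos in sites:
        if bases[pos] not in ("C", "c"):
            raise ValueError(f"position {pos} holds {bases[pos]!r}, not C")
        bases[pos] = "X"
    return "".join(bases)


def restore_reads(fastq_lines):
    """Yield FASTQ lines with X changed back to C on sequence lines only.

    Assumes four-line FASTQ records, so the sequence is the second line of
    each record. Header, separator, and quality lines pass through untouched.
    """
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:
            yield line.replace("X", "C")
        else:
            yield line


# Protect the Cs at positions 1 and 4 before running the simulator.
masked = mask_sites("ACGTCCGA", [1, 4])
print(masked)

# After simulation, restore the protected Cs in a (made-up) read.
record = ["@read1", "AXGTXTGA", "+", "IIXIIIII"]
print(list(restore_reads(record)))
```

Restricting the substitution to sequence lines matters because `X` is also a legal quality-score character in FASTQ, so a blanket search-and-replace over the whole file could silently alter quality strings.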

# Reproducibility on a technical level vs a conceptual level

I am and have been a huge proponent of reproducibility in scientific computation for most of my training. I think that reproducibility is important both on a technical level and on a conceptual level. Reproducibility at a technical level (what many call replicability) is the ability to successfully generate a similar/identical output given the source code and data. Reproducibility at the conceptual level (what we often refer to as reproducibility in a more general sense) is being able to get a consistent result using different methods (e.g. re-implementing the logic in your own software), different data (e.g. doing your own data collection on the subject of interest), or both. The latter, reproducibility at the conceptual level, is the ultimate test, and gives much stronger support for a theory/model/result than replicability.

In some ways, reproducing a computational result in biology is much easier than reproducing an experimental result. Computers are just machines* that execute sets of instructions (*gross oversimplification noted), and assuming I provide the same instructions (source code) and input (data) to another scientist, it should be trivial for them to exactly and precisely reproduce a result I previously computed.

However, I speak of an ideal world. In the real world, there are myriad technical issues: different operating systems; different versions of programming languages and software libraries; different versions of scientific software packages; poor workflow documentation; limited computing literacy; and a variety of other issues that make it difficult to reproduce a computation. In this way, computational biology is a lot more like experimental biology than we let on sometimes: getting the “right” result depends highly on your environment or setup, your experience, and in some cases having some secret sauce (running commands/programs in the right order, using certain settings or parameter values, etc).

It doesn’t have to be this way. But it is.

There is a great discussion going on in the social media right now about what should be expected of academic scientists producing software in support of their research. I hope to contribute a dedicated blog post to this discussion soon, but it is still very nascent and I would like to let my thoughts marinate for a bit before I dive in. Several points are clear, however, that are directly relevant to the discussion of replicability and reproducibility.

• Replicability in computation supports reproducibility. I’m not aware of anyone who disagrees with this. My impression is that most disagreements are focused on what can reasonably be expected of scientists given the current incentive structure.
• Being unable to replicate a computational study isn’t the end of the world for academic science: important models and theories shouldn’t rely on a single result anyway. But the lack of computational replicability does make the social enterprise of academic science, already an expensive and inefficient ordeal under constant critical scrutiny, even more expensive and inefficient. Facilitating replicability in computation would substantially lower the activation energy required to achieve the kind of real reproducibility that matters in the long run.
• There are many academics in the life sciences at all levels (undergrad to grad student to postdoc to PI) that are dealing with more data and computation right now than their training has ever prepared them to deal with.
• A little bit of training can go a long way to facilitating more computational replicability in academic science. Check out what Software Carpentry is doing. Training like this may not benefit cases in which scientists deliberately obfuscate their methods to cover their tracks or to maintain competitive advantage, but it will definitely benefit cases in which a result is difficult to replicate simply because a well-intentioned scientist had little exposure to computing best practices. (Full disclosure: I am a certified instructor for Software Carpentry, although I receive no compensation for my affiliation or participation. I’m just a huge proponent of what they do!)

# Be back shortly!

I’m sure the entirety of the interwebz has been sitting on pins and needles wondering when oh when will the BioWize blog be updated! It’s been over a year since I’ve posted an update. This has been a crazy, hectic, exciting, frustrating, and fulfilling year. I have jotted down ideas for a dozen blog posts, and even started drafting a couple, but with stagnant progress on a research project and pressure to make progress with my dissertation I just haven’t figured out how blogging should (if at all) fit into all this. But I know I need to write more often. I’ve found that writing skills are like a muscle: they need frequent conditioning and exercise or they will atrophy. Whether I’m writing about a technical skill that saves me lots of time in my research, or about something more conceptual, taking the time to write out my thoughts for a general scientific audience is a great exercise in clarifying and condensing my thinking. Communicating your thinking (and supporting evidence) to other scientists is the whole purpose of scientific publishing, after all.

So I expect there to be a flurry of activity on the blog within a short time.

# Brainstorm: motivating student participation in my Computational Genome Science class

A few weeks ago I finished teaching a course on computational genome science. I was involved in designing the course back in 2011, and helped teach the initial offering, but this year I was the primary instructor. The class turned out well (in my opinion), and the student feedback was overwhelmingly positive. However, it didn’t go perfectly, and the last couple of weeks have given me the opportunity to reflect and brainstorm ideas for improving the next offering of the class.

This class is a very hands-on class—we cover the very basics of the theory, but spend most of our time running software tools and critically evaluating their results. Students submit assignments as entries in a class wiki, including notes of what they did, results they got, and interpretation / analysis. Using a wiki not only facilitates my monitoring of student participation, but (more importantly in the long run) it encourages students to develop documentation/note-taking skills that will be a huge benefit to them in the future.

At the beginning of most class periods, I set aside a few minutes for the students to work on their wiki entries. I used a Python script to randomly group the students into pairs, and instructed each pair to edit each other’s wiki entries: inserting notes or questions when something wasn’t clear, making minor stylistic improvements, etc. Then after this short period, another Python script randomly selected one of the students to come up and share what their partner had done and the results they had gotten. The hope was that these activities would motivate the students to complete their assignments on time, and that they would actually critically evaluate each other’s work (and hopefully even learn something new in the process!).
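For the curious, the pairing and selection steps might be sketched as follows. This is a reconstruction under my own assumptions rather than the actual scripts used in class: the function names and the roster are hypothetical, and the choice to fold an odd student out into a group of three is just one reasonable way to handle an odd head count.

```python
import random


def pair_students(students, rng=random):
    """Shuffle the roster and split it into pairs.

    If the class has an odd number of students, the leftover student joins
    the last pair to form a group of three (an assumption, not a rule from
    the course).
    """
    shuffled = list(students)
    rng.shuffle(shuffled)
    groups = [shuffled[i:i + 2] for i in range(0, len(shuffled), 2)]
    if len(groups) > 1 and len(groups[-1]) == 1:
        leftover = groups.pop()
        groups[-1].extend(leftover)
    return groups


def pick_presenter(groups, rng=random):
    """Pick one student at random and return (presenter, partners)."""
    group = rng.choice(groups)
    presenter = rng.choice(group)
    partners = [student for student in group if student != presenter]
    return presenter, partners


# Hypothetical roster; a seeded generator makes a given day's draw repeatable.
roster = ["Ana", "Ben", "Chen", "Dee", "Eli"]
rng = random.Random(2014)
groups = pair_students(roster, rng)
presenter, partners = pick_presenter(groups, rng)
print(groups)
print(f"{presenter} presents the work of {', '.join(partners)}")
```

Passing in a seeded `random.Random` instance is optional. Without it, the functions fall back to the module-level generator and produce a fresh draw every class period.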

The biggest problem I had with this approach is that the course was offered as a block course—that is, 3 credits and a full semester’s worth of work packed into 1.5 credits and 8 weeks. We were going at such a pace that the students often hadn’t had time to complete their assignments by the time they were scheduled to be editing each other’s wiki entries. Thankfully, future offerings of the course will be 3 credit semester-long ordeals, giving us more time to cover the same materials in more depth without being so rushed.

However, this was not the only problem. I got the impression that the students were not taking the opportunity to evaluate and present each other’s work seriously. Of course these students were busy, and as might be expected they need proper motivation to engage in these types of activities. After the first few class periods, it seemed that the threat of looking unprepared in front of the other students was not sufficient motivation to write complete, well documented wiki entries and/or to critically assess each other’s entries.

Here are a few ideas I’ve come up with to improve this situation for the next offering of this class.

• For most of the (half-)semester I randomly chose a single student to come up and present their partner’s work. Near the end of the term, I decided to randomly select pairs instead. Each student would still present what their partner had done, but their partner would be there to clarify any misunderstandings and provide support. This seemed to work much better, and is going to be my approach from day one next time.
• I provided verbal feedback for beginning-of-class presentations, and occasional verbal feedback for wiki entries, but as the majority of their grades were derived from a term project I provided no formal assessment throughout the term. I still like the idea of postponing formal assessment until the end of the semester to provide students ample time to polish up their wiki entries, but students need more than just verbal feedback in the interim. Next time I think I’ll have students grade each other’s beginning-of-class presentations (my own personal evaluation will be factored in as well). As was the intent before, all students will be motivated to be prepared for class (since they don’t know beforehand who will present), and they will be motivated to take advantage of the first few minutes of class to review each other’s entries (giving more than just a superficial glance). Peer evaluations will be included in each student’s final grade, and students will also get credit for providing evaluations. Hopefully this will provide motivation for the students to engage in each aspect of the group experience.
• As much as I hate enterprise Learning Management Systems, I’ll probably end up having students post peer evaluations to the university’s LMS. I’ll make the evaluation an assignment, and only make it accessible in the LMS for a very short period of time during class. Also, a keyword will be associated with each peer evaluation, so that students who are not present in class cannot get credit just by signing in at the appropriate time and entering arbitrary values (barring bold and coordinated dishonesty). If a student is absent when he/she is selected to present, their peers will be instructed to give them a 0.
• I understand that even with the best intentions, students cannot make it to every class period. However, I don’t want to be in a position to judge whether a certain absence was “excused” and make manual adjustments to participation- and peer-evaluation-based grades. Clearly, a parent’s funeral is a satisfactory excuse and a Justin Bieber concert the night before is not, but there is a lot of gray area in between. Rather than handling these case-by-case, I will allow students two absences without impacting their grade: 2 missed peer evaluations, 2 missed presentations, or 1 of each. They can use these absences however they please, but there will be no exceptions beyond that, so they will be encouraged to use them wisely.

I’m really looking forward to teaching this class again, and I hope these ideas will make it an even better learning experience for everyone next time around!

# Making LaTeX easier with Authorea and writeLaTeX

When I’m curious about exploring a new technical skill (such as a new programming language, a software tool, a development framework, etc), I typically try to integrate its use into my normal work schedule. I select several tasks that I have to do anyway, and force myself to use this new skill to solve the task. It ends up taking more time than it would have if I had just stuck with skills I was already comfortable with, but in the end I’m usually better for it. Sometimes, I love my new-found skill so much that I begin using it every day in my research. Often, however, it just becomes another addition to my technical “toolkit”, increasing my productivity and really enabling me to choose The Best Tool for the Job for my future work.

This was my experience with $\LaTeX$. As an undergraduate, I had seen several colleagues use it and had fiddled with it a bit myself. It wasn’t until later though, as a first year grad student, that I really buckled down and forced myself to learn it while writing a paper for a computational statistics class. Yielding control of text and image placement to the LaTeX typesetting system took some getting used to, but I soon came to appreciate the quality of documents I could produce with it. Focusing on the concerns of content and presentation separately, as I had previously learned to do in web development, was another big bonus I recognized early on. The fact that LaTeX source documents are plain text made it easy to maintain a revision history with tools like svn and git, which I had also come to appreciate early on in my graduate career. And, of course, there is absolutely no comparison between typesetting mathematical formulae in LaTeX and in Microsoft Word. See this thread for a great discussion on the benefits of LaTeX over Word.

I strongly encourage all of my colleagues to consider using LaTeX for their next publication. That being said, I understand that there is a bit of a learning curve with LaTeX, and setup/installation isn’t trivial for a beginner (unless you’re running Linux). However, I’ve seen a couple of web applications recently that should make the jump from Word to LaTeX much easier. Authorea and writeLaTeX are both web-based systems for authoring documents using LaTeX markup. While editing, Authorea renders the markup in HTML and only shows plain text for the section you are currently editing (of course, the final document is downloaded in PDF format). writeLaTeX uses a different approach: a window for editing the LaTeX markup, and another window for previewing the typeset PDF file.

Both of these applications are very easy to use. Both enable you to collaboratively edit with colleagues. And both are free to use. If you’re still using Microsoft Word to write your research manuscripts, consider learning LaTeX and getting your feet wet with one of these new tools!

# Frustration with Word and Endnote on Mac

Recently, I’ve been using Microsoft Word and EndNote to write a significant paper for the first time in several years (my advisor and I used LaTeX + git for my first first-author paper). After using them on my MacBook for several weeks with no more than the usual amount of frustration one can expect from EndNote and Word, EndNote stopped working all of a sudden. Every time I tried to insert a reference, it would get frozen at the “Formatting Bibliography” step and hang indefinitely. Force-quitting and restarting the programs didn’t seem to help anything.

After a bit of searching, I came across this thread which provides a simple solution. The culprit for the unstable behavior seems to be an OS X system process called appleeventsd, and force quitting the process with this name using the Activity Monitor restored normal Word/EndNote behavior. I have done this several times in the last couple of weeks and haven’t seen any adverse side effects, so I will continue to do so until something goes wrong, or some OS X 10.8 upgrade provides better stability, or my collaborators magically decide that LaTeX + git + BitBucket is in fact a superior solution after all!

# A new home for BioWi[sz]e (and why you shouldn’t post code on your lab website)

A recent post by Stephen Turner about the woes of posting code on your lab website really resonated with me. As a scientist I have occasionally clicked on a link or copy&pasted a URL from a paper, only to find that the web address I’m looking for no longer exists. Sure it’s frustrating in the short term, but in the long term it’s troubling to think that so much of the collective scientific output has such a short digital shelf life.

This happened to me again just yesterday. I was looking over this paper on algorithms for composition-based segmentation of DNA sequences, and I was interested in running one of the algorithms. The code, implemented in Matlab (good grief), is available (you guessed it!) from their lab website: http://nsm.uh.edu/~dgraur/eran/simulation/main.htm. Following that link takes you to a page with a warning that the lab website has moved, and if you follow the link on that page you end up on the splash page for some department or institute that has no idea how to handle the redirect request. This paper is from 2010, and yet I can’t access supplements from their lab website! I did a bit of Google searching and found the first author’s current website, which even included links to the software mentioned in the paper, but unfortunately the links to the code point to the (now defunct) server published in the paper. I finally found the code buried in a Google Code project, and now I’m sitting here wondering whether it was really worth all the hassle in the first place, and whether I even want to check if our institution has a site license for Matlab…

Ok, `</rant>...`

With regard to my own research, I’ve been using hosting services like SourceForge, GitHub, and BitBucket to host my source code for years. However, I’ve continued using our lab server to host this blog, along with all the supplementary graphics and data that go along with it. I guess I initially enjoyed the amount of control I had. But after reading Stephen’s post, realizing how big of a problem this is in general, and of course thinking of all of the fricking SELinux crap I’ve had to put up with (our lab servers run Fedora and Red Hat), the idea of using a blog hosting service all of a sudden seemed much more reasonable.

So as of this post, the BioWi[sz]e blog is officially migrated to WordPress.com. Unfortunately, someone got the http://biowise.wordpress.com subdomain less than a year ago, and they even spent the \$25 to reserve a `.me` domain, and yet they’re doing nothing with it. Grrr… So anyway, the BioWise you know and love is now BioWize, for better and for worse.

As far as the supplementary graphics and data files, I have followed Stephen Turner’s example and posted everything on FigShare. While uploading data files and providing relevant metadata was very straightforward, there is a bit of a learning curve when it comes to organizing and grouping related files. And once data is publicly published on FigShare, deleting it is not an option, even if you’re just trying to clean things up and fix mistakes. So if I could have done one thing differently, I would have been more careful about how I uploaded and grouped the files. Otherwise, I have no complaints. I love the idea that the content of my blog will be accessible long after I’ve moved on from my current institution (without any additional work on my part), and that all of the supporting data sets are permanently accessible, each with its own DOI.