I’ve been following a weeks-long (or months-long?) social media discussion on research software that has been very thought-provoking. The questions being discussed include the following.
- What should we expect/demand of software to be “published” (in the academic sense)?
- What should community standards of replicability/reproducibility be?
- Both quick n’ dirty prototypes and robust, well-tested platforms are beneficial to scientific research. How do we balance the need for both? What should our expectations be for research software that falls into various slots along that continuum?
I’ve been hoping to weigh in on the discussion with my own two cents, but I keep finding more and more great reading on the topic, both from Twitter and from blogs. So rather than writing up (and finishing formulating!) my opinions on the topic(s), I think I’ll punt and just share some of the highlights from my reading. Linkfest below.
When I’m curious about exploring a new technical skill (such as a new programming language, a software tool, a development framework, etc), I typically try to integrate its use into my normal work schedule. I select several tasks that I have to do anyway, and force myself to use this new skill to solve the task. It ends up taking more time than it would have if I had just stuck with skills I was already comfortable with, but in the end I’m usually better for it. Sometimes, I love my new-found skill so much that I begin using it every day in my research. Often, however, it just becomes another addition to my technical “toolkit”, increasing my productivity and really enabling me to choose The Best Tool for the Job for my future work.
This was my experience with LaTeX. As an undergraduate, I had seen several colleagues use it and had fiddled with it a bit myself. It wasn’t until later though, as a first-year grad student, that I really buckled down and forced myself to learn it while writing a paper for a computational statistics class. Yielding control of text and image placement to the LaTeX typesetting system took some getting used to, but I soon came to appreciate the quality of documents I could produce with it. Focusing on the concerns of content and presentation separately, as I had previously learned to do in web development, was another big bonus I recognized early on. The fact that LaTeX source documents are plain text made it easy to maintain a revision history with tools like svn and git, which I had also come to appreciate early in my graduate career. And, of course, there is absolutely no comparison between typesetting mathematical formulae in LaTeX and in Microsoft Word. See this thread for a great discussion of the benefits of LaTeX over Word.
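As a small taste of what this buys you, here is a minimal LaTeX document (a toy example of my own, not from that class paper) that typesets the quadratic formula from a single line of plain-text markup:

```latex
\documentclass{article}
\begin{document}
Solving $ax^2 + bx + c = 0$ gives
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\]
\end{document}
```

Because the source is just text like this, version control tools see meaningful line-by-line changes rather than an opaque binary blob.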
I strongly encourage all of my colleagues to consider using LaTeX for their next publication. That being said, I understand that there is a bit of a learning curve with LaTeX, and setup/installation isn’t trivial for a beginner (unless you’re running Linux). However, I’ve seen a couple of web applications recently that should make the jump from Word to LaTeX much easier. Authorea and writeLaTeX are both web-based systems for authoring documents using LaTeX markup. While editing, Authorea renders the markup in HTML and only shows plain text for the section you are currently editing (of course, the final document is downloaded in PDF format). writeLaTeX uses a different approach: a window for editing the LaTeX markup, and another window for previewing the typeset PDF file.
Both of these applications are very easy to use. Both enable you to collaboratively edit with colleagues. And both are free to use. If you’re still using Microsoft Word to write your research manuscripts, consider learning LaTeX and getting your feet wet with one of these new tools!
Recently, I’ve been using Microsoft Word and EndNote to write a significant paper for the first time in several years (my advisor and I used LaTeX + git for my first first-author paper). After using the combination on my MacBook for several weeks with no more than the usual amount of frustration one can expect from EndNote and Word, EndNote suddenly stopped working. Every time I tried to insert a reference, it would freeze at the “Formatting Bibliography” step and hang indefinitely. Force-quitting and restarting the programs didn’t help.
After a bit of searching, I came across this thread, which provides a simple solution. The culprit for the unstable behavior seems to be an OS X system process called appleeventsd, and force-quitting the process with this name using Activity Monitor restored normal Word/EndNote behavior. I have done this several times in the last couple of weeks and haven’t seen any adverse side effects, so I will continue to do so until something goes wrong, some OS X 10.8 update provides better stability, or my collaborators magically decide that LaTeX + git + BitBucket is in fact a superior solution after all!
A recent post by Stephen Turner about the woes of posting code on your lab website really resonated with me. As a scientist, I have occasionally clicked on a link or copied and pasted a URL from a paper, only to find that the web address I’m looking for no longer exists. Sure, it’s frustrating in the short term, but in the long term it’s troubling to think that so much of the collective scientific output has such a short digital shelf life.
This happened to me again just yesterday. I was looking over this paper on algorithms for composition-based segmentation of DNA sequences, and I was interested in running one of the algorithms. The code, implemented in Matlab (good grief), is available (you guessed it!) from their lab website: http://nsm.uh.edu/~dgraur/eran/simulation/main.htm. Following that link takes you to a page with a warning that the lab website has moved, and if you follow that link you end up on the splash page for some department or institute that has no idea how to handle the redirect request. This paper is from 2010, and yet I can’t access its supplements from the lab website! I did a bit of Google searching and found the first author’s current website, which even included links to the software mentioned in the paper; unfortunately, those links point to the (now defunct) server published in the paper. I finally found the code buried in a Google Code project, and now I’m sitting here wondering whether it was really worth all the hassle in the first place, and whether I even want to check if our institution has a site license for Matlab…
With regard to my own research, I’ve been using hosting services like SourceForge, GitHub, and BitBucket to host my source code for years. However, I’ve continued using our lab server to host this blog, along with all the supplementary graphics and data that go along with it. I guess I initially enjoyed the amount of control I had. But after reading Stephen’s post, realizing how big of a problem this is in general, and of course thinking of all of the fricking SELinux crap I’ve had to put up with (our lab servers run Fedora and Red Hat), the idea of using a blog hosting service all of a sudden seemed much more reasonable.
So as of this post, the BioWi[sz]e blog is officially migrated to WordPress.com. Unfortunately, someone got the http://biowise.wordpress.com subdomain less than a year ago—they even spent the $25 to reserve a .me domain, and yet they’re doing nothing with it. Grrr… So anyway, the BioWise you know and love is now BioWize, for better and for worse.
As for the supplementary graphics and data files, I have followed Stephen Turner’s example and posted everything on FigShare. While uploading data files and providing relevant metadata was very straightforward, there is a bit of a learning curve when it comes to organizing and grouping related files. And once data is publicly published on FigShare, deleting it is not an option, even if you’re just trying to clean things up and fix mistakes. So if I could have done one thing differently, I would have been more careful about how I uploaded and grouped the files. Otherwise, I have no complaints. I love the idea that the content of my blog will be accessible long after I’ve moved on from my current institution (without any additional work on my part), and that all of the supporting data sets are permanently accessible, each with its own DOI.
This last Friday I attended a Preparing Future Faculty seminar/conference held on campus. The event featured a couple of keynote speakers, along with several panels focused on such topics as career options, teaching strategies, and navigating the job market. The experience was no doubt helpful and informative, but also a bit scary seeing as the academic job market has become so competitive.
I’m using this post to record some of my personal notes, which may or may not be of use or interest to anyone else in the world.
It’s no secret that the supply of PhD graduates in the US far outstrips the demand for tenure-track faculty at research-intensive universities. The first panel (following the welcoming remarks) focused on different career options. For those dedicated to securing tenured faculty status at a research university, the recommendations focused very much on communicating and networking—it’s important to be able to share what it is you do with others, and you need to seek out opportunities to do so, early and often. You need to prepare yourself for both formal and informal opportunities to share your interests and experience with others. However, you should not memorize your remarks, as this will make it harder for your enthusiasm and passion for your work to come through.
Regarding postdoctoral fellowships in the sciences, the recommendation was to not view this as a no-man’s-land between graduate school and faculty status, but as an opportunity to gain valuable experience as a researcher where you dictate the balance between research, teaching, service, and administration (and frequently that balance is almost exclusively research). Perhaps the biggest challenge at this stage is to demonstrate your ability to conceive and deliver on research ideas independent of other scientists. They also focused on the importance of contacts and the social aspects of securing the “right” positions: the typical student-advisor and postdoc-advisor relationship in the US lasts longer than the typical marriage in the US, so it’s an important choice. It’s important to do all you can to get your name and ideas out there and make the social connections necessary that will give you options and opportunities when it comes time to apply for postdoctoral or faculty positions.
The demands of being faculty at a teaching-focused university are different from those at a research-focused institution. The students expect a lot more in terms of interaction (which was described as both a challenge and a benefit), and of course the teaching load is very demanding. While there are opportunities to maintain a research lab at such schools, publications will be fewer as there is an increased interest in teaching undergrads how to do research.
The panel also included a faculty member from a university-sponsored applied research center. The focus of this center is engagement with local and regional government, business, and community, and bridging the gap between theory and existing practice in its particular field. The teaching and research opportunities for this kind of position are quite different from those of traditional faculty, and stable long-term funding seemed to be a moderate concern. However, for the right kind of person, the opportunity to apply research to the practical problems that families, businesses, and governments deal with every day can be very rewarding.
There seemed to be an agreement that there is a trend towards healthier institutional attitudes about the importance of teaching and service as opposed to just research. That being said, if you are tenure-track faculty at a Research I university then research is still paramount. These days, poor teaching (even with a great research record) can put your tenure at risk, but even stellar teaching is not enough to compensate for poor or mediocre research. As teaching is a tremendously demanding time commitment, it is important to get experience with this and to seek feedback and advice from others.
Much is expected of university faculty, in terms of teaching ability, administrative responsibility, participation in (meaningful) service, and research productivity. One speaker mentioned that the ability to gracefully flow between these different demands, much as a yoga instructor seamlessly moves through a series of poses, is crucial to success as faculty. Others mentioned that for much of their careers, they exhibited little grace balancing these demands, but with determination and a lot of hard work and sweat they managed. How to achieve this balance is a personal matter, but it’s important to consider how all of these things relate to what you consider to be your higher purpose as university faculty.
Finally, while the focus was on the balance of research, teaching, service, and administrative responsibilities, there was also a bit of discussion about personal life balance. As an academic, it is very easy to get sucked into the cycle of working every waking hour. One panelist, a social scientist, could even claim visits to the movie theater or playtime with her child as research time, since her research on gender roles and social interaction was always in the back of her mind. However, most will agree that taking time to be with friends and family and to pursue personal interests is important. Many of the panelists warned that it is never a good time to get married and never a good time to have a baby, but that this shouldn’t stop you from doing either. Family is important, and you can adapt and still be a successful scientist with a young family.
The Social Aspect
I’ve already mentioned this, but almost every single presenter and panelist emphasized the importance of social and interpersonal considerations in the transition from graduate school to postdoctoral training to career. Finding the right opportunities of course depends on the quality of your research, but (at least initially) is much more about the contacts you make and who you know. Attending conferences as often as funds allow, proactively seeking opportunities to share your research with others, and making contact with remote colleagues to discuss challenges with your research are all ways to make contacts that can benefit your research in the short term and could lead to productive partnerships, mentorships, or connections in the long term.
I’ve followed several discussions (spats?) on Twitter recently regarding what quality standards should apply to bioinformatics code (this post and this post provide a decent summary). Relevant concerns include why scientists do not more frequently open-source their code, whether they should, whether there is funding to do so, whether there should be funding to do so, whether doing so is necessary to replicate a computational method, and whether code necessarily needs to be of distribution quality for release.
This is a complicated issue and I definitely don’t have all the answers. However, I want to push back on the claim that complex system requirements (imported modules/libraries, system tweaks, etc.) necessarily make installing and using scientific software (for review) prohibitively difficult. Admittedly, for some research code I can see how it could be onerous for a scientist to meticulously describe how to set up a system to be compatible with the software, and equally onerous for a user or reviewer to troubleshoot that procedure. However, virtualization technology provides an excellent solution to this problem. Why not use VirtualBox to set up a virtual machine with all of your pipeline’s prerequisites and distribute that as a supplement to the publication? (By the way, I had this idea before ENCODE released a virtual machine preloaded with the code and data described in its publications…) I can’t imagine this taking more than a few hours, which is nothing compared to the total amount of time it takes to draft, edit, and revise the corresponding manuscript. If a scientist is incapable of going through the steps to install their pipeline’s prerequisites on another machine and re-run the pipeline to verify the expected results, then either the pipeline is way too complex and kludgy to trust, or the scientist lacks the necessary experience. In either case, I don’t feel comfortable drawing conclusions from any results obtained from that pipeline.
Last week I stumbled upon a really cool application called ascii.io. It provides a nice mechanism for capturing a terminal session and then playing it back later. It is implemented as a Python script that seems to be little more than a glorified keylogger: it records what you type (as well as the timing of each keystroke) and what is printed to the terminal. When you finish recording your session, all of this logged information is pushed to the ascii.io server, and a unique ID and URL are created for the session you just recorded.
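To get a feel for how little machinery this takes, here is a hypothetical sketch of the core idea (my own simplification, not the actual ascii.io code): run a command in a pseudo-terminal and log each chunk of terminal output with a timestamp, which is the essential data needed to replay a session with its original timing.

```python
import json
import os
import pty
import time

def record_session(argv, outfile="session.json"):
    """Run argv in a pseudo-terminal, logging every chunk of terminal
    output with a timestamp so the session can be replayed later.
    A sketch of the idea only, not the real ascii.io implementation."""
    frames = []
    start = time.time()

    def master_read(fd):
        # pty.spawn invokes this whenever the child writes to the terminal;
        # we log the chunk (with its offset from the start) and pass it through.
        data = os.read(fd, 1024)
        frames.append([time.time() - start, data.decode("utf-8", "replace")])
        return data

    pty.spawn(argv, master_read)
    with open(outfile, "w") as fh:
        json.dump(frames, fh)
    return frames
```

Recording a full interactive shell would just be `record_session([os.environ["SHELL"]])`; the real tool additionally logs your keystrokes and uploads the result to its server.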
Here’s a cute little sample I just created.
So how does this relate to biology? It’s no secret that high-throughput sequencing and other “big data” technologies are changing the way biology is done. Many of the best tools for managing and analyzing biological data are available only through the command line. Familiarity with the command line and the ability to process and analyze data using command-line tools is a skill set that is increasing in demand every day, as evidenced by the number and frequency of introductory bioinformatics seminars/tutorials/workshops coming to a campus near you.
The command line is still a mysterious place for many a biologist. Static text or graphics on a web page can only go so far in explaining what the command line is and how it is used in biology and bioinformatics. A tool like ascii.io can make a boring tutorial come alive, making it much clearer what the user types versus what a command prints to the terminal. The benefit of ascii.io over, say, a more traditional screen video capture program is that storing the text associated with a terminal session is much less bulky than storing video data, especially at the resolution required to make a terminal session legible in a Flash player.
The ascii.io app is free, open source, lightweight, and a balanced part of your next bioinformatics seminar/tutorial/workshop!
I had the opportunity to attend Cold Spring Harbor’s Genome Informatics conference this year. Here are a couple of my favorite highlights.
Michael Schatz’s presentation briefly mentioned metassembly, but a student at Notre Dame (a collaborator and former intern of Schatz’s) presented a poster dedicated to the subject. He implemented a program called Metassembler, which takes as input two different assemblies of the same data (perhaps from different assemblers, or from the same assembler run with two different parameter settings) and derives a consensus assembly of superior quality to either input assembly.
When I spoke with the student presenting the poster, he said it would be a couple of weeks before the code was ready for distribution. Given that their wiki has not been updated since before the conference, I’m not holding my breath…although I will be very interested to try this software out when it is available.
Another poster I enjoyed was presented by a student (undergrad?) of Dr. John Karro at Miami University in Ohio. The student implemented a Hidden Markov Model to identify alternative sites of polyadenylation in transcripts. The HMM was pretty simple, but I enjoyed discussing the relevant biology, of which I was not previously aware.
One of the main sessions included a presentation about the Assemblathon genome assembly contest (since published in Genome Research). I don’t really remember much about which submissions/methods performed better than the others–what I enjoyed most about this presentation was the discussion of the different comparison metrics they developed to measure the relative quality of the submitted genome assemblies. One I remember off the top of my head was the cc50 measure–the “correct contiguity” analog of the n50 measure. Essentially, cc50 measures the distance at which 50% of the contigs (or scaffolds?) in the assembly are situated correctly with reference to the other contigs. They defined several other metrics to assess a variety of important characteristics of assembly quality. This is something I will definitely be going back to look at in more depth.
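For reference, the plain n50 measure that cc50 riffs on is simple to compute; here is a quick sketch of my own, using the standard definition (the largest length L such that contigs of length at least L cover half the assembly):

```python
def n50(contig_lengths):
    """Return the N50 of an assembly: the largest contig length L such
    that contigs of length >= L together cover at least half of the
    total assembly length."""
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating length
    # until we have covered at least half of the assembly.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

For example, an assembly with contigs of lengths 100, 50, 30, and 20 has an n50 of 100, since the single longest contig already covers half of the 200 bp total. The appeal of cc50 is that it measures correctness rather than just contiguity, so an assembler can’t score well merely by gluing sequences together aggressively.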
Steven Salzberg gave a presentation about the GAGE competition his research group conducted. Unlike the Assemblathon, which accepted community submissions, the GAGE project was conducted entirely by Salzberg’s lab. Essentially, they tested a wide variety of already available genome assemblers on real data and tried to assess the relative performance of each assembler. Rather than trying to drive innovation, the project addressed practical questions commonly faced by biologists in this new information age of biology.
The takeaway message I got from the presentation is that by traditional assembly quality metrics (n50, n90, longest scaffold length, etc.), SOAPdenovo consistently generated the best results, followed by AllPathsLG. However, for each dataset there was a high-quality reference assembly available for comparison, so they also assessed the quality of the assemblies after splitting contigs and scaffolds at regions containing large amounts of error. For these corrected assemblies, AllPathsLG consistently provided the best performance, followed by SOAPdenovo. At the end of the day, SOAPdenovo provides the largest, (perhaps) most complete assemblies, while AllPathsLG provides assemblies that may be a bit smaller but have far fewer errors.
I enjoyed many other presentations and posters, but I only have so much time to sit and reflect on them now!
A few months ago, I sat down with a post-doc and we made a list of transposable element (TE) prediction software. We came up with over 20 programs, scripts, etc., and got to work trying to download, install, and use these various software tools. This was perhaps the most frustrating experience I’ve had in grad school to date.
A select few programs were well documented and “just worked” exactly as advertised. More often, though, the program documentation was unclear, redundant, contradictory, or simply insufficient. A few programs were even missing documentation altogether! Although we had a list of over 20 programs, we were only able to get results from 6 of them after several weeks of trying.
At one point during these few horrid weeks, I stormed into the office of one of my professors and just vented about how frustrating it had been. He was very patient with me and helped me talk it out, and I was able to get back to work soon. However, I made a promise to myself that day that I will never cause anyone that amount of grief by writing crappy software, incomplete documentation, or research that is not completely and easily reproducible.
Recently, I had the first meeting with my PhD committee and as part of my Description of Proposed Research, I decided to state this goal explicitly in a section called Research Philosophy.
Much of my dissertation work will involve developing new tools and methodologies for genomics research. My goal is to make all of this work accessible, usable, and reproducible by the scientific community. Of course this philosophy is not unique to me, as it is implicit in the scientific method. My reason for making this goal explicit during the initial stages of my research is to commit myself to a higher standard than what may minimally be expected for graduation.
The following provides my specific plans for achieving the goal of accessibility, usability, and reproducibility with my research.
- Use permissive open-source licensing
- Host source code, data, other supplements externally
- Maximize software portability; compatibility with all POSIX-like systems preferred, but compatibility with all Linux systems as a minimum
- Provide clear, accurate documentation
- Eliminate complicated installation procedures
- Reduce external dependencies
- Provide simple examples
- List all parameter values used for more complicated examples or use cases
- Provide accurate accession numbers for all data used
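As one small, concrete illustration of the “list all parameter values” and “accurate accession numbers” items, every analysis script can write its exact configuration alongside its results. The sketch below is hypothetical (the accession and parameter names are made up for illustration), but it shows the habit I have in mind:

```python
import datetime
import json
import os

# Hypothetical sketch: record the exact parameters and data accessions of
# an analysis run next to its results, so the run can be reproduced later.
# The accession and parameter names below are placeholders, not real values.
run_parameters = {
    "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "data_accession": "XX_000000",   # placeholder accession number
    "kmer_size": 31,                 # example parameter
    "threads": 4,                    # example parameter
}

# Write the configuration into the results directory itself, so the two
# can never be separated.
os.makedirs("results", exist_ok=True)
with open("results/run-parameters.json", "w") as fh:
    json.dump(run_parameters, fh, indent=2)
```

A plain-text record like this costs seconds to produce, and it means the methods section of a paper can be checked against the actual values used rather than reconstructed from memory.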
Bioinformatics and computational biology are very appealing to a certain type of scientist. Unanswered questions in the life sciences are some of the most interesting and important in the world, and bioinformatics offers a novel approach to answering these questions. This novel approach requires a firm grounding in multiple disciplines, ranging from molecular biology to computer science to statistics to genetics to engineering and everything in between.
Because bioinformatics is such an interdisciplinary field, I often find myself wishing I had more time to learn more things so that I could be more of an “expert” in more areas related to bioinformatics. I do a lot of programming, but I wish I had the time to truly develop the skills of a professional software engineer. I use statistics frequently in my research, but I often lack intuition when it comes to solving even basic problems of probability. Most of my research is related to genetic and genomic sequence data, but my bench skills are so rusty that I wouldn’t trust myself to handle expensive sequencing reagents until I’ve had a few months’ refresher in the lab. The problem is that I don’t have enough time to do a PhD in 3 (or 7!) different traditional scientific disciplines.
I came across this article in PLoS the other day that I thought really addressed this issue well. The author, like me, has a skill set distributed across several “traditional” scientific disciplines. He is an “expert” in his area of research, but his area of research does not fall nicely into any one category. He suggests that just because someone is not an “expert” in any one traditional discipline does not mean they cannot make useful individual contributions to science. In fact, they can make unique individual contributions that otherwise would require the collaboration of “experts.” It’s definitely a good read and gave me some confidence as I press forward in my interdisciplinary (or should I say antedisciplinary) graduate studies.