Small data sets for bioinformatics instruction

This semester I am teaching a computational genomics class for the third time. I’ve learned a lot from the previous iterations of the course, but one thing we’ve struggled with each year is the time/focus balance between technical tasks (such as What are the mechanics of running this genome assembler? or How can I execute this command on all of my sequence data files?) and science tasks (such as How do I evaluate the quality of this gene annotation?). Obviously in a class like this, answering any science question requires some technical work, and unfortunately most of the students have very little training in basic data and computing literacy.

One way I have attempted to improve the situation this semester is to provide small exercises for each unit that involve small data sets that run very quickly. I have done this to some extent in previous years, but I have been much more consistent and intentional this semester, encoding it in the formal class structure. The idea is that each unit has an exercise and an assignment. The exercise is intended to be executed by rote, and is there to give each student immediate feedback on whether they are doing things correctly. Only when they are comfortable completing the exercise (and understanding the purpose of each step) can the student proceed to complete the full assignment, which involves larger data sets and more critical thinking and intellectual input on their part.

What I wanted to highlight in this post, however, are the tools I use for creating small data sets for preliminary exercises.

  • wgsim: This package is the little read simulator that could. It does one thing, and it does it extremely well. The code hasn’t been updated since 2011, but it’s written in C and compiles nicely on pretty much any UNIX system. I’ve found simulating reads to be very helpful for mapping exercises, although they could also be used for assembly as well. I essentially selected a 150 kb region of a genome of interest and used wgsim to simulate something like 10k-100k paired-end reads. This amount of data is enough to create some read pileups so that tview analysis isn’t completely boring, but it’s a small enough data set that bwa or bowtie will run in seconds.
  • This script from the khmer package performs a “digital normalization” procedure to discard redundant data from regions of high-coverage. The 10-second elevator pitch for diginorm is that it removes data without removing information. It is not uncommon for this procedure to discard 75-90% of the data in a Fastq file, which will allow assembly jobs to run much more quickly and in much less memory.
  • This script, also part of khmer, uses reservoir sampling to pull out a random subset of reads from a Fasta/Fastq file. If you have done digital normalization and the data set is still not small enough, I would recommend using this script to discard even more data. Of course, this type of data reduction is (intentionally) lossy, so any downstream interpretations (if any) will have to be adjusted.

Running bioinformatics can often be challenging, especially for the uninitiated. Critically evaluating bioinformatics software is even harder, but I have found these tools to be invaluable resources for both teaching and research purposes.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s