
Basic summary stats for all genomes at NCBI

I’m wrapping up a manuscript this week (woo hoo!), and at one point was looking for a reference for the claim that a genome-wide GC content of ≈30% is one of the lowest observed in any animal genome. I looked at a couple of recently published animal genomes, and couldn’t even find a mention of the genome’s overall nucleotide composition. I know all of the data is accessible from NCBI, but there are hundreds of published animal genomes, and I was not keen on downloading all those genome sequences and computing GC content. Plus, I mean come on, NCBI should have this type of information easily accessible somewhere, right?

It turns out my intuition was correct. There is a “Browse by organism” link on NCBI’s genomes page that brings up a tabular readout of basic summary stats for all genome assemblies submitted to NCBI. Finding the information I needed was as simple as clicking the “Eukaryotes” tab, filtering by group (“Animals”), and sorting by %GC content (genome size and gene content are also provided). After skipping the entries with missing values for %GC content, I was able to confirm the claim, so I referenced this resource in the paper.
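
Incidentally, if you ever do need to compute GC content yourself for a genome or two, it only takes a few lines of Python. Here is a minimal sketch (the file name is hypothetical, and ambiguous nucleotides are simply ignored):

# Overall GC content of a Fasta file (minimal sketch)
gc = at = 0
with open('genome.fa') as fasta:  # hypothetical file name
    for line in fasta:
        if line.startswith('>'):
            continue
        seq = line.strip().upper()
        gc += seq.count('G') + seq.count('C')
        at += seq.count('A') + seq.count('T')
print('GC content: %.1f%%' % (100.0 * gc / (gc + at)))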

Pandas: querying a data frame and the `SettingWithCopyWarning` message

I’m doing more and more of my data analysis with Python Pandas these days, but I’m far from an expert and still much more comfortable with R. The reason I persist, though, is that Python provides a complete data processing and analysis stack. In the past, I’ve typically used Perl or Python scripts to pre-process data, R to do statistics and visualization, and shell scripts to glue it all together. Python’s rapidly maturing libraries for numerics, statistics, and data analytics, together with an active community and the awesome IPython notebook, make the prospect of doing end-to-end analysis completely in Python quite compelling.

That said, I still have a lot to learn regarding Pandas. Earlier today I was doing some simple operations—taking a subset of a data frame and creating a new column by applying a transformation to another column—and I kept getting a cryptic SettingWithCopyWarning message. Here’s a dummy example that reproduces the message.

>>> # Load data into memory
>>> import pandas
>>> exons = pandas.io.parsers.read_table('test.dat')
>>> exons
                 Species  ID  Length   Context
0        Equus monoceros  e1     162      stop
1          Homo diutinus  e2     111     start
2  Draconis occidentalis  e3      51       cds
3    Basiliscus vulgaris  e4     114  complete
4          Equus pegasus  e5      95       utr
>>>
>>> # Subset the data
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')]
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> import math
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
__main__:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
>>>

I went on to plot the data and there didn’t seem to be any problems. I also followed the link provided in the warning message and read the documentation. If I’m reading things correctly, the warning is issued to inform users that they might be operating on a copy of the original data rather than on the original data itself. In this example, if I made changes to the ungulates variable and wanted those changes reflected in the original exons variable, then I would need to operate on the original data, not on a copy.
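
In that situation, the .loc-based assignment suggested by the warning message is the way to go: select the rows and assign the new column on the original data frame in a single step. Here is a rough sketch of what that might look like (it was not part of my original session, so treat it as untested):

>>> # Operate on the original `exons` data frame directly (sketch, untested)
>>> mask = (exons.Species == 'Equus pegasus') | (exons.Species == 'Equus monoceros')
>>> exons.loc[mask, 'LogLength'] = exons.loc[mask, 'Length'].apply(lambda x: math.log(x, 10))
>>> # Rows outside the mask get NaN in the new `LogLength` column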

However, in my case it wasn’t really a huge concern, so I was able to get rid of the warning message by making my intent to work on a copy of the data explicit.

>>> # Subset the data, but make an explicit copy
>>> ungulates = exons.loc[(exons.Species == 'Equus pegasus') |
                          (exons.Species == 'Equus monoceros')].copy()
>>> ungulates
           Species  ID  Length Context
0  Equus monoceros  e1     162    stop
4    Equus pegasus  e5      95     utr
>>>
>>> # Create a new `LogLength` column
>>> ungulates['LogLength'] = ungulates['Length'].apply(lambda x: math.log(x, 10))
>>>
>>> # No warning message this time, woohoo!
>>>

If you’re interested in seeing the interactive Python interpreter in action, check out this asciicast.

GitHub now renders IPython/Jupyter Notebooks!

I’ve written before about literate programming and how I think this could be a big game changer for transparency and reproducibility in science, especially when it comes to data analysis (vs more traditional software engineering). Well, GitHub announced recently that IPython/Jupyter notebooks stored in GitHub repositories will be rendered, rather than presented as raw JSON text as before. This is a very nice development, making it even easier to share data analysis results with others!

Small data sets for bioinformatics instruction

This semester I am teaching a computational genomics class for the third time. I’ve learned a lot from the previous iterations of the course, but one thing we’ve struggled with each year is the time/focus balance between technical tasks (such as What are the mechanics of running this genome assembler? or How can I execute this command on all of my sequence data files?) and science tasks (such as How do I evaluate the quality of this gene annotation?). Obviously in a class like this, answering any science question requires some technical work, and unfortunately most of the students have very little training in basic data and computing literacy.

One way I have attempted to improve the situation this semester is to provide each unit with a preliminary exercise based on a small data set that can be processed very quickly. I have done this to some extent in previous years, but I have been much more consistent and intentional this semester, encoding it in the formal class structure. The idea is that each unit has an exercise and an assignment. The exercise is intended to be executed by rote, and is there to give each student immediate feedback on whether they are doing things correctly. Only when students are comfortable completing the exercise (and understand the purpose of each step) do they proceed to the full assignment, which involves larger data sets and more critical thinking and intellectual input on their part.

What I wanted to highlight in this post, however, are the tools I use for creating small data sets for preliminary exercises.

  • wgsim: This package is the little read simulator that could. It does one thing, and it does it extremely well. The code hasn’t been updated since 2011, but it’s written in C and compiles nicely on pretty much any UNIX system. I’ve found simulated reads to be very helpful for mapping exercises, although they work for assembly exercises as well. I essentially selected a 150 kb region of a genome of interest and used wgsim to simulate something like 10k-100k paired-end reads. This is enough data to create some read pileups so that tview analysis isn’t completely boring, but small enough that bwa or bowtie will run in seconds.
  • normalize-by-median.py: This script from the khmer package performs a “digital normalization” procedure to discard redundant data from regions of high coverage. The 10-second elevator pitch for diginorm is that it removes data without removing information. It is not uncommon for this procedure to discard 75-90% of the data in a Fastq file, which allows assembly jobs to run much more quickly and in much less memory.
  • sample-reads-randomly.py: This script, also part of khmer, uses reservoir sampling to pull out a random subset of reads from a Fasta/Fastq file (see the sketch after this list). If you have done digital normalization and the data set is still not small enough, I would recommend using this script to discard even more data. Of course, this type of data reduction is (intentionally) lossy, so any downstream interpretations will have to take that into account.
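
For the curious, the reservoir sampling idea behind that last script is simple enough to sketch in a few lines of Python. This is just an illustration of the algorithm (the file names and sample size are made up), not the khmer implementation.

import random

def fastq_records(stream):
    """Yield Fastq records as lists of 4 lines."""
    while True:
        record = [stream.readline() for _ in range(4)]
        if not record[0]:
            break
        yield record

def reservoir_sample(records, k):
    """Return a uniform random sample of k records (Algorithm R)."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = random.randint(0, i)  # keep this record with probability k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

with open('reads.fastq') as infile:               # hypothetical input file
    sample = reservoir_sample(fastq_records(infile), 1000)
with open('reads-sample.fastq', 'w') as outfile:  # hypothetical output file
    for record in sample:
        outfile.write(''.join(record))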

Running bioinformatics software can often be challenging, especially for the uninitiated, and critically evaluating that software is even harder. I have found these tools to be invaluable resources for both teaching and research.

In silico empirical estimation of paired-end insert size

I’m no expert in library preparation for DNA sequencing, but as I understand it, protocols for paired-end and mate-pair sequencing involve a size selection step that retains only DNA fragments within a tight target size range. By controlling the size of sequenced DNA fragments (or “inserts”, borrowing terminology from the days when DNA fragments were actually inserted into vectors and clonally amplified), assembly and alignment software are better able to place reads in their correct locations.

Size selection is not perfect though, so even if you select for, say, 500 bp inserts, you’ll end up with a range of insert sizes. To accommodate this, assemblers and read aligners typically ask users to provide a mean and standard deviation for the insert size. What values do you provide? How wide is the length distribution?

I’ll be frank and admit that for most of my career, I’ve simply provided the target insert size as the mean and a “reasonable” value for the standard deviation (please don’t ask me to define “reasonable” 🙂 ). There are, however, better ways to do this. Hopefully your sequencing center used gel electrophoresis to confirm that the size selection step didn’t fail miserably, and the position of the band in that gel, relative to a ladder, should make the mean insert size pretty clear.

But what if your sequencing center did not provide you with this information, or what if you want a second opinion? Are there any bioinformatics approaches that can help?

Sure! We can empirically determine the distribution of insert sizes by aligning the reads to the genome sequence and observing the outer distance of the paired read alignments. The protocol might look something like this.

# Align paired-end reads with BWA
bwa index genome.fa
bwa mem genome.fa reads1.fq reads2.fq > align.sam

# Convert the alignments to BAM format and sort with SAMtools
samtools view -bS align.sam | samtools sort -o align-sorted.bam -

# Collect insert size metrics with Picard
java -jar /path/to/picard.jar CollectInsertSizeMetrics \
    HISTOGRAM_FILE=inserts-hist.pdf \
    INPUT=align-sorted.bam \
    OUTPUT=insert-metrics.out

# A graphic of the insert size distribution will be in `inserts-hist.pdf`.
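
If Picard isn’t handy, you can also get a rough estimate straight from the SAM file by looking at the TLEN column (column 9) of the alignments. Here is a quick Python sketch along those lines (not part of the original protocol; it simply ignores zero and negative TLEN values rather than doing any careful filtering of discordant pairs):

import statistics

# Rough insert size estimate from the TLEN column of a SAM file (sketch only)
tlens = []
with open('align.sam') as sam:
    for line in sam:
        if line.startswith('@'):  # skip header lines
            continue
        fields = line.split('\t')
        tlen = int(fields[8])     # column 9: observed template length
        if tlen > 0:              # count each properly oriented pair once
            tlens.append(tlen)
print('mean=%.1f stdev=%.1f' % (statistics.mean(tlens), statistics.stdev(tlens)))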

But wait, you object, I’m doing a de novo genome assembly. You’re telling me that I have to have my genome sequence to determine what insert size I should use for reconstructing my genome sequence?!?!

Yeah, I know. Assembling the genome and mapping the reads to that genome assembly both require you to specify insert sizes, so it’s a bit of a chicken-and-egg problem. However, genome assembly is hard, and it’s unlikely that you’ll get it right on your first try. Assuming that your first assembly attempt isn’t a complete and utter failure, you should be able to use that as a reference to align your reads and determine the insert size distribution, which you can then use for subsequent assemblies.

In fact, I would recommend doing this as soon as possible after receiving your data to ensure there weren’t big issues with the sequencing. If your 10 kb mate-pair library turns out to be mostly 300 bp inserts, and you report this to your sequencing center a month after they give you your data, chances are they can do something about it. However, if you only check this after spending two years assembling and annotating the genome, there’s probably nothing that can be done.

The fastest darn Fastq decoupling procedure I ever done seen

There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…

The Fastq format is one of the most pervasive formats in biology and bioinformatics, as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair at the same position in two paired files—that is, the first record in file 1 and the first record in file 2 form a pair, the second record in file 1 and the second record in file 2 form a pair, and so on.

I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastq files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly each have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.

The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.

I’ve blogged before about how amazingly useful the paste command can be for working with Fastq data. In fact, combining paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) \
    | tr '\t' '\n' \
    > reads-int.fastq

However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the split-paired-reads.py script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses paste and process substitution, but it also takes advantage of the oft-overlooked tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).

paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq

This procedure did the conversion in 2.5 minutes, while I terminated the incomplete split-paired-reads.py procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.


*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.

‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its largest selling point for me. And while shell tools clearly outperformed khmer for this simple use case, khmer also parses other data formats, both compressed and uncompressed, and probably supports multiline Fastq and broken paired Fastq. And then of course, khmer will integrate very nicely with environments like the IPython Notebook, whereas integrating shell commands could potentially be a bit more kludgy.

sequniq: tools for identifying and managing duplicated sequences

In next-generation sequence data, duplicates of non-biological origin (such as PCR duplicates) are fairly common; they can complicate tasks such as assembly and expression quantification and jeopardize the accuracy of the results. If reads are sampled at random from a genome, it’s very unlikely that two reads will be sampled from exactly the same position. For paired-end data especially, there is essentially no explanation for two read pairs being identical other than amplification or some other technical artifact. Obviously, it’s important to scan our NGS data for duplicates and remove them as part of our preliminary quality control. This is a pretty simple task, so software solutions should abound, right? Right!?!?

Well, it’s very possible that I’m missing something, but I’ve only been able to find a single tool for removal of duplicate reads (FastUniq, described in this PLoS ONE paper). I did see tools for removal of duplicate read alignments, but that’s not going to help much if you’re working on a de novo assembly. I went ahead and downloaded FastUniq, ran it on my data, and verified the results. The user interface was a bit kludgy, but boy have I seen worse!! At least it was well documented and worked as advertised.

Later, I was working with two different annotated genome assemblies for the same organism, which I had parsed into individual loci based on gene content. I was interested in determining how many loci were strictly identical between the two assemblies. As I started writing code for this task, I realized how similar it was to the simple task of removing duplicate reads. In duplicate read removal, we store read sequences in memory, and whenever we encounter a sequence we’ve already seen we discard it. Identifying identical loci is much the same, except that I wanted to report the “duplicates” rather than discard them1.

After realizing this, I took a few minutes to extend and generalize my code so that it could handle both use cases. I then tested it on a data set I had just processed with FastUniq. My code ran in 29 minutes, whereas FastUniq had required 69 minutes. This does not include the time required to convert between paired and interleaved data (FastUniq requires paired files, my code requires interleaved), nor does it include the time required to shuffle the FastUniq output2. So some simple off-the-cuff Python code was able to outperform the only tool I could find for removing duplicates de novo.
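
To give a flavor of the approach, here is a bare-bones sketch of the idea (not the actual sequniq code): de-duplicate an interleaved Fastq file by hashing each read pair, with SHA1 digests standing in for the full sequences to keep memory down. The file name is hypothetical.

import hashlib
import sys

def read_pairs(stream):
    """Yield read pairs from an interleaved Fastq file as lists of 8 lines."""
    while True:
        pair = [stream.readline() for _ in range(8)]
        if not pair[0]:
            break
        yield pair

seen = set()
with open('reads-int.fastq') as infile:    # hypothetical input file
    for pair in read_pairs(infile):
        seq = pair[1] + pair[5]            # the two sequence lines of the pair
        digest = hashlib.sha1(seq.encode('utf-8')).digest()
        if digest in seen:
            continue                       # duplicate pair; discard it
        seen.add(digest)
        sys.stdout.write(''.join(pair))    # first occurrence; keep it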

I wrapped up my code as a Python package and posted it on GitHub. It still needs a bit of polishing, which I may or may not ever get around to doing.


1Because the latter approach involves storing longer genomic sequences, I wrote the code to store SHA1 hashes of each sequence, rather than the sequence itself. This drastically reduced memory consumption without an appreciable increase in runtime.

2For some reason that I did not understand from my very cursory skimming of the FastUniq paper, the FastUniq tool includes a sorting step. Therefore, the first read of the output is the read with the most consecutive adenines (As) at the beginning of the read. Looking at the output set off all kinds of red flags and alarms in my brain. Anyone who has taken a course on data structures or algorithms knows that weird things can happen performance-wise when unsorted data is expected but sorted data is provided (and all NGS analysis tools expect unsorted data). I would not be comfortable using FastUniq’s output directly without first shuffling the data, which only increases the margin of improvement of my Python code over FastUniq!

Task automation with bash and parallel

Nobody gains significant experience in genome informatics without the drudgery of running the same command(s) over and over again on a set of files. Everybody has a different tolerance level for tedious tasks—I’m often surprised at the patience some people have to sit there and manually re-run the same command dozens of times. Regardless of your pain threshold, however, you’ll eventually run into a data set that is simply too large to be analyzed or processed piece by piece. Luckily, there are several approaches to task automation that 1) are very simple to use and 2) are portable across most UNIX and UNIX-like systems such as Linux and Mac OS X. This post will give a brief introduction to some of the techniques I have picked up over time.

The for loop

The easiest way to automate repetitive command-line tasks is with the bash for loop. Assuming you’ve exercised at least a minimal amount of discipline in naming your data files, it should be straightforward to apply the same command to many different data files.

The basic structure of a for loop is as follows.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt; done

If you’re putting the commands in a script (or describing them on a blog 🙂 ) it often increases readability if you split it across multiple lines like so.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt
done

The code in the middle gets executed once for each value specified on the first line. That’s it! That’s a for loop. If you’d like a more long-winded version, here’s the breakdown.

  1. for i in 1 2 3 4 5: This syntax tells the shell that we want to repeat a command multiple times using multiple values. In this case the values are the integers 1 through 5, but these could be anything: names, sample labels, whatever. For this specific example, we could use bash range notation and replace the numbers with {1..5}, which is especially useful if you have a large number of serially numbered files (e.g. file1.txt, file2.txt, file3.txt, and so on).
  2. do someCommand data$i.fastq > output$i.txt: This is the command we want to run multiple times. In this particular example, the command will be executed 5 times, and each time the variable $i will be filled in with one of the values we specified earlier (first 1, then 2, etc). This example assumes that we have 5 input files: data1.fastq, data2.fastq, ..., data5.fastq, and will create 5 corresponding output files: output1.txt, output2.txt, ..., output5.txt.
  3. done: This keyword indicates the end of the loop.

There are some additional things you might want to consider.

  • How you name your files can make a big difference. There is no programming trick on earth that can help you if your input files are named like data_1.fastq, data-2.fastq, data3.fastq, data_4.fq, and data_five.fastq. Although it’s easy for us as humans to see what the pattern is, simplifying this so that a computer can easily process it requires you to be consistent with your file naming schemes.
  • Embedding a variable in a filename or command that has underscores can be problematic, since the underscore is a valid character for bash variables. Let’s take the example from above, but imagine instead the input files are named like data_1_trim.fq, data_2_trim.fq, and so on. We might be tempted to wrap someCommand data_$i_trim.fq > output$i.txt in our for loop, but this wouldn’t work. Bash will interpret $i_trim as a variable, instead of just $i as was intended. If there is ever any possibility of confusion, it’s always better to use explicit syntax and wrap variables in braces like so: someCommand data_${i}_trim.fq > output${i}.txt.

The for loop, revisited

In a regular for loop, each command is executed sequentially. That is, the first iteration is executed, bash waits for it to finish, and only when it is complete does the loop proceed to run the second iteration (and so on for all iterations). If our loop runs 5 commands, and each takes 1 minute, then the regular for loop approach will take 5 minutes to run.

A very simple modification allows you to run all of the iterations simultaneously in parallel, without waiting for the first command to finish. This will allow us to run all 5 commands at once, getting all of our results in 1 minute (instead of 5). This is done by placing the ampersand character after the command in the for loop.

for i in 1 2 3 4 5; do someCommand data$i.fastq > output$i.txt & done

Here it is again in expanded form.

for i in 1 2 3 4 5
do
  someCommand data$i.fastq > output$i.txt &
done

There are a couple of things to consider here as well.

  • The ampersand character tells bash to run the command in the background. If you are typing these commands directly into your shell you should have no problems with this. However, if you have stored the commands in a script you will need to include a wait command after the loop. This is because bash scripts do not normally wait for background processes to finish before exiting. The wait command forces the bash script to wait for the loop to finish before exiting.
  • The ampersand provides a very simple and powerful extension of the for loop, but it requires caution. With this approach, each iteration of the loop will spawn a new process, and if the number of spawned processes exceeds the number of available processors/cores, this could lead to performance issues. Only use this approach if you are sure your machine has at least as many processors as there are iterations in your loop.

The GNU parallel command

Although the bash for loop is simple and powerful, there are cases where it doesn’t work too well, primarily when a loop has a large number of iterations that you want to speed up with multiple processors, but the number of iterations is much larger than the number of processors. For instance, you may have hundreds of samples you want to run some quality control command on, and your desktop has 16 processors. The normal for loop described above is not optimal because it will only use 1 processor at a time. The parallelized for loop described above doesn’t work because it will try to run all of the samples at once, quickly overloading the computer. We need something that will run the command on all of our hundreds of samples, but only keep 16 processes running at a time. Enter the GNU parallel command.

Let’s continue with the scenario described above, but imagine we had 512 input data files instead of 5. Assuming the file names are numbered appropriately, we can process these files, 16 at a time, with the following command.

parallel --jobs 16 someCommand data{}.fastq '>' output{}.txt ::: {1..512}

If you take a moment to look at this, it’s very similar to the for loop.

  • Instead of specifying the iteration values at the beginning (1 2 3 4 5 or just {1..5}), we specify them at the end after the triple colon. Alternatively, parallel can read these from standard input or a file.
  • Instead of using a loop variable like $i, we simply use empty curly braces {}.

There are a few considerations to note here as well.

  • Note the file redirect symbol > in quotes. If you do not put this in quotes, it will be interpreted by the shell instead of the parallel command.
  • The GNU parallel command is relatively recent and is not available on many systems by default. If you’re on Linux, it should be fairly easy to find and install using your system’s package manager (apt-get, yum, and so on). On Mac you can install it using Homebrew.
  • Some versions of parallel may require you to add the --gnu flag for correct behavior. I have no idea which versions require this or why. Basically, if the command fails right away with an error message try adding or removing the flag.
  • The parallel command supports multiple arguments per command. This isn’t really helpful for the example discussed above, but check out the man page for parallel, and specifically the description of the -N flag and numbered arguments.

Thanks to oletange for pointing out the parallel file redirection issue.

Literate programming, RStudio, and IPython Notebook

A substantial portion of scientific software is written by scientists with little or no formal training in software engineering. As a result, many scripts and programs are untested, unreliable, a huge pain to install or configure, and unlikely to be accessible for more than a few months or years. One does not need to be a software engineer to write good, useful software. For example, simply adding a reasonably descriptive command-line interface to a script or program does not take much work, and can go a long way in making your program easy to use. Placing the software under version control and adding automated tests also provide a huge benefit without a significant amount of additional effort. An aside: I was recently certified as an instructor for Software Carpentry, which provides excellent 2-day workshops introducing these topics at a beginner level to scientists.

My point is that if you’re writing software to be distributed amongst collaborators and colleagues, there are well established software engineering practices to make this a painless experience for everyone involved, and learning the basics doesn’t require much work.

However, a lot of scientific computing does not involve writing scripts or programs to be used by a wide audience. Often, you need to explore the data. You want to compute a variety of statistics and visualize data distributions. You need to download other data sets for comparison, and you feverishly take notes to make sure you can repeat the analysis later if needed—and let’s just be honest here, of course you’re going to have to redo it later! Exploratory data analysis is a tedious and frustratingly messy process. We often have to throw everything we know about sound software engineering out the window in the name of simply getting crap done. And even if you are able to organize your data and scripts and notes just so, sharing this with colleagues and disclosing details for publication can also be very messy.

For over a year I had been hearing a lot about RStudio and IPython Notebook, but it wasn’t until I finally tried them a couple of months ago that I realized how big of a game changer these tools are for data analysis. Both facilitate a kind of literate programming where natural language (like English prose) can be interspersed among source code and the output produced by executing that code. This facilitates placing an entire analysis in a single easy-to-edit document, which renders beautifully and seamlessly with formatted text, blocks of syntax-highlighted source code, and code output, including both text and graphics. These individual documents are easily placed under version control and shared with colleagues via email, GitHub, or online hosting services like RPubs and IPyNb Viewer.

So how do RStudio and IPython Notebook work? RStudio provides an integrated development environment (IDE) for R programming. In that environment, you can create an Rmd (R Markdown) file for your analysis. You can add text to describe your analysis, and Markdown allows you to format the text with headers, bold, italics, hyperlinks, and so on. Any code that you want to execute is placed in code blocks, and then when you want to execute that code you click the “knit HTML” button. This will create an HTML document with your formatted text, your syntax-highlighted code, and any output produced by the code (whether it be text or a graphic).

The IPython Notebook is pretty similar, except that the notebook runs in your web browser instead of an IDE. Perhaps the biggest functional difference between the two is that in IPython Notebook, you can run and re-run individual code blocks independently and see the results. This is helpful if you want to rapidly test small changes without re-running time-intensive steps. In RStudio you must re-knit the entire document if you want to see any changes.

So if it’s not clear already how this solves the problem of messy exploratory data analysis, here’s my perspective. Using IPython notebook, on a whim I can launch into a computing task, and my notebook will keep a faithful record of all that I did. At any point, I can save the notebook and close it, and then come back to it later. I can go back through later and add notes, or clean up any existing notes. I can easily publish it online for my collaborators to see, and if they’re so inclined, they can easily download and edit the notebook themselves!

So whether you prefer R or Python for your data analysis needs, there are now excellent free tools at our disposal to take a lot of the pain out of scientific computing.

UNIX sort: sorting with both numeric and non-numeric keys

Tabular data is the bread and butter of biology and bioinformatics. Comma- or tab-separated values are easy to write, easy to read, and most importantly they play nice with UNIX shell tools. The UNIX sort command is particularly useful for–you guessed it–sorting tabular data.

Consider the following example, a dummy data set with alphanumeric data in the first and second columns, and numeric data in the third column.

isoform10	False	10.89
isoform1	True	21.98
isoform10	True	3.55
isoform7	True	0.67
isoform9	False	0.99
isoform7	False	0.66
isoform3	True	19.01
isoform7	True	2.48
isoform12	False	11.53
isoform4	True	1.73
isoform12	True	4.30
isoform3	False	0.25

We can sort this data by the first column like so. Note that this does a lexicographical sort, which places isoform10 before isoform2, etc.

[standage@lappy ~] sort -k1,1 mydata.tsv 
isoform1	True	21.98
isoform10	False	10.89
isoform10	True	3.55
isoform12	False	11.53
isoform12	True	4.30
isoform3	False	0.25
isoform3	True	19.01
isoform4	True	1.73
isoform7	False	0.66
isoform7	True	0.67
isoform7	True	2.48
isoform9	False	0.99

Sorting by the third column works the same way, although we get wonky results (i.e. a lexicographical sort instead of a numeric sort) if we don’t specify that it’s numeric data.

[standage@lappy ~] sort -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform7	True	2.48
isoform1	True	21.98
isoform10	True	3.55
isoform12	True	4.30

We need to use the -n flag to indicate that we want a numeric sort.

[standage@lappy ~] sort -n -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform7	True	2.48
isoform10	True	3.55
isoform12	True	4.30
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform1	True	21.98

What if we now want to sort by two columns? The sort command allows you to specify multiple keys, but if we use the -n flag it will apply to all keys, leading to more wonky results.

[standage@lappy ~] sort -n -k1,1 -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform7	True	2.48
isoform10	True	3.55
isoform12	True	4.30
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform1	True	21.98

Man pages to the rescue! Any of the sort program’s flags (such as -n for numeric sort or -r for reverse sort) can be added to the end of a key declaration so that it applies only to that key. Therefore, if we so desired, we could apply a reverse lexicographical sort to the first column and a numerical sort to the third column in this fashion.

[standage@lappy ~] sort -k1,1r -k3,3n mydata.tsv 
isoform9	False	0.99
isoform7	False	0.66
isoform7	True	0.67
isoform7	True	2.48
isoform4	True	1.73
isoform3	False	0.25
isoform3	True	19.01
isoform12	True	4.30
isoform12	False	11.53
isoform10	True	3.55
isoform10	False	10.89
isoform1	True	21.98

That’s it!