Category: UNIX

Leading zeros with bash brace expansion

I use bash brace expansion extensively, both in interactive terminal sessions and in shell scripts. The idea is pretty simple: the syntax {n..m} is a shortcut for the list of all integers from n to m, inclusive. I’ve used this syntax several times in examples on this blog, including a recent post about task automation in bash.

Today one of my social media acquaintances posted a tip about brace expansion on Twitter.

The gist: if you want the values in the expanded list to have leading zeros, just add the leading zeros to the first number in the range. I checked that this works for arbitrary ranges, so things like {001..200} and {00001..50000} also work as expected.
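
A quick check in an interactive shell confirms the behavior: bash pads every value to the width of the zero-padded endpoint.

echo {1..12}    # 1 2 3 4 5 6 7 8 9 10 11 12
echo {01..12}   # 01 02 03 04 05 06 07 08 09 10 11 12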

This is a very nice feature I didn’t know about before. Even when you’re lucky enough to have a data set that is nicely organized with well-thought-out filenames, it’s a 50/50 chance whether serially numbered files or samples will have leading zeros. This bash feature makes it trivial to handle both cases!
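
For example, looping over serially numbered files with zero-padded names becomes a one-liner (sample-01.fastq and friends are made-up filenames here).

for i in {01..12}; do
    echo "processing sample-${i}.fastq"
done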


The fastest darn Fastq decoupling procedure I ever done seen

There is a joke out there that bioinformatics is 80% file conversion, and the joke always seems to elicit uncomfortable laughs: it’s funny, but hits embarrassingly close to the mark! I remember the midterm in my first computational biology class as an undergrad. None of the students in the class had any success getting to the intellectual aspect of the assignment; instead, we all ended up in the mire of converting data between file formats. There are so many formats out there, each tool seems to prefer a different one, and while file conversion is a pretty mundane task for experienced bioinformatics scientists, it is anything but intuitive for a beginner. Anyway, I digress…

The Fastq format is one of the most pervasive formats in biology and bioinformatics, as it has become the de facto standard (or, more accurately, collection of related but not entirely compatible standards) for encoding huge collections of short nucleotide reads. Two* conventions have emerged for organizing sequence data when a sequencer reads nucleotides from both ends of a nucleotide fragment (paired-end sequencing). Interleaved Fastq files put read pairs together in the same file, with the forward read followed directly by its paired reverse read. Paired Fastq files place each read pair at the same position in two paired files—that is, the first records in files 1 and 2 form a pair, the second records in files 1 and 2 form a pair, and so on.

I shudder to think about how many CPU cycles are spent each day around the world converting between interleaved Fastq files and paired Fastq files (and what this corresponds to in terms of carbon emissions, wasted time, and taxpayer money). Tools for quality control, mapping, and assembly each have their preference for one or the other. Some blessed souls have implemented support for both in their software, but chances are that if you’re piping together 3 or more tools in your genome analysis workflow, there will be one that only accepts interleaved and one that only accepts paired.

The code needed to do these conversions is simple, and scripts for doing these conversions are abundant. Not all are created equal, though, as I recently found out.

I’ve blogged before about how amazingly useful the paste command can be for working with Fastq data. In fact, combining paste with process substitution will get you a sweet little paired-to-interleaved converter without a single line of Perl or Python!

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) \
    | tr '\t' '\n' \
    > reads-int.fastq
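
If the paste - - - - idiom is unfamiliar: each - reads one line from standard input, so four dashes collapse every group of four lines (one Fastq record) into a single tab-separated row. A toy example with dummy input shows the effect.

printf 'a\nb\nc\nd\ne\nf\ng\nh\n' | paste - - - -
# a    b    c    d
# e    f    g    h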

However, I never quite figured out how to do the inverse conversion (decoupling) using just command line tools. I used to write short Perl scripts to do this pretty frequently, but after becoming familiar with the khmer library I started saving myself the trouble and just using the split-paired-reads.py script that comes in the khmer distribution. Recently I was using this script to decouple some interleaved reads, and I was getting very impatient with how long it was taking. While it was running, I did a bit of Google searching and found this gem: the shell-based answer to all my Fastq-decoupling needs! Like my interleaving procedure above, it uses paste and process substitution, but it also takes advantage of the oft-overlooked tee command. The basics of this procedure are as follows (the script provides support for compressed data, but we’ll ignore that here).

paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
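
And although we’re ignoring compression here, the same idea extends naturally to gzipped data. Here is a rough sketch (my own variation, not taken verbatim from the linked script), assuming gzipped input and output files with the names shown.

gzip -dc reads-int.fastq.gz \
    | paste - - - - - - - - \
    | tee >(cut -f 1-4 | tr '\t' '\n' | gzip > reads-1.fastq.gz) \
    | cut -f 5-8 | tr '\t' '\n' | gzip > reads-2.fastq.gz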

This procedure did the conversion in 2.5 minutes, while I terminated the incomplete split-paired-reads.py procedure after 18 minutes. Although I remain a big fan of the khmer library for a variety of reasons‡, I am convinced yet again that proficiency with—and eventually mastery of—shell tools is one of the most important skills a bioinformatics scientist can have.


*A third convention, “broken paired”, has been proposed by Titus Brown. This would allow a single file to contain both paired and unpaired reads. However, this has yet to gain wide adoption.

‡The khmer library initially drew my attention with its implementation of digital normalization, which remains its biggest selling point for me. And while shell tools clearly outperformed khmer for this simple use case, khmer also parses other data formats, both compressed and uncompressed, and probably supports multiline Fastq and broken paired Fastq. And of course, khmer integrates very nicely with environments like the IPython notebook, whereas integrating shell commands there can be a bit more kludgy.

Writing clean Bash code

The shell is a paradox. It is mysterious in its complexity and marvelous in its simplicity. You can learn the bare basics with just a couple hours of tinkering, but becoming a shell master takes years of consistent use, and even then it feels like you learn something new every day. On one hand, it has an ugly syntax, and trying to work with strings and arrays is beyond frustrating. On the other hand, there’s nothing quite like the zen achieved when you stitch together a half dozen shell commands to extract and format your data of interest, creating a pipeline that outperforms any Python or C++ code you could ever write.

Whatever your feelings about the shell in general, or about (the most popular shell) Bash in particular, you cannot deny that it is a rustic language with very poor debugging facilities that makes it very easy to shoot yourself in the foot. If you can avoid writing code in Bash, you should. In fact, as blogger Jeff Atwood says, if you can avoid writing code at all, you should. But if you’re a modern biologist then chances are that sooner or later the shell will be the best tool for a particular job. What do you do then?

Over the last few months I’ve really tried to look for ways to improve my shell code. Here are some of the resources I’ve found.

  • Defensive Bash Programming: The author of this blog post provides some suggestions for keeping Bash code clean and readable—often a real challenge when stitching together shell commands in a pipeline, for example. He also provides suggestions for organizing the code and using variables carefully to avoid common pitfalls. If I’m writing a simple Bash script with just a few lines, I’d probably ignore most of this advice, but the longer the script gets, the more important it is to take extra steps to be careful and make your intent clear.
  • Shellcheck: So I wasn’t exaggerating when I said Bash has very poor debugging facilities, but this tool does its best to remedy that. If you paste a shell script into ShellCheck, it will analyze the code and point out potential issues or pitfalls.
  • Unofficial Bash strict mode: This is probably my favorite. The Perl language has a strict mode that makes writing and debugging Perl code a whole lot less frustrating. Bash has no such strict mode, but there are some things you can do to enforce safer behavior in your bash scripts. This blog post suggests placing two lines at the beginning of every Bash script (shown just after this list), and then describes why this is a good idea in a variety of scenarios. I haven’t taken the suggestion fully to heart quite yet, but I have begun placing set -eo pipefail at the beginning of most of my Bash scripts, and I have already saved myself tons of heartache and wasted time by doing so.
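
For reference, here is roughly what such a header looks like. The full “unofficial strict mode” uses set -euo pipefail plus a stricter IFS; as mentioned above, I currently use only the set -eo pipefail portion.

#!/usr/bin/env bash
# Exit on any error (-e), treat unset variables as errors (-u),
# and make a pipeline fail if any command within it fails (pipefail).
set -euo pipefail
# Restrict word splitting to newlines and tabs.
IFS=$'\n\t'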

So the next time you absolutely, positively have to write some shell code, consider referring to these resources to improve the experience!

UNIX sort: sorting with both numeric and non-numeric keys

Tabular data is the bread and butter of biology and bioinformatics. Comma- or tab-separated values are easy to write, easy to read, and most importantly they play nice with UNIX shell tools. The UNIX sort command is particularly useful for–you guessed it–sorting tabular data.

Consider the following example, a dummy data set with alphanumeric data in the first and second columns, and numeric data in the third column.

isoform10	False	10.89
isoform1	True	21.98
isoform10	True	3.55
isoform7	True	0.67
isoform9	False	0.99
isoform7	False	0.66
isoform3	True	19.01
isoform7	True	2.48
isoform12	False	11.53
isoform4	True	1.73
isoform12	True	4.30
isoform3	False	0.25

We can sort this data by the first column like so. Note that this does a lexicographical sort, which places isoform10 before isoform2, etc.

[standage@lappy ~] sort -k1,1 mydata.tsv 
isoform1	True	21.98
isoform10	False	10.89
isoform10	True	3.55
isoform12	False	11.53
isoform12	True	4.30
isoform3	False	0.25
isoform3	True	19.01
isoform4	True	1.73
isoform7	False	0.66
isoform7	True	0.67
isoform7	True	2.48
isoform9	False	0.99

Sorting by the third column works the same way, although we get wonky results (i.e. a lexicographical sort instead of a numeric sort) if we don’t specify that it’s numeric data.

[standage@lappy ~] sort -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform7	True	2.48
isoform1	True	21.98
isoform10	True	3.55
isoform12	True	4.30

We need to use the -n flag to indicate that we want a numeric sort.

[standage@lappy ~] sort -n -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform7	True	2.48
isoform10	True	3.55
isoform12	True	4.30
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform1	True	21.98

What if we now want to sort by two columns? The sort command allows you to specify multiple keys, but if we use the -n flag it will apply to all keys, leading to more wonky results.

[standage@lappy ~] sort -n -k1,1 -k3,3 mydata.tsv 
isoform3	False	0.25
isoform7	False	0.66
isoform7	True	0.67
isoform9	False	0.99
isoform4	True	1.73
isoform7	True	2.48
isoform10	True	3.55
isoform12	True	4.30
isoform10	False	10.89
isoform12	False	11.53
isoform3	True	19.01
isoform1	True	21.98

Man pages to the rescue! Any of the sort program’s flags (such as -n for numeric sort or -r for reverse sort) can be added to the end of a key declaration so that it applies only to that key. Therefore, if we so desired, we could apply a reverse lexicographical sort to the first column and a numerical sort to the third column in this fashion.

[standage@lappy ~] sort -k1,1r -k3,3n mydata.tsv 
isoform9	False	0.99
isoform7	False	0.66
isoform7	True	0.67
isoform7	True	2.48
isoform4	True	1.73
isoform3	False	0.25
isoform3	True	19.01
isoform12	True	4.30
isoform12	False	11.53
isoform10	True	3.55
isoform10	False	10.89
isoform1	True	21.98

That’s it!

Data backup with crontab and iput

This week we had a system error which led to catastrophic data loss in my lab. Fortunately it was restricted to a single virtual machine, but unfortunately that virtual machine happened to be my personal workbench. Inconvenient losses include data, scripts, and notes from recent projects and from just about all of my graduate coursework. The absolutely devastating loss, however, was my electronic lab notebook, which was hosted as a wiki on the machine. Luckily I had done some backups of my lab notebook, but the most recent one I could find was from May 2013. As happy as I am to have avoided losing my entire graduate lab notebook, losing 8-9 months’ worth is still heartbreaking.

So I just finished doing what I should have done a long time ago: automate my backup procedure. Let me introduce the crontab command, which lets you edit a per-user file specifying commands (cron jobs) to be run at regular intervals by the system. Using crontab, you can set up cron jobs to run hourly, daily, or weekly, with lots of flexibility. Here are a few examples.

# Execute 'somecommand' command at the beginning of every hour
0 * * * * somecommand

# Execute 'somecommand' at 1:00am and 1:00pm every day.
0 1,13 * * * somecommand

# Execute '/home/standage/check-submissions' every minute between the hours 
# of 12am-2am on Mondays and Wednesdays.
* 0-2 * * 1,3 /home/standage/check-submissions
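
For reference, the five time fields are interpreted as follows (this is standard cron syntax; some cron implementations accept only 0-6 for the day of week).

# field 1: minute        (0-59)
# field 2: hour          (0-23)
# field 3: day of month  (1-31)
# field 4: month         (1-12)
# field 5: day of week   (0-7; both 0 and 7 mean Sunday)
# The rest of the line is the command to execute.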

You can run man crontab on your system for a complete description of available options.


As for the specifics of my backup procedure, I decided a weekly backup would be sufficient: specifically, at 2am on Saturday morning, when the chances of me doing any research are pretty slim. So I ran crontab -e to open my crontab file and added the following entry.

0 2 * * 6 /home/standage/bin/wiki-backup

The file /home/standage/bin/wiki-backup is an executable bash script that includes the commands needed to perform each backup. This particular script creates a gzip-compressed tar archive of my lab notebook, and then copies it over the network to my directory in the iPlant data store using the iput command. If I had Box or Dropbox installed on that machine, I could just as easily have replaced the iput command with a cp command that copies the backup to the directory on my local system that syncs with the cloud.

#!/usr/bin/env bash

# Name the archive after today's date (YYYY-MM-DD)
cdate=$(date '+%F')
backdir=/home/standage/Backups/labnotebook
backfile="labnotebook.$cdate.tar.gz"

# Create a gzip-compressed tar archive of the wiki directory
cd /var/www/html
tar czf "$backdir/$backfile" labnotebook

# Copy the archive to the iPlant data store
cd "$backdir"
iput -V "$backfile" Backups
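
One detail that is easy to forget: for cron to run the script directly, it must be executable.

chmod +x /home/standage/bin/wiki-backup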

Hopefully this example gives a clear idea of what is possible with cron jobs and how easy it is to set up automatic backups for your critical research data.

Running CCM jobs on Big Red II

Since moving to Indiana University with my advisor, I have made frequent use of the HPC resources made available to grad students, staff, and faculty here. I have made particularly frequent use of the Mason cluster, which has 16 machines, each with 500 GB available memory and 32 processors.

However, over this last weekend I wanted to make use of IU’s new Big Red II. This is a much more typical HPC environment, designed for distributed-memory MPI jobs that utilize processors on different machines in parallel. I occasionally run MPI jobs, so I am familiar with the likes of mpirun and mpiexec (Big Red II uses aprun). These particular jobs, however, were not MPI jobs, and therefore needed to be run in “cluster compatibility mode”, or CCM for short. Below is a simple example launch script that will execute a BLAST job in CCM mode on Big Red II.

#!/bin/bash

#PBS -N NameOfYourBlastJob   # a name for the job
#PBS -l nodes=1:ppn=32       # request a single machine with 32 processors for this job
#PBS -l walltime=12:00:00    # request a 12-hour time slot for this job
#PBS -q cpu                  # use the 'cpu' queue
#PBS -l gres=ccm             # use Cluster Compatibility Mode
#PBS -M username@indiana.edu # email address to which notifications will be sent
#PBS -m bea                  # send email notifications when the job begins, ends, and/or aborts

# Place 'module load ccm' in your ~/.modules file
WD=/N/dc2/scratch/username/working-directory

ccmrun blastx -query $WD/sequences.fna \
       -db $WD/reference-proteins.faa \
       -out $WD/blast-results.xml \
       -evalue 1e-15 \
       -outfmt 5 \
       -show_gis \
       -num_alignments 10 \
       -num_threads 32

Then as usual, the job can be launched by running qsub your-launch-script.sh.
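
After submission, the usual TORQUE/PBS commands can be used to keep an eye on things; for example, qstat lists your queued and running jobs (username is a placeholder here).

qstat -u username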

Process substitutions: commands in place of program arguments

In my tutorial on designing command line interfaces for scientific software, I propose the idea that most software could benefit from the ability to read from standard input and write to standard output—following the paradigm implemented by standard UNIX shell tools which enables their stitching together using pipes and redirects. Not too long ago, however, I came across some syntax that will forever change the way I think about this topic.

Imagine you have a program called myprog that reads from standard input—you could stitch it together with other programs like this.

someCommand | cut -f 2,4,6-10 | sort | myprog

Using the nifty process substitution syntax I recently learned, you could do the same thing like this.

myprog <(someCommand | cut -f 2,4,6-10 | sort)

Surely you must be underwhelmed…but bear with me. This may not seem like much, but the reason it’s really cool is that you can feed the output of multiple commands to your program, treating each one as a positional argument. Standard piping and redirection limit you to a single input stream, but process substitution lets you feed multiple data streams into your program without having to manage intermediate files yourself. Under the hood, bash connects each substituted command to your program through a named pipe (or a /dev/fd file descriptor, depending on the system), so the data never hit disk as regular files.

So if you have another program called otherprog that takes three arguments, you can use the following syntax.

otherprog <(grep -v '^@' reads.sam) <(grep '^chr1' genes.gff3 | cut -f 4-6) <(blastdbcmd -db prot.fa -entry_batch seqids.txt)

In this example, the otherprog program is able to read input from 3 separate data streams without the need to store the data in temporary data files. This is extremely convenient and I have already begun to make extensive use of this syntax in my daily work.
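
If you are curious about what the program actually sees, try echoing a process substitution; on most Linux systems it expands to a /dev/fd path (the exact descriptor number will vary).

echo <(ls)
# /dev/fd/63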

PS: Props to @climagic for sharing this in the first place, and to @vsbuffalo, whose blog post (which I saw after writing this post; see comments) provided some better terminology!