Executable makefiles

I’ve talked before about the merits of using Make and makefiles to implement bioinformatics pipelines. But I had a revelation today: just as one can use the shebang (#!) to run Perl, Python, Ruby, Bash, or other scripts without calling the interpreter directly, I should be able to use the shebang to make a makefile executable. Simply place #!/usr/bin/env make on the first line of the makefile, make sure you have execute permissions, and then you’re all set, right?

Well, there’s one gotcha. The problem with using just #!/usr/bin/env make as the shebang is that the name of the script being executed is implicitly appended as the final argument to the interpreter command. Make therefore treats the makefile as a target rather than as an actual makefile. With that in mind, the fix is simple: add the -f flag to indicate that the script is a makefile and not a target. A big thank you goes out to this StackOverflow thread for providing a clear solution to this problem.

The example below uses a trivial makefile to demonstrate how executable makefiles can be written and executed.

[standage@lappy make-demo] ls -lh run-it 
-rwxr-xr-x  1 standage  staff   218B Nov 21 23:53 run-it
[standage@lappy make-demo] cat ./run-it 
#!/usr/bin/env make -f

MESSAGE=Dude

all:            shoutout.txt dup.txt
                

clean:          
                rm -f shoutout.txt dup.txt

shoutout.txt:   
                echo $(MESSAGE) > shoutout.txt

dup.txt:        shoutout.txt
                cp shoutout.txt dup.txt
                cat dup.txt
[standage@lappy make-demo] ./run-it 
echo Dude > shoutout.txt
cp shoutout.txt dup.txt
cat dup.txt
Dude
[standage@lappy make-demo] ./run-it clean
rm -f shoutout.txt dup.txt
[standage@lappy make-demo] ./run-it MESSAGE=Sweet
echo Sweet > shoutout.txt
cp shoutout.txt dup.txt
cat dup.txt
Sweet
[standage@lappy make-demo]

Shuffling columns of a tabular file with cut and paste

I’ve been working on an RNA-Seq analysis recently and discussing the results with my advisor. Based on his suggestion, I’ve decided to try to get a baseline proportion of transcripts reported as differentially expressed by shuffling labels and re-running the analysis. This type of approach, a permutation test, is quite common in statistics.

The input to the differential expression software I am using (and indeed most others) is a tabular plain text file where the first column contains a molecule label/ID and each subsequent column contains that molecule’s expression level for a given sample. Since the software I’m using requires all samples corresponding to a condition to be adjacent, shuffling the condition label means shuffling the columns of this table.

I’ve talked about the cut and paste commands before (here and here), but I’ve perhaps missed their canonical usage until now. Here are the steps that I used to create shuffled files for the permutation test.

  1. Create a new file for each column of the table using the cut command: cut -f 1 for the molecule IDs, cut -f 2 for the expression levels for the first sample, cut -f 3 for the expression levels for the second sample, etc.
  2. Use the paste command to put the columns back together in a new order.

I created an asciicast demonstrating this on a small dummy data set: see http://asciinema.org/a/5714.
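
For reference, here’s a minimal sketch of the same idea on a hypothetical five-column table (the file names and column layout below are made up for illustration):

cut -f 1 counts.tsv > ids.txt       # molecule IDs
cut -f 2 counts.tsv > sampleA.txt   # expression levels for sample A
cut -f 3 counts.tsv > sampleB.txt
cut -f 4 counts.tsv > sampleC.txt
cut -f 5 counts.tsv > sampleD.txt

# reassemble the table with two columns swapped, effectively shuffling the labels
paste ids.txt sampleA.txt sampleC.txt sampleB.txt sampleD.txt > counts-shuffled.tsv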

Bash tricks: getopts and die signal handling

For better and for worse, Perl has been my go-to scripting language for years. I’ve since learned Python, and can appreciate why it has won so many enthusiasts in the programming community (in general) and the scientific computing community (in particular). However, I’m all about using the most convenient tool for the job, and sometimes the best glue for Your Little Bioinformatics Tool is a makefile, or even just a simple little shell script.

Recently I was writing a bash script to implement a very simple procedure, stringing together the results of several calls to small scripts and programs I had written. As is typical for bash scripts I have written in the past, I used positional command-line arguments for any values I needed to adjust on a run-by-run basis, and then accessed these in the script using the variables $1, $2, and so on.

As I started running the script to do my analyses, I began thinking there had to be a better way to do this: a way to make some arguments optional and others required, something like getopts. Well, a simple Google search solved that one for me. A few minutes later, I had put a nice command-line interface on my bash script. The syntax is really pretty simple.

# Usage statement
print_usage()
{
  cat <<EOF
Usage: $0 [options] genomeseq.fasta annotation.gff3
  Options:
    -c    some important cutoff value; default is 0.2
    -d    debug mode
    -h    print this help message and exit
    -o    file to which output will be written; default is 'ylt.txt'
    -t    home directory for YourLittleTool; default is '/usr/local/src/YourLittleTool'
EOF
}

# Command-line option parsing
CUTOFF=0.2
DEBUG=0
YLTHOME="/usr/local/src/YourLittleTool"
OUTFILE="ylt.txt"
while getopts "c:dho:t:" OPTION
do
  case $OPTION in
    c)
      CUTOFF=$OPTARG
      ;;
    d)
      DEBUG=1
      ;;
    h)
      print_usage
      exit 0
      ;;
    o)
      OUTFILE=$OPTARG
      ;;
    t)
      YLTHOME=$OPTARG
      ;;
  esac
done

# Remove arguments associated with options
shift $((OPTIND-1))

# Verify the two required positional arguments are there
if [[ $# -ne 2 ]]; then
  echo -e "error: please provide 2 input files (genome sequence file (Fasta format) and annotation file (GFF3 format))\n"
  print_usage
  exit 1
fi
FASTA=$1
GFF3=$2

# Now implement the procedure of your little tool

So on one hand, this does add quite a bit to a bash script that originally had only 4-8 lines of logic. But on the other hand, with not too much work on my part, it now has a convenient and self-documenting interface that will make it much easier for someone else in my lab (or, if I’m so lucky, someone “out there”) to use it in the future.

As I was sprucing up the bash script, I also decided to investigate another feature I was interested in. This particular procedure creates a new directory, into which several data files, graphics, and HTML reports are written. If the procedure failed and terminated prematurely, I wanted the default behavior to be that the output directory gets deleted so as not to interfere with subsequent run attempts (and of course I provided an option to keep the output on failure, which is essential for troubleshooting bugs in the pipeline). I had already added set -e to the script, which kills execution of the script if any command returns an unsuccessful status. While this is very convenient, it would have made it pretty complicated to delete incomplete output at whatever stage the pipeline happened to fail.

Enter trap. This builtin associates a handler function with various signals and conditions, one of which is bash’s ERR condition, triggered when a command exits with a non-zero status (and thus, with set -e in effect, when the script terminates with an error).

die_handler()
{
  # OUTDIR holds the output directory created earlier in the script
  if [[ $DEBUG -eq 0 ]]; then
    rm -r "$OUTDIR"
  fi
}
trap die_handler ERR

The trap statement above essentially says: if an error causes premature script termination, run the die_handler function.
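
To make the moving pieces concrete, here is a minimal, self-contained sketch of how set -e, trap, and the output directory might fit together (the directory name and the placeholder pipeline steps are illustrative, not part of the original script):

#!/usr/bin/env bash
set -e                        # abort the script if any command fails

DEBUG=0
OUTDIR=ylt-output

die_handler()
{
  if [[ $DEBUG -eq 0 ]]; then
    rm -r "$OUTDIR"           # discard incomplete output on failure
  fi
}
trap die_handler ERR

mkdir "$OUTDIR"
# ...pipeline steps writing into "$OUTDIR" go here; if any of them fails,
# die_handler runs before the script exits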

I’ve always considered bash scripts to be pretty hackish, and I’m not sure this experience has completely changed that opinion (lipstick on a pig?). However, for this particular case I was very happy I was able to combine the convenience of a bash script with the flexibility and power provided by getopts and event-based error handling.

RTFM: paste

Recently I’ve used this blog to document my quest for better understanding of basic UNIX commands and their utility in data processing (here, here, and here). I have since come across another command that has already proven itself extremely useful.

paste

If you read the manual for the paste command or try the default usage via the command line, you will surely be underwhelmed. This is from the man page…

Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.

…and this is the basic usage.

[standage@lappy ~] cat bogus.data 
dude
sweet
awesome
cool
[standage@lappy ~] paste - < bogus.data 
dude
sweet
awesome
cool
[standage@lappy ~]

Looks like just another cat command, huh? Well, note the dash symbol following the paste command. This is commonly used to indicate that a program/command should read from standard input rather than from a file. The paste magic begins when you start adding additional dashes.

[standage@lappy ~] paste - - < bogus.data 
dude	sweet
awesome	cool
[standage@lappy ~]

This time it read two lines and printed them out on a single line separated by a tab. Increasing the number of dashes will increase the number of lines from the input that paste will print together on one line of output. This is a simple command, but it can be extremely useful for processing data files where a complete record is stored on a fixed number of lines (such as with the fastq format, where a sequence corresponds to 4 lines).

Recently, my advisor had some interleaved Fastq files but wanted to run a software package that expected mate pairs to be placed in separate files. Before searching for a program to split the files or writing one to do it himself, he sent me a quick note asking whether we had already installed any programs on our server that would do this. I responded and suggested he try the following command.

paste - - - - - - - - < interleaved.fastq | \
perl -ne '@v = split(/\t/); printf("%s\n%s\n%s\n%s\n", @v[0..3]); printf(STDERR "%s\n%s\n%s\n%s", @v[4..7]);' \
> 1.fq 2> 2.fq

In this one-liner (well, I spread it over 3 lines for readability), the paste command reads in 8 lines (2 fastq records corresponding to a mate pair) and combines those 8 lines into a single line with 8 tab-delimited values. The Perl command then splits the input, writes the first 4 values to the standard output and the second 4 values to the standard error. Redirect stdout and stderr to different files, and you’ve got your paired Fastq files!

Credit for introducing me to this command goes to this post, which has a couple of additional examples.

RTFM: comm

Comparing lists of IDs, filenames, or other strings is something I do on a regular basis. When I was an undergrad, I remember using a Perl script someone in our lab had written to look at two files and perform simple set operations (pull out the intersection of two lists, or the union, or unique values from one list or the other). Over the years, as the need to perform such tasks has frequently recurred, I’ve repeatedly had to dig through my old files looking for the script.

Recently, the need to do some set operations came up again, but rather than scraping around for this script I figured I should learn how to Do It the Right Way, i.e., perform the task using standard UNIX commands. Enter the comm command.

comm

I’m guessing “comm” is short for common. It is designed precisely for the use case I described above. It takes two files (assumed to be sorted lexicographically) and produces 3 columns of output. The first column corresponds to values found only in the first file, the second column corresponds to values found only in the second file, and the third column corresponds to values found in both files. The command has flags that enable case-insensitive comparisons and, more relevant to the question at hand, exclusion of one or more of the columns of output. For example, if you want to pull out just the values found in both file1 and file2 (the intersection), you would use the following command.

comm -12 file1 file2

If you wanted to pull out the values unique to file1 using case-insensitive comparison, you would use the following command.

comm -23i file1 file2
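
Since comm expects sorted input, a complete workflow might look something like the following (the file names are placeholders):

sort ids-run1.txt > a.sorted
sort ids-run2.txt > b.sorted
comm -12 a.sorted b.sorted > intersection.txt   # IDs present in both lists
comm -23 a.sorted b.sorted > only-in-run1.txt   # IDs unique to the first list
comm -13 a.sorted b.sorted > only-in-run2.txt   # IDs unique to the second list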

Today’s lesson is brought to you by this thread on ServerFault@StackExchange.

SSH and file sharing with guest operating system

I’m a big fan of Apple products. I really like the OS X operating system, and I love the hardware on which it runs. I also love that OS X is built on a UNIX core, which complements the graphical interface with a solid command-line interface that runs a lot of scientific and open-source software out-of-the-box. However, some of the scientific tools I use are difficult (if not impossible) to configure, program, and run without a Linux operating system. So my solution for several years has been to run a Linux virtual machine via VMware or VirtualBox on my Apple hardware. When I need the Linux environment (which is most of the time), I fire up the Linux guest OS and get to work. Despite how far Linux GUIs have come, however, I still prefer OS X for web browsing, Skype, and just about anything else that doesn’t involve the command line.

This week I decided that I would invest some time figuring out how to interact with my guest OS (Linux) while staying completely in my host OS (Mac OS X), essentially treating my guest VM as a remote machine. This amounted to two tasks: opening up shell access and exposing the file system.

Shell access

Enabling shell access was pretty simple. On the guest side, all I had to do was install OpenSSH (sudo apt-get install openssh-server for Debian-based distros). On the host side, I simply had to create a host-only network in VirtualBox and edit my VM’s network settings. Here’s exactly what I did.

  • In VirtualBox, open Preferences and select the Network tab
  • Add a host-only network by clicking the “+” icon, and then click OK
  • Back in the main VirtualBox window, select the virtual machine, click ‘Settings’, and select the Network tab
  • Adapter 1 should be attached to NAT, so select Adapter 2, enable it, and attach it to the host-only network just created. Click OK

At this point, the virtual machine should respond to a ping and an SSH login attempt. Just use the ifconfig command in the guest machine to determine its IP address.
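
For example (the username and address below are placeholders; VirtualBox host-only networks typically hand out 192.168.56.x addresses by default):

# in the guest, note the IP address assigned to the host-only interface
ifconfig
# then, from the OS X host
ssh username@192.168.56.101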

File system

I do a fair amount of coding using command-line text editors such as vim and nano. However, when I’m coding for hours and hours, I do prefer a graphical interface, so being able to open files on the guest machine (Linux) with a text editor on my host machine (OS X) required some file sharing. VirtualBox provides some extensions to facilitate file sharing from the host to the guest, but what I wanted required sharing the other way around: from guest to host.

I ended up installing the Samba file sharing service. I followed the instructions in these tutorials (here and here) pretty closely, so rather than rehashing those points I’ll just leave the links.
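
For the curious, the heart of the setup is a share definition in /etc/samba/smb.conf along these lines (the share name, path, and user are placeholders; see the linked tutorials for the full details):

[projects]
   path = /home/username/projects
   valid users = username
   read only = no
   browseable = yes

After adding the share, it’s a matter of setting a Samba password for the user (sudo smbpasswd -a username) and restarting the Samba service.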

Once Samba was up and running, I was able to connect by opening Finder, selecting “Connect to Server” (Command-K), and then entering smb:// followed by the guest machine’s IP address.

Command-line magic for your gene annotations

In the GFF3 specification, Lincoln Stein claims that the primary reason tab-delimited annotation formats have persisted for so many years (despite the existence of a variety of worthy alternative formats) is the ease with which tab-delimited data can be edited and processed on the command line. In this post I wanted to see what kinds of information are easily extracted from a GFF3 file using basic command-line tools (i.e. no reusable parsers as provided by BioPerl, BioPython, et al).

Determine all annotated sequences

The GFF3 spec includes the ##sequence-region pragma, whose purpose is to specify boundary coordinates for each annotated sequence in the file. However, I see this pragma neglected more often than not, both by data files and by tools. Plus, there is no strict requirement that the file include features corresponding to each ##sequence-region entry. So if you want to determine which sequences a given GFF3 file really provides annotations for, it’s best to look instead at the first column of each feature.

Here is a command that will print out all of the sequences annotated by a given GFF3 file. A detailed explanation is provided below.

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq -c | sort -rn | head
  • The first cut command will attempt to extract and print the 1st and 9th column of each line in the file. We ultimately don’t need the 9th column, but it’s helpful at this point since it allows us to distinguish our entries of interest (features) from other entries (directives, pragmas, comments, and sequence data). This command will output two fields for each feature, while it will only output a single field for all the other entry types. This enables filtering out of non-features with subsequent commands.
  • The grep command will check each line of the previous command to see whether it contains a tab character (separating two fields). Only lines that contain a tab are printed. The output of this command is all of the features from the GFF3 file.
  • The second cut command will cut out the first field from each feature and ignore the other, since we weren’t really interested in it in the first place. The output of this command is the sequence corresponding to each feature.
  • The first sort command will (surprise!) sort the output of the previous command.
  • The uniq command will collapse identical adjacent lines and print out the number of times each line is seen. The output of this command is what we’re looking for and really we could stop here. The next two commands are simply for convenience.
  • The second sort command will sort the sequence IDs according to the number of corresponding annotated features.
  • The head command will simply ensure that you only see the first 10 lines of output (useful if your GFF3 file includes thousands of sequences, such as scaffolds from a draft assembly).

Determine annotation types

Perhaps the most common use for the GFF3 format is encoding protein-coding gene structures. However, the format is very flexible and can leverage the Sequence Ontology to provide annotations for just about anything you could imagine related to biological sequences. So when working with annotations from an external source, it’s always a good idea to verify your assumptions about precisely what type of annotations are provided.

Here is a command that will print out all of the feature types contained in a given GFF3 file, along with a detailed explanation.

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq -c | sort -rn | head
  • We’re interested only in features, so the grep command will print out all lines that do not start with a pound symbol.
  • The cut command will cut out the third column of each feature and print it. This command will automatically ignore any unwanted data that may not have been filtered out by the previous command (such as sequence data), since those data do not contain tabs.
  • The purpose of the remaining commands is identical to the previous example: sort the output, collapse adjacent lines that are identical, and print out each line and the number of times it occurs. Again, the last two commands are provided only for convenience so that your terminal is not flooded with data.

Determine the number of genes

The previous command gives the number of occurrences of each feature type in the GFF3 file. However, if you are only interested in the number of occurrences of a single feature type (say, genes), then there is a much simpler solution.

grep -c $'\tgene\t' yourannots.gff3

Replace gene with your desired feature type and that’s that!
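
For instance, to count mRNA features instead of genes:

grep -c $'\tmRNA\t' yourannots.gff3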

Extract all gene IDs

The previous example can easily be extended to pull out all the IDs for a given feature type. Here is a command that will print out all gene IDs for a given GFF3 file, with a corresponding explanation. This command is easily extended to other feature types.

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)' | head
  • As in the previous example, the grep command will print only lines that contain gene features.
  • The Perl one-liner uses a regular expression to identify the ID attribute and print it.
  • The head command is only for convenience and simply makes sure you don’t flood your terminal with IDs.

Print length of each gene

The previous examples can also be extended quite easily to print out the length of all features of a given type. This is very helpful if you want to analyze and/or visualize a distribution of feature lengths. Here is a command that will print out the length of each gene in a given GFF3 file. Again, this is easily adapted to other feature types.

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | \
                                   perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)' | \
                                   sort -rn | head
  • As in the previous example, the grep command will print only lines that contain gene features.
  • The cut command will extract and print the 4th and 5th field of each feature, which corresponds to its start and end coordinates.
  • The Perl one-liner will print out the length of each feature given the coordinates.
  • As in previous examples, the remaining commands are solely for aesthetic convenience.

This command would typically be entered on a single line in your terminal. I only broke it up into multiple lines for readability.

Example

I downloaded gene annotations from TAIR to verify that these commands work as expected. Below is my terminal output.

standage@iMint ~ $ wget ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
--2012-08-10 15:16:45--  ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
           => `TAIR10_GFF3_genes.gff'
Resolving ftp.arabidopsis.org (ftp.arabidopsis.org)... 171.66.71.56
Connecting to ftp.arabidopsis.org (ftp.arabidopsis.org)|171.66.71.56|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Genes/TAIR10_genome_release/TAIR10_gff3 ... done.
==> SIZE TAIR10_GFF3_genes.gff ... 44139005
==> PASV ... done.    ==> RETR TAIR10_GFF3_genes.gff ... done.
Length: 44139005 (42M) (unauthoritative)

100%[=====================================================================================================================>] 44,139,005  6.67M/s   in 7.2s    

2012-08-10 15:16:54 (5.87 MB/s) - `TAIR10_GFF3_genes.gff' saved [44139005]

standage@iMint ~ $ cut -s -f 1,9 TAIR10_GFF3_genes.gff | grep $'\t' | cut -f 1 | sort | uniq -c | sort -rn | head
 157712 Chr1
 135017 Chr5
 113968 Chr3
  91857 Chr2
  90371 Chr4
    723 ChrM
    616 ChrC
standage@iMint ~ $ grep -v '^#' TAIR10_GFF3_genes.gff | cut -s -f 3 | sort | uniq -c | sort -rn | head
 215909 exon
 197160 CDS
  35386 protein
  35386 mRNA
  34621 five_prime_UTR
  30634 three_prime_UTR
  28775 gene
   3911 mRNA_TE_gene
   3903 transposable_element_gene
   1274 pseudogenic_exon
standage@iMint ~ $ grep -c $'\tgene\t' TAIR10_GFF3_genes.gff
28775
standage@iMint ~ $ grep $'\tgene\t' TAIR10_GFF3_genes.gff | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)' | head
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01046
AT1G01050
AT1G01060
AT1G01070
AT1G01073
AT1G01080
standage@iMint ~ $ grep $'\tgene\t' TAIR10_GFF3_genes.gff | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)' | sort -rn | head
31258
26435
25965
23544
19753
19352
18492
18184
17943
17555
standage@iMint ~ $

Komodo Edit for Linux

I program on so many different machines and in so many different environments that no one text editor stands out as my favorite. I frequently work on remote machines where only SSH access is available, and so I have developed skills with the command-line editors vim and nano. Even when I’m working on my local machine, I will frequently stick to the command line. When I feel like I need/prefer a graphical text editor, I typically use TextWrangler or Fraise on Mac and gedit on Linux (and Notepad++ on Windows, but it’s been years since I’ve touched a Windows machine for anything serious). I’ve been intrigued by more feature-rich editors and IDEs before (Eclipse and NetBeans come to mind), but I always felt like these tools got in the way more than they helped.

I remember briefly using an editor called Komodo on an iMac in my undergraduate research lab. The other day, I learned that Komodo is available for Linux (Komodo Edit is free, Komodo IDE requires a paid license). Since my memories of Komodo from a few years ago were positive (or at least non-negative), I decided to install it and give it a shot. The install process was a breeze and I was soon at work on a coding task (Perl in this case). I was immediately at home with this editor: it offers everything you want in a text editor (including some nice features you typically find only in an IDE, such as syntax checking and function hint popups), but it does not get in the way of your programming.

Komodo Edit can be obtained for any platform at this download page. The installation process for Linux is quite simple, and I have included my notes below.

sudo mkdir /usr/local/src/KOMODO
cd /usr/local/src/KOMODO
sudo mv ~/Komodo-Edit-7.0.2-9923-linux-x86_64.tar.gz .
sudo tar xzf Komodo-Edit-7.0.2-9923-linux-x86_64.tar.gz
cd Komodo-Edit-7.0.2-9923-linux-x86_64/
sudo ./install.sh -I /usr/local/src/KOMODO/Komodo-Edit-7
echo 'export PATH=/usr/local/src/KOMODO/Komodo-Edit-7/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
komodo

This did a system-wide install of Komodo Edit. If you do not have administrative privileges or if you only want to install the program for a single user, you can leave off the -I option and use the suggested default (which is in the current user’s home directory).

Replacing multiple strings with sed

I was chatting recently with a colleague about a data processing task he needed help with. After discussing things for a bit, it was clear that he simply needed a massive search and replace job. Of course, it would be easy to write a Perl or Python script to do this, but it would be a shame not to take advantage of the speed and convenience of the tools that already exist as part of the UNIX command line!

My colleague sent me a file that mapped each of the target strings to their replacement, in this format.

replacement1,target1a,target1b,target1c,...
replacement2,target2a,target2b,target2c,...
...

I saved this file as mapping.txt, and then created a simple sed script from it using the following Perl one-liner (I’ve added a line break for readability).

perl -ne 'chomp(); s/\s*$//; @v = split(/\s*,\s*/); if(@v > 0){ $k = shift(@v);
foreach $val(@v){printf("s/%s/%s/g\n", $val, $k)} }' < mapping.txt > replacements.sed

The sed script looks like this; it’s easy enough to create manually if you have a small number of replacements, but I definitely needed to script it for this task.

s/target1a/replacement1/g
s/target1b/replacement1/g
s/target1c/replacement1/g
...
s/target2a/replacement2/g
s/target2b/replacement2/g
s/target2c/replacement2/g
...

Finally, here is the command I used to do the replacement.

sed -f replacements.sed < data.txt > data-new.txt

That’s all!

RTFM: touch, ls, xargs, cut, sort, uniq

This is the second installment of my quest to explore and better understand some common UNIX commands I thought I already knew. I had planned on covering sed and awk in this installment, but it turns out that these tools are more complex than I thought. They are not simply command-line utilities; they are powerful text manipulation tools with languages of their own. Perhaps I will review them in the future, but for the time being I cannot give them the thorough treatment they deserve.

touch

The touch command is designed to update timestamps associated with files. Running touch on a file or set of files will update the access and modification timestamps to the current time. If you try to touch a file that does not exist, the command will create an empty file with that filename for you.

This is a very simple command (borderline trivial), and in the past I have used it almost exclusively for creating new empty files. However, updating file timestamps can be useful in a variety of contexts and knowing how to do this programmatically is useful. For example, in many supercomputing environments, some disk partitions are checked regularly and any file that has not been accessed or modified in the last 7 days is deleted. If you have some important data files on a scratch disk that you’re not currently using, but don’t want to lose (you’re going to use them soon), then updating the timestamps on the data files is what you need to do. Of course, you can open each file in vim or nano, but this becomes ridiculous if you have a lot of data files. Instead, simply run touch on the files to update their timestamps.
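
As a quick sketch of that last scenario (the path is a placeholder):

# refresh the access and modification times on every file under a scratch directory
find /scratch/myproject -type f -exec touch {} +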

Here are a few useful options I just learned.

  • -a: change only the access time, not the modification time
  • -c: only update timestamps for existing files, do not create any new files
  • -d STRING: instead of using the current time, set the timestamp to STRING; the value of STRING is flexible and can be anything from Sun, 29 Feb 2004 16:21:42 -0800 to 2004-02-29 16:21:42 to next Thursday
  • -m: change only the modification time, not the access time

ls

The ls command is one of the first any UNIX user learns. It is used to list the contents of the current working directory. In my UNIX working environments, I alias the command list to ls -lhp. There are a few other options for sorting and otherwise managing the output of this command.

  • -a/-A: list all directory contents, including hidden files beginning with ., the current working directory ., and the parent directory .. (the -A option excludes these last two)
  • -d: treat files normally, but for directories, list the directory itself instead of directory contents
  • -h: make output more human readable (such as when used with -l option)
  • -l: print using a detailed listing format
  • -p: append / indicator to directories
  • -r: reverse the order of sorting
  • -R: list subdirectories recursively
  • -S: sort by file size
  • -t: sort by modification time
  • -X: sort by file extension

xargs

The xargs command is used to dynamically build and execute commands on the command line. Typically the output of other commands or programs is piped into xargs, which is then used to dynamically run commands based on that output.

The basic usage of xargs is as follows.

somecommand | xargs -I % someothercommand -arg1 value1 -arg2 % sometext

The -I option indicates which character(s) in the following commands should be replaced by the xargs input (in this example I chose the % character, but you’ve got some flexibility there). For example, if the somecommand command generates two lines of output (foo and bar), then the commands executed by xargs would be as follows.

someothercommand -arg1 value1 -arg2 foo sometext
someothercommand -arg1 value1 -arg2 bar sometext

Here is a non-trivial example.

find /data/runs -mindepth 4 -type d | xargs -I % mv % /data/backup

This command will start in the /data/runs directory, look for any directories that are nested at least 4 levels deep, and then move them to the /data/backup directory.

There are a few options that allow you to make slight modifications to xargs behavior.

  • -a file: read input from the given file rather than from the standard input
  • -d delim: use the given delimiter instead of the default newline character
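
For example, the -a option lets xargs read its input list from a file rather than a pipe (the file and destination below are placeholders):

xargs -a filelist.txt -I % cp % /data/backup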

cut

The cut command is designed to process data files (especially files in tabular format) and extract relevant data. For example, if you have a tab- or comma-delimited file with several columns, the cut command can be used to cut out particular columns from that file. This is a very useful command in bioinformatics, despite the fact that it’s pretty simple and straightforward. However, the manual did teach me a few options that I wasn’t aware of before. Here are some helpful options.

  • -d delim: use the given delimiter instead of the default tab character
  • -f FIELDS: extract the given fields (columns) from the file (separate field/column numbers with commas)
  • --complement: extract the complement of the fields specified by -f
  • -s: only process lines that contain the delimiter; this can be useful for files that contain comments or other types of metadata that you don’t want to process
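
For example, combining two of these options (the file name is a placeholder):

# print every column except the second, skipping lines that contain no tab at all
cut --complement -s -f 2 annotations.tsv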

sort

The sort command will (you guessed it!) sort the lines of input. Looking at the sort manual didn’t reveal any spectacularly interesting options for this command, but it does provide a variety of different ways to sort the input (in contrast to the default ascii-cographical order). Knowing these options is helpful.

  • -d: sort by dictionary order, only considering blanks and alphanumeric characters
  • -f: case-insensitive sort
  • -h: sort by human-readable number value (the human-readable values generated by other UNIX commands, such as 2K or 1G)
  • -n: numeric sort
  • -R: random sort
  • -r: reverse the natural order of the sort
  • -o FILE: write output to FILE instead of the standard output
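
The -h option, for instance, pairs nicely with commands that already produce human-readable sizes:

# largest items in the current directory, biggest first
du -sh * | sort -rh | head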

uniq

The uniq command is useful for reporting and counting duplicated lines of input. This command expects the input to be sorted ascii-cographically, so it is often used in conjunction with the sort command. By default, uniq will print the input with adjacent duplicate lines collapsed to a single occurrence. However, there are a few options that enable you to adapt this default behavior.

  • -c: along with each line, print the number of occurrences of that line in the input
  • -d: only print lines with more than one occurrence in the input
  • -i: ignore case differences when comparing lines
  • -s N: skip the first N characters when comparing lines
  • -u: only print lines that occur once in the input
  • -w N: only compare N characters when comparing lines
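
For example, to report any IDs that appear more than once in a list (the file name is a placeholder):

sort ids.txt | uniq -d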