RTFM: grep, find, wc, head, tail

We are in an age where success in biology is eventually going to require some data processing. There is only so much you can do with Excel, so basic programming skills and familiarity with the UNIX command line are becoming increasingly essential to biologists.

When I was first cutting my bioinformatics teeth, I would typically approach data processing tasks by writing a dedicated script or program. As I gained more experience with programming and with UNIX shell commands, my scripts became shorter, more concise, and more efficient, and I began to pipe these together with shell commands on a regular basis. Also, since so many of my data processing tasks were one-off jobs, I started replacing my scripts with short little in-line programs that I wrote and executed all in a single shell command. This makes it even easier to leverage the powerful command line tools that UNIX offers.

Despite the experience I have gained, I know there is a lot more worth learning. In a recent post, I lamented the fact that I had been using Perl one-liners for a simple task that a single UNIX command could have handled all along. I decided that taking the time to really familiarize myself with 10-20 of the most common UNIX commands will in the long run save me time and make me a better, more efficient biologist.

So here is the first installment of my exploration into UNIX commands that I thought I already knew how to use! 🙂

A quick note: I primarily used the Fedora Linux CLI and man pages for this exercise. I understand that Linux ≠ UNIX, so it’s possible that not all of the details I discuss here will be portable across all UNIX-like systems. However, most of them should be pretty consistent across different Linux distributions and OS X, which are the OSs used by the vast majority of scientists who need to do any appreciable amount of programming or data processing.

grep

The grep command is used to search a file for a given word or pattern and print any line containing the word/pattern. I use this tool frequently, but found a few new options of which I was not previously aware.

  • -c: Rather than printing the matching lines, this option prints the number of matching lines. Already a time saver! When I want to count the number of lines spit out by grep, I typically pipe the output into a wc -l command. This will save me the hassle and may even improve performance when processing large data files!
  • -l/-L: Using the -l option will tell grep not to print out the lines matching the given word/pattern, but to print out the names of any files containing at least one match. This can be helpful when processing multiple files. The -L option does just the opposite: it instructs grep to print out the names of files that do not contain a match.
  • -m NUM: This option instructs grep to stop searching a file after NUM matches are found.
  • -n: An extremely useful feature, this option instructs grep to print out the line number before each line printed to the terminal. This can be very helpful when searching code or data files.
  • -A/-B NUM: These options allow you to control the context of each match that grep prints to the terminal. The -B option will print NUM leading lines before each match, and the -A option will print NUM trailing lines after each match.

Other useful options I use frequently are -i, -v, -w, -x, and -r. These are definitely worth looking up.
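
To make the grep options above concrete, here is a small sketch that can be run in a scratch directory. The file names (sample.fa, notes.txt) and the sequences in them are made up purely for illustration, and the comments describe what I would expect GNU grep to print.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny, made-up FASTA file plus a file with no header lines (both hypothetical).
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > sample.fa
printf 'no headers here\n' > notes.txt

# -c: count matching lines directly instead of piping to wc -l
grep -c '>' sample.fa              # prints 3

# -l / -L: list the files with / without at least one match
grep -l '>' sample.fa notes.txt    # prints sample.fa
grep -L '>' sample.fa notes.txt    # prints notes.txt

# -m NUM: stop reading the file after NUM matches
grep -m 2 '>' sample.fa            # prints >seq1 and >seq2

# -n: prefix each matching line with its line number
grep -n 'GGCC' sample.fa           # prints 4:GGCC

# -A NUM: add NUM lines of trailing context (here, the sequence under a header)
grep -A 1 '>seq2' sample.fa        # prints >seq2 followed by GGCC
```

The pattern is quoted in every command because an unquoted > would be interpreted by the shell as output redirection. The scratch directory can be removed afterwards with rm -rf "$workdir".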

find

The find command is used to search the directory structure, print out filenames, and possibly process the files. In the past, I’ve really only used this command in two ways: to look for a single particular file or to recursively print out every single filename in a directory (which I then usually pipe to grep to filter). After studying the manual a bit, though, I found that find is a much more powerful tool than I had realized. Here are some useful options.

  • -d/-depth: process each directory’s contents before the directory itself (a post-order traversal)
  • -mindepth/-maxdepth: control how deep or shallow you want the command to descend into the directory structure
  • -empty: only process empty files or directories
  • -executable/-readable/-writable: only process files or directories that the current user can execute, read, or write, respectively
  • -anewer file: only process files that were accessed more recently than the file file was modified
  • -name pattern: only process files that match the pattern pattern
  • -newer file: only process files that were modified more recently than the file file
  • -path pattern: only process files whose full path matches the pattern pattern
  • -size n: only process files that use n units of space, where a suffix on n such as c, k, M, or G selects the unit (use -n for less than n and +n for more than n)
  • -delete: delete any matching file or directory; I would definitely have used this option a lot had I known about it!
  • -exec/-execdir command: execute the command command for each matching file (-execdir option runs that command from the directory containing the matched file); see the manual for details about placement of the matching filename in the command
  • -printf format: for each match, print a string using the format format; see the manual for a huge list of useful escapes and directives

There are other potentially useful options to be sure, but these are the ones that got me excited.
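
Here is a rough sketch of how a few of these find options fit together. The directory layout (data/run1, data/run2) and file names are invented for the example, and -printf is, as far as I know, specific to GNU find, so that line may fail on OS X.

```shell
# Build a small, made-up directory tree in a scratch location.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p data/run1 data/run2
printf 'ACGT\n' > data/run1/reads.fa
printf 'GGCC\n' > data/run2/reads.fa
: > data/run2/empty.log            # create a zero-byte file

# -maxdepth and -name: FASTA files at most two levels below data/
find data -maxdepth 2 -name '*.fa' | sort

# -empty (restricted to regular files with -type f)
find data -type f -empty           # prints data/run2/empty.log

# -exec: {} is replaced by each matching path; \; ends the command
find data -name '*.fa' -exec wc -l {} \;

# -printf (GNU find): %f is the file's base name, %s its size in bytes
find data -name '*.fa' -printf '%f %s\n'

# -delete: remove the empty files that were listed above
find data -type f -empty -delete
```

Since -delete cannot be undone, it seems prudent to run the same command without -delete first and check the list of matches before deleting anything.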

wc

The wc command is used to get word counts and various other counts for a given file (or each file in a given list of files). I use wc almost exclusively to get line counts, but there are a couple of other potentially useful features.

  • -m: print the number of characters in the file
  • -l: print the number of lines (newline characters) in the file
  • -L: print the length of the longest line in the file
  • -w: print the number of words in the file

Not quite as powerful as the previous two commands, but still very useful.
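
A quick sketch of the four counts on a made-up three-line file (seqs.txt is a hypothetical name). The -L option comes from the GNU version of wc, so it may be missing on other systems.

```shell
# Create a small example file in a scratch directory.
workdir=$(mktemp -d)
printf 'ACGT\nGGCCTTAA\nAC GT\n' > "$workdir/seqs.txt"

# Reading from standard input keeps the filename out of the output.
wc -l < "$workdir/seqs.txt"    # lines: 3
wc -w < "$workdir/seqs.txt"    # words: 4
wc -m < "$workdir/seqs.txt"    # characters, newlines included: 20
wc -L < "$workdir/seqs.txt"    # length of the longest line: 8
```

Redirecting the file into wc, rather than passing it as an argument, is handy in scripts because the output is just the number.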

head

By default, the head command prints out the first 10 lines of a given file. If multiple files are given, it prints out a small header with the filename before each one. This is a convenient way to do a quick inspection of a file without opening it. It’s also a pretty simple command, but you do have a bit of control over the output.

  • -c K: print the first K bytes of each file; prepend K with - to print all but the last K bytes of each file
  • -n K: print the first K lines of each file; prepend K with - to print all but the last K lines of each file
  • -q: do not print headers (e.g. when processing multiple files)
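
The following sketch tries each of these head options on two made-up files of numbers (numbers.txt and more.txt are hypothetical names). The negative form of -n is the behavior described in the GNU man page and may not work with other implementations of head.

```shell
# Two small files: the numbers 1-20 and 101-120, one per line.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 20 > numbers.txt
seq 101 120 > more.txt

# -n K: the first K lines
head -n 3 numbers.txt                 # prints 1, 2, 3

# -n -K: everything except the last K lines (GNU head)
head -n -17 numbers.txt               # prints 1, 2, 3

# -c K: the first K bytes (here "1", newline, "2", newline)
head -c 4 numbers.txt

# -q: suppress the filename headers when several files are given
head -q -n 1 numbers.txt more.txt     # prints 1 and 101
```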

tail

The tail command is complementary to the head command and works in a very similar way. The difference is that head shows the beginning of the file, while tail shows the end of the file. Here are the cognate options to head.

  • -c K: print the last K bytes of each file; prepend K with + to start printing at the Kth byte until the end of the file
  • -n K: print the last K lines of each file; prepend K with + to start printing at the Kth line until the end of the file
  • -q: do not print headers (e.g. when processing multiple files)
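
As a sketch of where the + form is useful, the example below uses a made-up tab-separated table (lengths.tsv, a hypothetical name) with a header row. The last command combines head and tail to pull out a range of lines, which is my own illustration rather than anything from the man pages.

```shell
# A small table with a header line and three data rows.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id\tlength\nseq1\t120\nseq2\t95\nseq3\t240\n' > lengths.tsv

# -n K: the last K lines
tail -n 2 lengths.tsv                 # prints the seq2 and seq3 rows

# -n +K: start at line K, i.e. skip the header when K is 2
tail -n +2 lengths.tsv                # prints the three data rows

# -c K: the last K bytes (here exactly the final row and its newline)
tail -c 9 lengths.tsv                 # prints the seq3 row

# head and tail together: take the first 3 lines, then the last 2 of those
head -n 3 lengths.tsv | tail -n 2     # prints lines 2-3 (seq1 and seq2)
```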