RTFM: touch, ls, xargs, cut, sort, uniq

This is the second installment of my quest to explore and better understand some common UNIX commands I thought I already knew. I had planned on covering sed and awk in this installment, but it turns out that these tools are more complex than I thought. They are not simply command line utilities; they are powerful text manipulation tools with languages of their own. Perhaps I will review them in the future, but for the time being I cannot give them the attention a thorough treatment deserves.

touch

The touch command is designed to update timestamps associated with files. Running touch on a file or set of files will update the access and modification timestamps to the current time. If you try to touch a file that does not exist, the command will create an empty file with that filename for you.

This is a very simple command (borderline trivial), and in the past I have used it almost exclusively for creating new empty files. However, being able to update file timestamps programmatically comes in handy in a variety of contexts. For example, in many supercomputing environments, some disk partitions are checked regularly and any file that has not been accessed or modified in the last 7 days is deleted. If you have some important data files on a scratch disk that you’re not currently using, but don’t want to lose (you’re going to use them soon), then updating the timestamps on the data files is what you need to do. Of course, you can open each file in vim or nano, but this becomes ridiculous if you have a lot of data files. Instead, simply run touch on the files to update their timestamps.

Here are a few useful options I just learned.

  • -a: change only the access time, not the modification time
  • -c: only update timestamps for existing files, do not create any new files
  • -d STRING: instead of using the current time, set the timestamp to STRING; the value of STRING is flexible and can be anything from Sun, 29 Feb 2004 16:21:42 -0800 to 2004-02-29 16:21:42 to next Thursday
  • -m: change only the modification time, not the access time
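To make these options concrete, here is a minimal sketch that assumes GNU touch and GNU stat (the -d STRING parsing and stat -c are GNU features). The scratch directory and the file names results.dat and missing.dat are made up for illustration.

```shell
# Scratch directory so the example touches nothing real.
workdir=$(mktemp -d)

touch "$workdir/results.dat"       # file does not exist yet, so it is created empty
touch -c "$workdir/missing.dat"    # -c: the missing file is NOT created
touch -d '2004-02-29 16:21:42' "$workdir/results.dat"      # set both timestamps
touch -m -d '2010-01-01 00:00:00' "$workdir/results.dat"   # modification time only

# stat -c is GNU-specific: %y is the modification time, %x the access time.
mtime=$(stat -c '%y' "$workdir/results.dat")
atime=$(stat -c '%x' "$workdir/results.dat")
echo "modified: $mtime"
echo "accessed: $atime"

# Refresh every file under a directory in one go (the scratch-disk use case).
find "$workdir" -type f -exec touch {} +
```

The last line is the one I would reach for on a scratch partition: find hands every regular file to touch, so nothing has to be opened by hand.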

ls

The ls command is one of the first any UNIX user learns. It lists the contents of a directory (by default, the current working directory). In my UNIX working environments, I alias the command list to ls -lhp. There are a few other options for sorting and otherwise managing the output of this command.

  • -a/-A: list all directory contents, including hidden files beginning with ., the current working directory ., and the parent directory .. (the -A option excludes these last two)
  • -d: treat files normally, but for directories, list the directory itself instead of directory contents
  • -h: make output more human readable (such as when used with -l option)
  • -l: print using a detailed listing format
  • -p: append / indicator to directories
  • -r: reverse the order of sorting
  • -R: list subdirectories recursively
  • -S: sort by file size
  • -t: sort by modification time
  • -X: sort by file extension
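Here is a small sketch of a few of these options in combination, assuming GNU ls. The directory contents (big.txt, small.csv, .hidden, subdir) are invented for the example.

```shell
# Build a small scratch directory to list.
workdir=$(mktemp -d)
printf 'aaaaaaaaaa' > "$workdir/big.txt"     # 10 bytes
printf 'a' > "$workdir/small.csv"            # 1 byte
: > "$workdir/.hidden"                       # empty hidden file
mkdir "$workdir/subdir"

ls -lhp "$workdir"     # detailed listing, human-readable sizes, / after directories
ls -d "$workdir"       # the directory itself rather than its contents
ls -ltr "$workdir"     # sort by modification time, reversed (oldest first)

# -A shows hidden files but leaves out . and ..
all=$(ls -A "$workdir")

# -S sorts largest first; -p marks directories so they can be filtered out,
# since a directory's own size can exceed that of small files.
largest=$(ls -Sp "$workdir" | grep -v '/$' | head -n 1)
echo "largest regular file: $largest"
```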

xargs

The xargs command is used to dynamically build and execute commands on the command line. Typically the output of another command or program is piped into xargs, which then runs commands built from that output.

The basic usage of xargs is as follows.

somecommand | xargs -I % someothercommand -arg1 value1 -arg2 % sometext

The -I option indicates which character(s) in the following commands should be replaced by the xargs input (in this example I chose the % character, but you’ve got some flexibility there). For example, if the somecommand command generates two lines of output (foo and bar), then the commands executed by xargs would be as follows.

someothercommand -arg1 value1 -arg2 foo sometext
someothercommand -arg1 value1 -arg2 bar sometext
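The substitution is easy to check safely by putting echo in front of the command, so that xargs prints each command line instead of running it. In this sketch, printf stands in for somecommand, and someothercommand is just the placeholder name from above.

```shell
# printf plays the role of somecommand and emits two lines: foo and bar.
# echo makes xargs print the command it would have built for each line.
built=$(printf 'foo\nbar\n' | xargs -I % echo someothercommand -arg1 value1 -arg2 % sometext)
printf '%s\n' "$built"
# prints:
# someothercommand -arg1 value1 -arg2 foo sometext
# someothercommand -arg1 value1 -arg2 bar sometext
```

Prefixing with echo like this is a cheap dry run before letting xargs loose on something destructive.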

Here is a non-trivial example.

find /data/runs -mindepth 4 -type d | xargs -I % mv % /data/backup

This command will start in the /data/runs directory, look for any directories that are nested at least 4 levels deep (-mindepth sets only a lower bound; add -maxdepth 4 to match exactly 4 levels), and then move them to the /data/backup directory.
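Below is a variant of that command that can be tried without risk, run against a throwaway directory tree instead of /data. It is only a sketch: the directory names are made up, -maxdepth 4 is added so that only directories exactly 4 levels deep match, and find -print0 with xargs -0 is swapped in so that names containing spaces survive intact.

```shell
# Throwaway tree: runs/a/b/c/<run> puts each run directory at depth 4.
root=$(mktemp -d)
mkdir -p "$root/runs/a/b/c/run 1" "$root/runs/a/b/c/run2" "$root/backup"

# -print0 / -0 pass names NUL-separated, so "run 1" is treated as one item.
find "$root/runs" -mindepth 4 -maxdepth 4 -type d -print0 |
    xargs -0 -I % mv % "$root/backup"

ls "$root/backup"
```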

There are a few options that allow you to make slight modifications to xargs behavior.

  • -a file: read input from the given file rather than from the standard input
  • -d delim: split the input on the given delimiter character instead of the default whitespace (blanks and newlines)
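As a quick sketch of these two options together, assuming GNU xargs (both -a and -d are GNU extensions): the list file and its contents are invented for the example.

```shell
# A comma-separated list with no trailing newline, stored in a temporary file.
listfile=$(mktemp)
printf 'alpha,beta,gamma' > "$listfile"

# -a reads items from the file instead of standard input;
# -d ',' splits them on commas rather than on whitespace.
items=$(xargs -a "$listfile" -d ',' -I % echo "item: %")
printf '%s\n' "$items"
# prints:
# item: alpha
# item: beta
# item: gamma
```

Note that with -d the trailing newline of a file is no longer a separator, so it would end up as part of the last item; that is why the example writes the list without one.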

cut

The cut command is designed to process data files (especially files in tabular format) and extract the relevant data. For example, if you have a tab- or comma-delimited file with several columns, the cut command can be used to cut out particular columns from that file. This is a very useful command in bioinformatics, despite the fact that it’s pretty simple and straightforward. However, the manual did teach me a few options that I wasn’t aware of before. Here are some helpful options.

  • -d delim: use the given delimiter instead of the default tab character
  • -f FIELDS: extract the given fields (columns) from the file (separate field/column numbers with commas)
  • --complement: extract the complement of the fields specified by -f
  • -s: only process lines that contain the delimiter; this can be useful for files that contain comments or other types of metadata that you don’t want to process
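Here is a minimal sketch of these options on a tiny tab-delimited table, assuming GNU cut (--complement is a GNU extension). The gene/chromosome rows are invented sample data.

```shell
# A small tab-delimited table with a comment line that contains no tabs.
table=$(mktemp)
printf '# comment line without any tabs\n' > "$table"
printf 'gene1\tchr1\t100\t200\n' >> "$table"
printf 'gene2\tchr2\t300\t400\n' >> "$table"

# -f 1,3 keeps columns 1 and 3; -s drops the comment line (no delimiter in it).
cols=$(cut -s -f 1,3 "$table")

# --complement keeps everything EXCEPT columns 1 and 3.
rest=$(cut -s -f 1,3 --complement "$table")

# -d switches the delimiter from tab to comma.
second=$(printf 'a,b,c\n' | cut -d ',' -f 2)

printf '%s\n' "$cols" "$rest" "$second"
```

Without -s, the comment line would be passed through unchanged, because cut prints lines that contain no delimiter in full.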

sort

The sort command will (you guessed it!) sort the lines of input. Looking at the sort manual didn’t reveal any spectacularly interesting options for this command, but it does provide a variety of different ways to sort the input (in contrast to the default ascii-cographical order). Knowing these options is helpful.

  • -d: sort by dictionary order, only considering blanks and alphanumeric characters
  • -f: case-insensitive sort
  • -h: sort by human-readable number value (the human-readable values generated by other UNIX commands, such as 2K or 1G)
  • -n: numeric sort
  • -R: random sort
  • -r: reverse the natural order of the sort
  • -o FILE: write output to FILE instead of the standard output
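The difference between the sort modes is easiest to see side by side. This is a small sketch assuming GNU sort (the -h option in particular is a GNU extension), with made-up input values.

```shell
# -n compares whole numbers, so 9 sorts before 10 and 100.
numeric=$(printf '10\n9\n100\n' | sort -n)

# -r reverses whichever ordering is in effect.
reversed=$(printf '10\n9\n100\n' | sort -rn)

# -h understands size suffixes: no suffix < K < M < G.
sizes=$(printf '1G\n2K\n512\n10M\n' | sort -h)

# -o writes the result to a file instead of standard output.
outfile=$(mktemp)
printf 'pear\napple\n' | sort -o "$outfile"

printf '%s\n' "$numeric" "$reversed" "$sizes"
cat "$outfile"
```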

uniq

The uniq command is useful for reporting and counting duplicated lines of input. It only compares adjacent lines, so duplicates are detected reliably only when the input is sorted, and it is often used in conjunction with sort. By default, uniq prints the input with each run of adjacent duplicate lines collapsed into a single line. However, there are a few options that enable you to adapt this default behavior.

  • -c: along with each line, print the number of occurrences of that line in the input
  • -d: only print lines with more than one occurrence in the input
  • -i: ignore case differences when comparing lines
  • -s N: skip the first N characters when comparing lines
  • -u: only print lines that occur once in the input
  • -w N: only compare N characters when comparing lines
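A short sketch of the usual sort | uniq pairing and of what happens without the sort. The single-letter input lines are invented for the example.

```shell
# Six lines: a occurs twice, b three times, c once; no two equal lines are adjacent.
input=$(printf 'b\na\nb\nc\na\nb\n')

# Counting occurrences is the classic sort | uniq -c idiom.
printf '%s\n' "$input" | sort | uniq -c

dups=$(printf '%s\n' "$input" | sort | uniq -d)      # lines seen more than once
singles=$(printf '%s\n' "$input" | sort | uniq -u)   # lines seen exactly once

# Without sorting first, no duplicates are adjacent, so uniq removes nothing.
unsorted=$(printf '%s\n' "$input" | uniq)

printf 'duplicated: %s\n' $dups
printf 'unique: %s\n' "$singles"
```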