Comparing lists of IDs, filenames, or other strings is something I do on a regular basis. When I was an undergrad, I remember using a Perl script someone in our lab had written to look at two files and perform simple set operations (pull out the intersection of two lists, or the union, or unique values from one list or the other). Over the years, as the need to perform such tasks has frequently recurred, I’ve repeatedly had to dig through my old files looking for the script.
Recently, the need to do some set operations came up again, but rather than scraping around for this script I figured I should learn how to Do It the Right Way, i.e., perform the task using standard UNIX command(s). Enter the comm command.
I’m guessing “comm” is short for “common”. It is designed precisely for the use case I described above. It takes two files (each assumed to be sorted lexicographically) and produces three tab-indented columns of output. The first column contains lines found only in the first file, the second column contains lines found only in the second file, and the third column contains lines found in both files. The command has flags that exclude one or more of the columns of output (-1, -2, and -3), and some implementations (the BSD and macOS versions, for instance, but not GNU coreutils) also offer a -i flag for case-insensitive comparison. For example, if you want to pull out just the values found in both file1 and file2 (the intersection), you would use the following command.
comm -12 file1 file2
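To make the column suppression concrete, here is a minimal sketch on two tiny made-up lists. The gene IDs, the scratch directory, and the variable names are all hypothetical, and the files are sorted first because comm assumes sorted input. The union at the end uses sort -u rather than comm, which is my own addition and not something from the original script.

```shell
# Scratch directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)

# Two small, hypothetical ID lists, deliberately left unsorted.
printf '%s\n' gene3 gene1 gene2 > "$tmpdir/file1"
printf '%s\n' gene4 gene2 gene3 > "$tmpdir/file2"

# comm expects sorted input; use the same locale for sort and comm
# so both agree on the ordering.
LC_ALL=C sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
LC_ALL=C sort "$tmpdir/file2" > "$tmpdir/file2.sorted"

# Intersection: suppress columns 1 and 2, leaving lines found in both files.
both=$(LC_ALL=C comm -12 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

# Unique to file1: suppress columns 2 and 3.
only1=$(LC_ALL=C comm -23 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

# Unique to file2: suppress columns 1 and 3.
only2=$(LC_ALL=C comm -13 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

# Union: not a comm feature as such; sort -u merges and de-duplicates.
union=$(LC_ALL=C sort -u "$tmpdir/file1" "$tmpdir/file2")

echo "in both:       $(echo $both)"    # gene2 gene3
echo "only in file1: $only1"           # gene1
echo "only in file2: $only2"           # gene4
```

Running comm with no flags on the two sorted files prints all three columns at once, with the second and third columns indented by one and two tabs respectively.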
If you wanted to pull out the values unique to file1 using case-insensitive comparison, you would use the following command.
comm -23i file1 file2
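If your comm does not accept -i, one possible workaround (an assumption on my part, not something from the thread) is to fold both files to lowercase before sorting and comparing. The sample names and variable names below are hypothetical, and note that the output comes back lowercased rather than in its original case.

```shell
# Scratch directory with two hypothetical lists that differ only in case
# for some entries.
tmpdir=$(mktemp -d)
printf '%s\n' Alpha beta GAMMA > "$tmpdir/file1"
printf '%s\n' alpha Delta gamma > "$tmpdir/file2"

# Fold each file to lowercase, then sort in a fixed locale for comm.
tr '[:upper:]' '[:lower:]' < "$tmpdir/file1" | LC_ALL=C sort > "$tmpdir/file1.lc"
tr '[:upper:]' '[:lower:]' < "$tmpdir/file2" | LC_ALL=C sort > "$tmpdir/file2.lc"

# Values unique to file1, ignoring case (columns 2 and 3 suppressed).
only1_ci=$(LC_ALL=C comm -23 "$tmpdir/file1.lc" "$tmpdir/file2.lc")

echo "$only1_ci"   # beta
```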
Today’s lesson is brought to you by this thread on ServerFault@StackExchange.