Category: Sysadmin

Data backup with crontab and iput

This week we had a system error which led to catastrophic data loss in my lab. Fortunately it was restricted to a single virtual machine, but unfortunately that virtual machine happened to be my personal workbench. Inconvenient losses include data, scripts, and notes from recent projects and from just about all of my graduate coursework. The absolutely devastating loss, however, was my electronic lab notebook, which was hosted as a wiki on the machine. Luckily I had done some backups of my lab notebook, but the most recent one I could find was from May 2013. As happy as I am to have avoided losing my entire graduate lab notebook, losing 8-9 months’ worth is still heartbreaking.

So I just finished doing what I should have done a long time ago: automate my backup procedure. Enter the crontab command, which lets you edit a system file specifying commands for your system to run at regular intervals. Using crontab, you can set up cron jobs to run hourly, daily, weekly, or on just about any schedule you need. Here are a few examples.

# Execute 'somecommand' command at the beginning of every hour
0 * * * * somecommand

# Execute 'somecommand' at 1:00am and 1:00pm every day.
0 1,13 * * * somecommand

# Execute '/home/standage/check-submissions' every minute from midnight
# through 2:59am on Mondays and Wednesdays.
* 0-2 * * 1,3 /home/standage/check-submissions

You can run man crontab on your system for a complete description of available options.


As for the specifics of my backup procedure, I decided a weekly backup would be sufficient: specifically, at 2am on Saturday morning, when the chances of me doing any research are pretty slim. So I ran crontab -e to open my crontab file and added the following entry.

0 2 * * 6 /home/standage/bin/wiki-backup

The file /home/standage/bin/wiki-backup is an executable bash script that includes the commands needed to perform each backup. This particular script creates a gzip-compressed tar archive of my lab notebook and then copies it over the network to my directory in the iPlant data store using the iput command. If I had Box or Dropbox installed on that machine, I could just as easily have replaced the iput command with a cp command that copies the backup to the directory on my local system that syncs with the cloud.

#!/usr/bin/env bash

# Build a date-stamped filename for this week's backup
cdate=$(date '+%F')
backdir=/home/standage/Backups/labnotebook
backfile="labnotebook.$cdate.tar.gz"

# Create a gzip-compressed tar archive of the lab notebook wiki
cd /var/www/html
tar czf "$backdir/$backfile" labnotebook

# Copy the archive to the iPlant data store
cd "$backdir"
iput -V "$backfile" Backups
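
As mentioned above, the iput step could be swapped for a cp into a cloud-synced folder; here's a minimal sketch (the destination path is only an example).

# Alternative to iput: copy the archive into a folder synced with Dropbox
# (the path below is just an example)
cp "$backfile" /home/standage/Dropbox/Backups/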

Hopefully this example gives a clear idea of what is possible with cron jobs and how easy it is to set up automatic backups for your critical research data.

Executable makefiles

I’ve talked before about the merits of using Make and makefiles to implement bioinformatics pipelines. But I had a revelation today: just as one can use the shebang (#!) to run Perl, Python, Ruby, Bash, or other scripts without calling the interpreter directly, I should be able to use the shebang to make a makefile executable. Simply place #!/usr/bin/env make on the first line of the makefile, make sure you have execute permissions, and then you’re all set, right?

Well, there’s one gotcha. The problem with using just #!/usr/bin/env make as the shebang is that the name of the script being executed is implicitly appended as the final argument to the interpreter command. Make therefore treats the makefile as a target rather than as a makefile. With that in mind, the fix is simple: add the -f flag to indicate that the script is a makefile and not a target. A big thank you goes out to this StackOverflow thread for providing a clear solution to this problem.

The example below uses a trivial makefile to demonstrate how executable makefiles can be written and executed.

[standage@lappy make-demo] ls -lh run-it 
-rwxr-xr-x  1 standage  staff   218B Nov 21 23:53 run-it
[standage@lappy make-demo] cat ./run-it 
#!/usr/bin/env make -f

MESSAGE=Dude

all:            shoutout.txt dup.txt
                

clean:          
                rm -f shoutout.txt dup.txt

shoutout.txt:   
                echo $(MESSAGE) > shoutout.txt

dup.txt:        shoutout.txt
                cp shoutout.txt dup.txt
                cat dup.txt
[standage@lappy make-demo] ./run-it 
echo Dude > shoutout.txt
cp shoutout.txt dup.txt
cat dup.txt
Dude
[standage@lappy make-demo] ./run-it clean
rm -f shoutout.txt dup.txt
[standage@lappy make-demo] ./run-it MESSAGE=Sweet
echo Sweet > shoutout.txt
cp shoutout.txt dup.txt
cat dup.txt
Sweet
[standage@lappy make-demo]

Batch installation of R packages

Recently, I was writing a shell script to automate the installation of some bioinformatics tools on a particular platform. One of the tools depended on the R package “VGAM”. Installing packages manually is trivial—you simply fire up R, type install.packages("YourPackageName"), select your closest mirror, and then BAM! you’re done.

I’ve used the shebang-able Rscript before to run R one-liners on the command line (try Rscript -e 'rnorm(5)' at your prompt), so I figured batch installation of packages would be just as simple, right? Well, yes and no. If you try to run the install.packages function as-is using Rscript, the command will fail because no mirror has been specified.

[standage@lappy ~] Rscript -e 'install.packages("VGAM")'
Error in contrib.url(repos, type) : 
  trying to use CRAN without setting a mirror
Calls: install.packages -> .install.macbinary -> contrib.url
Execution halted

A bit of Google searching provided a couple of solutions to this problem: either use the chooseCRANmirror function…

# Use the getCRANmirrors() function to see the full list of mirrors;
# the ind argument selects a mirror by its position in that list
chooseCRANmirror(ind=76)
install.packages("VGAM")

…or simply pass the mirror URL as the repos argument.

install.packages("VGAM", repos="http://ftp.ussg.iu.edu/CRAN/")

I ended up going with the second one, and it worked beautifully!
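
In the installation shell script, that boils down to a one-liner passed to Rscript; here's a minimal sketch using the same mirror and package as above (adjust both for your own setup).

# Non-interactive R package installation from a shell script
Rscript -e 'install.packages("VGAM", repos="http://ftp.ussg.iu.edu/CRAN/")'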

RTFM: paste

Recently I’ve used this blog to document my quest for better understanding of basic UNIX commands and their utility in data processing (here, here, and here). I’ve now come across another command that has already proven itself extremely useful.

paste

If you read the manual for the paste command or try the default usage via the command line, you will surely be underwhelmed. This is from the man page…

Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.

…and this is the basic usage.

[standage@lappy ~] cat bogus.data 
dude
sweet
awesome
cool
[standage@lappy ~] paste - < bogus.data 
dude
sweet
awesome
cool
[standage@lappy ~]

Looks like just another cat command, huh? Well, note the dash symbol following the paste command. This is commonly used to indicate that a program/command should read from standard input rather than from a file. The paste magic begins when you start adding additional dashes.

[standage@lappy ~] paste - - < bogus.data 
dude	sweet
awesome	cool
[standage@lappy ~]

This time it read two lines and printed them out on a single line separated by a tab. Increasing the number of dashes will increase the number of lines from the input that paste will print together on one line of output. This is a simple command, but it can be extremely useful for processing data files where a complete record is stored on a fixed number of lines (such as with the fastq format, where a sequence corresponds to 4 lines).

Recently, my advisor had some interleaved Fastq files but wanted to run a software package that expected mate pairs to be placed in separate files. Before searching for a program to split the files or writing one to do it himself, he sent me a quick note asking whether we had already installed any programs on our server that would do this. I responded and suggested he try the following command.

paste - - - - - - - - < interleaved.fastq | \
perl -ne '@v = split(/\t/); printf("%s\n%s\n%s\n%s\n", @v[0..3]); printf(STDERR "%s\n%s\n%s\n%s", @v[4..7]);' \
> 1.fq 2> 2.fq

In this one-liner (well, I spread it over 3 lines for readability), the paste command reads in 8 lines (2 fastq records corresponding to a mate pair) and combines those 8 lines into a single line with 8 tab-delimited values. The Perl command then splits the input, writes the first 4 values to the standard output and the second 4 values to the standard error. Redirect stdout and stderr to different files, and you’ve got your paired Fastq files!
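
The reverse operation (re-interleaving two paired files into one) can be done with the same paste trick plus bash process substitution; here's a sketch, assuming 1.fq and 2.fq list the mates in matching order.

# Collapse each 4-line record to one line, alternate records from the two files,
# then expand the tabs back into newlines
paste -d '\n' <(paste - - - - < 1.fq) <(paste - - - - < 2.fq) | tr '\t' '\n' > interleaved.fastq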

Credit for introducing me to this command goes to this post, which has a couple additional examples.

RTFM: comm

Comparing lists of IDs, filenames, or other strings is something I do on a regular basis. When I was an undergrad, I remember using a Perl script someone in our lab had written to look at two files and perform simple set operations (pull out the intersection of two lists, or the union, or unique values from one list or the other). Over the years, as the need to perform such tasks has frequently recurred, I’ve repeatedly had to dig through my old files looking for the script.

Recently, the need to do some set operations came up again, but rather than scraping around for this script I figured I should learn how to Do It the Right Way, i.e., perform the task using standard UNIX commands. Enter the comm command.

comm

I’m guessing “comm” is short for common. It is designed precisely for the use case I described above. It takes two files (assumed to be sorted lexicographically) and produces 3 columns of output. The first column corresponds to values found only in the first file, the second column corresponds to values found only in the second file, and the third column corresponds to values found in both files. The command has flags that enable case-insensitive comparisons and, more relevant to the question at hand, exclusion of one or more of the columns of output. For example, if you want to pull out just the values found in both file1 and file2 (the intersection), you would use the following command.

comm -12 file1 file2

If you wanted to pull out the values unique to file1 using case-insensitive comparison, you would use the following command.

comm -23i file1 file2
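
One caveat: comm expects both input files to be sorted. If yours aren't, bash process substitution lets you sort them on the fly (file names here are just placeholders).

# Intersection of two unsorted lists
comm -12 <(sort file1) <(sort file2)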

Today’s lesson is brought to you by this thread on ServerFault@StackExchange.

Compression with pigz

Yes, another post about compression programs. No, data compression is not an area of particular research interest for me, but I’ve been dealing with so much data recently that I’m really looking for better and quicker ways to compress, decompress, and transfer data.

The zlib website hosts the home page for pigz, a parallel implementation of the UNIX program gzip. It compiled very quickly and cleanly out-of-the-box on several platforms (Fedora, Red Hat, OS X) and works just like gzip, bzip2, or any other compression program would on the command line.

# Here is how I would compress a tarball...
tar cf - $DATA/*.fastq | pigz -p 16 > $WD/RawReads.tar.gz
# ...and here is how you would decompress.
pigz -p 16 -d -c $WD/RawReads.tar.gz | tar xf -
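
If you're using GNU tar, another option is to let tar invoke the compressor itself via the --use-compress-program flag, which avoids writing the pipe by hand; pigz uses all available processors by default, so something like this should behave much like the commands above.

# GNU tar only: tar runs pigz for both compression and decompression
tar --use-compress-program=pigz -cf $WD/RawReads.tar.gz $DATA/*.fastq
tar --use-compress-program=pigz -xf $WD/RawReads.tar.gz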

The performance improvement is significant, so initially I was very excited about this finding. However, after a few uses I did encounter a case in which I had issues decompressing a particularly large tarball that had been created with pigz. It appears that the tarball was corrupted somehow during the compression process.

Definitely a program worth checking out. I’m cautiously optimistic that my troubles have just been a fluke or the result of some mistake on my part, but I’m not betting the farm on it yet.

Displaying hidden files and system directories in Mac OS X

By default, the Mac operating system (OS X) makes hidden files and certain system directories invisible when using file open and save dialog boxes. This is (likely) to make it harder for inexperienced users to overwrite something critical to the system’s performance. However, if you know what you’re doing and need to, for instance, open a PDF manual in a subdirectory of /usr/local/src, this feature can be a hassle.

With a file open or save dialog open, press the command key (⌘), the shift key, and the period key simultaneously to show hidden files and system directories. Pressing the same key combo subsequently will toggle the view on and off.

SSH and file sharing with guest operating system

I’m a big fan of Apple products. I really like the OS X operating system, and I love the hardware on which it runs. I also love that OS X is built on a UNIX core, which complements the graphical interface with a solid command-line interface that runs a lot of scientific and open-source software out-of-the-box. However, some of the scientific tools I use are difficult (if not impossible) to configure, program, and run without a Linux operating system. So my solution for several years has been to run a Linux virtual machine via VMware or VirtualBox on my Apple hardware. When I need the Linux environment (which is most of the time), I fire up the Linux guest OS and get to work. Despite how far Linux GUIs have come, however, I still prefer OS X for web browsing, Skype, and just about anything else that doesn’t involve the command line.

This week I decided that I would invest some time figuring out how to interact with my guest OS (Linux) while staying completely in my host OS (Mac OS X)–essentially treating my guest VM as a remote machine. This amounted to two tasks: opening up shell access and exposing the file system.

Shell access

Enabling shell access was pretty simple. On the guest side, all I had to do was install OpenSSH (sudo apt-get install openssh-server for Debian-based distros). On the host side, I simply had to create a host-only network in VirtualBox and edit my VM’s network settings. Here’s exactly what I did.

  • In VirtualBox, open Preferences and select the Network tab
  • Add a host-only network by clicking the “+” icon, and then click ok
  • Back in the main VirtualBox menu, select the virtual machine, click ‘Settings’, and select the Network tab
  • Adapter 1 should be attached to NAT, so select Adapter 2, enable it, and attach it to the host-only network just created. Click ok

At this point, the virtual machine should respond to a ping and an SSH login attempt. Just use the ifconfig command in the guest machine to determine its IP address.
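
For example (the address below is hypothetical; use whatever ifconfig reports for the host-only interface on your guest):

# On the guest: note the address of the host-only interface
ifconfig
# On the host: check connectivity, then log in
ping 192.168.56.101
ssh standage@192.168.56.101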

File system

I do a fair amount of coding using command-line text editors such as vim and nano. However, when I’m coding for hours and hours, I do prefer using a graphical interface, so being able to open files on the guest machine (Linux) with a text editor on my host machine (OS X) required some file sharing. VirtualBox provides some extensions to facilitate file sharing from the host to the guest, but what I wanted required sharing the other way around–from guest to host.

I ended up installing the Samba file sharing service. I followed the instructions in these tutorials (here and here) pretty closely, so rather than rehashing those points I’ll just leave the links.

Once Samba was up and running, I was able to connect by opening Finder, selecting “Connect to server” (command-K), and then entering smb:// followed by the guest machine’s IP address.
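
If you prefer the terminal, the same connection can be made with mount_smbfs; the share name and address below are just placeholders for whatever you configured in Samba.

# Mount the Samba share from the OS X command line (share name and IP are examples)
mkdir -p ~/vm-share
mount_smbfs //standage@192.168.56.101/homes ~/vm-share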

Working with lots of files

I’ve worked with plenty of large-scale data sets before, but these are typically large-scale in the sense of file size. Recently, I worked with a large-scale data set that was quite different: nucleotide and protein sequence data for tens of thousands (maybe even hundreds of thousands) of plant species, stored in hundreds of thousands of files in the same directory.

I needed to do some basic management tasks such as indexing the data for BLAST searches. My typical approach to highly repetitive tasks is to use a bash for loop–something like this.

for db in *.fasta
do
  makeblastdb -in $db -dbtype nucl -parse_seqids
done

However, this did not work with so many files in the directory. In fact, my entire approach to file management was absolutely useless. I couldn’t even use ls to list the files or see how many there were!

[standage@bgdata PlantDNA]$ ls *.fasta | wc -l
-bash: /bin/ls: Argument list too long
0

I did some searching on Google and found a variety of solutions. The one I like best, though, involves the find command.

find . -type f -name '*.fasta' -exec makeblastdb -in {} -dbtype nucl -parse_seqids \;

That did the trick!
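
Another option I've seen suggested pairs find with xargs, which also makes it easy to run several jobs at once; here's a sketch, assuming none of the filenames contain whitespace (the -P value is arbitrary).

# Index each FASTA file, running up to 4 makeblastdb processes in parallel
find . -type f -name '*.fasta' | xargs -I {} -P 4 makeblastdb -in {} -dbtype nucl -parse_seqids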

Setting up a Vine server for secure remote desktop access to your Mac

I connect to computers remotely on a daily basis. Most of what I do can be done on the command line, so I typically use SSH. However, sometimes it is nice to have a graphical interface. In the past, I’ve used X11 port forwarding to remotely access GUIs for individual Linux programs, and then later a VNC server to give remote access to the entire desktop environment. Recently, I was very excited to learn that the same thing can be accomplished on Mac OS X using Vine server. The setup and connection process is fairly simple.

  1. Download the Vine Server installer to your Mac, open it, and copy the program to your hard drive.
  2. Run the Vine server. This will set up a VNC server accessible through port 5900.
  3. At this point, you could use VNC Viewer to connect to your Mac remotely. However, this would be an unencrypted connection. For security reasons, connect to your Mac with SSH and tunnel the remote port 5900 to an unused local port–say, port 5909.
    ssh -L 5909:localhost:5900 location.of.your.mac.edu
  4. Once the encrypted SSH connection is established, you can connect to the remote desktop by pointing your VNC Viewer to localhost:5909. VNC Viewer will warn of an unencrypted connection, but this isn’t a problem since you are connecting to a port on your local machine, which is being fed by an encrypted connection to your remote Mac.

Voilà, you can now connect securely to your remote Mac’s desktop!
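
If you'd rather not keep a terminal window open just for the tunnel, ssh can run it in the background: -f backgrounds the connection after authentication and -N skips running a remote command.

# Same tunnel as above, detached into the background
ssh -f -N -L 5909:localhost:5900 location.of.your.mac.edu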

P.S. I’m assuming you are making an SSH connection to your Mac from a Linux/UNIX machine or from another Mac. If you are connecting to your Mac from a Windows machine via PuTTY, see these instructions for establishing the tunnel from the Mac’s port 5900 and your local port 5909.