Piping and file handles in Perl

Recently I developed a Perl script for processing protein sequences to extract features for machine learning. One of the tasks of this script is to generate a position-specific substitution matrix (PSSM) using PSI-BLAST. The data I’m working with includes sequences for thousands of proteins, so I didn’t really want to create an individual Fasta file for each one. But if you run PSI-BLAST with a Fasta file containing multiple sequences, a PSSM is only generated for one of the sequences (the last one I think?). I needed to come up with a way to feed sequences to PSI-BLAST one-by-one without creating thousands of small sequence files.

Two recollections helped me formulate the solution I ultimately implemented for this problem. First, I remembered that all of the BLAST programs can read data from STDIN. Second, I remembered using Unix pipes with Perl file handles before (when I was first learning Perl) to read the output of a system call, and I figured the same approach could be used to print data to a Perl file handle which would then send the data to the system call via STDIN.

After a bit of experimentation, I verified that Perl file handles can both read from and write to Unix pipes. If you want to call a system command and capture its output, you would do it like this.

open($CMD, "some_command |");
while(my $line = <$CMD>)
  # store and/or process each line of the
  # command output

On the other hand, you can also call a system command and send data to that command through a pipe (STDIN). Let’s say in your Perl program, you have some data (in a string called $data) that you want to send as input to some program (called some_program)–you would do it like this.

open($CMD, "| some_command");
print $CMD $data;

If you want to capture the output of the command to a file, simply add the > symbol to redirect to output.

open($CMD, "| some_command > output.txt");
print $CMD $data;

My script uses BioPerl to parse the Fasta data. In the final draft of my script, I simply use the open command to create a new file handle for each sequence, pass that file handle to a Bio::SeqIO object, and write the sequence to it.

while($seq = $instream->next_seq())
  my $id = $seq->id;
  open($CMD, "| psiblast -out_ascii_pssm $id.pssm ...some more arguments...");
  my $oustream = Bio::SeqIO->new( -fh => $CMD, -format => 'Fasta' );

  # Once the previous command finishes, I can now process the PSSM file
  # stored in "$id.pssm"

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s