Concise, efficient string munging with Perl

Many problems in bioinformatics are modeled using text strings, and the Perl programming language is arguably the best tool for parsing, searching, slicing, and splicing text. With a basic understanding of regular expressions and a few other operators, text processing tasks that can be a nightmare in other programming languages are quite simple with Perl. Perl code can also be very concise without giving up any of that power.

Here are a few commands/constructs I’ve learned recently that have significantly reduced the complexity of a script I’ve been working on.

Counting characters
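Here is a simple example that counts the G and C bases in a string of DNA, the first step in calculating GC content. It can be adapted to many character counting scenarios. The last method shown uses the tr operator, which returns the number of characters it matched. Not only is it more concise than the alternatives using regular expression matching, it also has better performance.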

# My first approach, pretty naive
$gc_count = 0;
$gc_count++ while($dna =~ m/[GC]/g);

# Same idea with more concise (but cryptic!) syntax
$gc_count =()= $dna =~ m/[GC]/g;

# Better method; concise, clear, and better performance
$gc_count = $dna =~ tr/GC/GC/;
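
To go from a count to an actual GC content value, the tr count can be divided by the sequence length. The following is a minimal sketch, not part of the original script: the subroutine name gc_content is hypothetical, and counting lowercase bases as well is an assumption about the input.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of G/C bases in a DNA string.
sub gc_content {
    my ($dna) = @_;
    my $length = length $dna;
    return 0 unless $length;    # avoid dividing by zero on empty input

    # With an empty replacement list (and no /d), tr leaves the string
    # unchanged and simply returns the number of matching characters.
    my $gc_count = ($dna =~ tr/GCgc//);
    return $gc_count / $length;
}

printf("GC content: %.2f\n", gc_content("ATGCGCTA"));
```

Because tr works on a fixed character list rather than a pattern, adapting this to other counts (for example, ambiguous N bases) only requires changing the list.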

Replacing characters at the end of a string
Replacing every occurrence of a character in a string is trivial with Perl regular expressions. However, it’s a bit more complex if you want to replace only some occurrences, such as those at the end of the string. With the e modifier, you can instruct a substitution to evaluate the replacement clause as code, giving you a clean, concise way to handle more complex replacements. In the example below, I want to replace each character in a trailing run of Us and Ns with a T, so the string keeps its length (this can trivially be changed to the beginning or some other point of reference in the string).

# My first method, not very concise or clear
if($v =~ m/([UN]+)$/)
{
  my $length = length($1);
  # 4-argument substr-style assignment: overwrite the trailing run in place
  substr($v, (length($v) - $length), $length) = "T" x $length;
}

# Better method; concise and clear
$v =~ s/([UN]+)$/'T' x length($1)/e;
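
As a sketch of the "beginning of the string" variant mentioned above, the only change is swapping the $ anchor for ^. The subroutine name replace_leading_un is hypothetical, and working on a copy rather than modifying the caller's string is a choice made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a leading run of U/N characters with Ts.
sub replace_leading_un {
    my ($seq) = @_;    # $seq is a copy, so the caller's string is untouched

    # /e evaluates the replacement as Perl code: one T per matched character
    $seq =~ s/^([UN]+)/'T' x length($1)/e;
    return $seq;
}

print replace_leading_un("NNUACGTUN"), "\n";    # prints TTTACGTUN
```

Only the leading run is changed; the trailing UN in the example is left alone because the pattern is anchored with ^.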

Find coordinates of every string match
This is a common task in sequence analysis: given a sequence and a short string, find the coordinates of every subsequence that matches the given short string.

# Clear and concise
while($dna =~ m/($matchstring)/g)
{
  # This just prints out the coordinates, but you could store
  # them or do whatever you want.
  # $-[0] is the 0-based start of the match; $+[0] is the offset just past its end
  printf("match: %s, coords: (%d, %d)\n", $1, $-[0], $+[0]);
}
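
If the coordinates are needed later rather than just printed, the same loop can collect them into an array. This is a minimal sketch under a few assumptions: the subroutine name match_coords is hypothetical, the search string is treated as literal text (hence the \Q...\E quoting, which the loop above does not use), and matches are non-overlapping, as with any m//g loop.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] pairs for every match of $motif.
# Coordinates are 0-based, and end is the offset just past the match.
sub match_coords {
    my ($dna, $motif) = @_;
    my @coords;
    return @coords unless length $motif;    # an empty pattern has special meaning in Perl

    # \Q...\E quotes regex metacharacters so $motif is matched literally
    while ($dna =~ m/\Q$motif\E/g) {
        push @coords, [ $-[0], $+[0] ];
    }
    return @coords;
}

my @hits = match_coords("ATGCATGGATG", "ATG");
print join(", ", map { "($_->[0], $_->[1])" } @hits), "\n";
# prints (0, 3), (4, 7), (8, 11)
```

Dropping the \Q...\E quoting restores the behavior of the loop above, where the search string is interpreted as a regular expression.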
