Category: Perl

PHP one-liners

One of the big turning points in the development of my programming and data processing skills was embracing Perl not only as a language for writing scripts but also as a shell tool for basic processing tasks. For a long time I had basic familiarity with common shell tools such as grep, sort, cut, and so on, and had learned to do basic tasks by piping them together. I had also learned the power of writing short bits of Perl code directly on the command line and running it with perl -e or perl -ne. But as I’ve gained more experience with shell tools (and hey, even more experience with Perl), my default approach to data processing has changed. For many line-by-line processing tasks, stitching Perl together with other shell tools is often quicker and more concise than firing up a text editor and writing the whole processing procedure in Perl.

I’m a huge proponent of using the right tool for the right job. Some languages are designed and optimized for certain types of tasks, and of course a programmer’s experience and preference influence what “the right tool” for a particular job is. My first programming experience was with PHP, and I occasionally find myself “thinking in PHP” while I’m actually writing code in other languages, or wishing I could leverage one of PHP’s built-in functions without implementing the entire script in PHP.

Well, I came across this underwhelming post the other day. It looks like php -r is all that has been missing in my life…

Scripting a search and replace with Perl

I use Perl one-liners and the sed command quite frequently to do search-and-replace tasks with data and code files. For search-and-replace tasks that can’t easily be reduced to a s/search/replace/g command, Perl provides support for scripting within regular expressions.

I recently took advantage of this capability for a data processing task I had. I had a GFF3 file for which I needed to reformat all of the sequence IDs. The old IDs followed the pattern scaffold_0, scaffold_1, etc., while the new IDs needed to follow the pattern Scaf0001, Scaf0002, etc. I couldn’t simply replace each instance of scaffold_ with Scaf, since that would not pad the numbers with leading 0s as required. The following Perl one-liner did the trick (you may have to scroll right to see the whole thing).

perl -ne 's[scaffold_(\d+)]{$num = $1; $num++; $num = "0". $num while(length($num) < 4); qq[Scaf$num]}eg; print' < old.gff3 > new.gff3

Here’s the Perl broken up.

s[scaffold_(\d+)]
{
  $num = $1;
  $num++;
  $num = "0". $num while(length($num) < 4);
  qq[Scaf$num]
}eg;
print

This is a very useful feature, kludgy syntax notwithstanding.

Perl code standards: perltidy and perlcritic

When writing scientific software, my most important consideration (behind correctness and speed) is the maintainability of the source code. This is important not only so that others can reproduce my contributions to the community, but also for my own sake and sanity. Trying to edit source code you’ve ignored for 6 months (especially Perl code) is hard enough anyway. If you don’t take the time to write clean code, it just makes it that much more frustrating.

This week I discovered two static code analysis tools for Perl that should be useful to anyone concerned about the quality of their code: perltidy and perlcritic. The latter is designed to critique your code based on published best practices, while the former is designed to tidy up your code for you. perltidy is a script that is run from the command line, while perlcritic can be run from the command line, from your own code via the Perl::Critic module, or from their web service.

Over the last few years, I’ve developed my own ad hoc standards for clean Perl code. On the important details, my code is pretty similar to the standards recommended and enforced by perltidy and perlcritic. However, as I increase the severity of the perlcritic critique, it finds quite a few issues with my code. I’m not concerned that all of this necessarily indicates bad code–after all, maintainability is as much about style and preference as it is about structure and correctness. However, I am going to review each recommendation seriously and improve my code accordingly (unless I have a darn good reason to ignore the warning).

For a biologist just starting out with Perl programming, these two tools are a wonderful way of helping you to write good, clean, and maintainable Perl code!

Piping and file handles in Perl

Recently I developed a Perl script for processing protein sequences to extract features for machine learning. One of the tasks of this script is to generate a position-specific scoring matrix (PSSM) using PSI-BLAST. The data I’m working with includes sequences for thousands of proteins, so I didn’t really want to create an individual Fasta file for each one. But if you run PSI-BLAST on a Fasta file containing multiple sequences, a PSSM is only generated for one of the sequences (the last one, I think). I needed a way to feed sequences to PSI-BLAST one by one without creating thousands of small sequence files.

Two recollections helped me formulate the solution I ultimately implemented for this problem. First, I remembered that all of the BLAST programs can read data from STDIN. Second, I remembered using Unix pipes with Perl file handles before (when I was first learning Perl) to read the output of a system call, and I figured the same approach could be used to print data to a Perl file handle which would then send the data to the system call via STDIN.

After a bit of experimentation, I verified that Perl file handles can both read from and write to Unix pipes. If you want to call a system command and capture its output, you would do it like this.

open(my $CMD, "some_command |") or die("cannot run command: $!");
while(my $line = <$CMD>)
{
  # store and/or process each line of the
  # command output
}
close($CMD);

On the other hand, you can also call a system command and send data to that command through a pipe (its STDIN). Let’s say your Perl program has some data (in a string called $data) that you want to send as input to a command (called some_command). You would do it like this.

open(my $CMD, "| some_command") or die("cannot run command: $!");
print $CMD $data;
close($CMD);

If you want to capture the output of the command in a file, simply add the > symbol to redirect the output.

open(my $CMD, "| some_command > output.txt") or die("cannot run command: $!");
print $CMD $data;
close($CMD);

My script uses BioPerl to parse the Fasta data. In the final draft of my script, I simply use the open command to create a new file handle for each sequence, pass that file handle to a Bio::SeqIO object, and write the sequence to it.

while(my $seq = $instream->next_seq())
{
  my $id = $seq->id;
  open(my $CMD, "| psiblast -out_ascii_pssm $id.pssm ...some more arguments...")
    or die("cannot run psiblast: $!");
  my $outstream = Bio::SeqIO->new( -fh => $CMD, -format => 'Fasta' );
  $outstream->write_seq($seq);
  $outstream->close();

  # Once the previous command finishes, I can now process the PSSM file
  # stored in "$id.pssm"
}

Comparing human genome annotation builds

I have spent most of my time, both as an undergraduate and graduate student, working with data from plant genomes. I feel fairly comfortable finding and navigating most publicly available plant data resources. This week I decided to test some annotation comparison software I’ve written recently with the two latest builds of the human genome. This required a lot more work than I expected, so here I have documented the issues I encountered and the solutions I came up with.

The first complication is that there is no “definitive” set of gene structure annotations for the human genome. UCSC releases annotations with every new assembly of the human genome (with labels like “hg18” and “hg19”). Ensembl releases annotations much more frequently, resulting in multiple annotation sets per assembly (they use labels like “NCBI36.54” or “GRCh37.55”, where NCBI36 = hg18, GRCh37 = hg19, and the last number refers to the Ensembl build number). Considering that UCSC and Ensembl have different gene annotation pipelines/methodologies, choosing a representative pair of genome annotations isn’t necessarily an objective process. I didn’t delve too deeply into this: for simplicity’s sake, I simply downloaded the latest two builds (GRCh37.61 and GRCh37.62) from Ensembl’s FTP site.

The other complication is that I’ve worked almost entirely with GFF3-formatted annotations in the past, while human genome annotations seem to be available only in GTF (legacy) or BED (newfangled) formats. I don’t have anything against either of these formats; it’s just that my processing and analysis tools expect GFF3. It’s always a pain when there are 6 different data formats to deal with. To make things even more interesting, there are slight differences between how UCSC uses GTF and how Ensembl uses GTF, so some software will work with one but not the other. Whatever…

Today I hacked together a pipeline to convert the data to the format I needed. Below are the commands I used, corresponding explanations for each command, and relevant source code.

perl ensembl_gtf_to_gff.pl < Homo_sapiens.GRCh37.61.gtf > Homo-sapiens-unfiltered-noutrs-nointrons.gff3
grep '^chr[0-9][0-9]*\|^chr[XY]' Homo-sapiens-unfiltered-noutrs-nointrons.gff3 > Homo-sapiens-noutrs-nointrons.gff3
perl add_utrs.pl < Homo-sapiens-noutrs-nointrons.gff3 > Homo-sapiens-nointrons.gff3
gt gff3 -addintrons -retainids -sort -tidy Homo-sapiens-nointrons.gff3 > Homo-sapiens.gff3
  1. First, I tracked down an existing GTF-to-GFF3 converter (from this thread). I modified it slightly to filter out non-protein-coding genes and to extract CDS fragments in addition to exons (see ensembl_gtf_to_gff.pl below) and converted the Ensembl GTF data to GFF3 format.
  2. Next I used grep to filter out any genes that were not mapped to a chromosome in the nuclear genome.
  3. Then, I wrote a script (see add_utrs.pl below) to add explicit UTR locations (as inferred from the exon and CDS features) and ran it on the filtered GFF3 data.
  4. Finally, I used GenomeTools to add explicit intron locations and sort/tidy the data.

I repeated this process for both sets of annotations to get the two GFF3 files I need. It took most of the morning to figure this all out, but I finally have data in the format I need!

ensembl_gtf_to_gff.pl

use strict;

my @v;
my ($g,$tr,$last_tr,$last_g);
my @exons=();
my @trs=();
my $counter = 0;

while (<>) {
	@v = split /\t/;
	next unless $v[1] eq "protein_coding" and ($v[2] eq 'exon' or $v[2] eq 'CDS');
	$v[0] = 'chr' . $v[0] unless index($v[0], 'chr') == 0;
	($g,$tr) = $v[8] =~ /gene_id "(ENSG.{11})"; transcript_id "(ENST.{11})/;
	if ($tr ne $last_tr) {
		push @trs, [@exons];
		if ($g ne $last_g) {
			process(@trs) if defined $last_g;
			@trs=();
			$last_g = $g;
			$counter++;
			print STDERR '.' if $counter % 10000 == 0;
		}
		$last_tr = $tr;
		@exons = [@v];
	} else {
		push @exons, [@v];
	}
}
push @trs, [@exons] if scalar @exons;
process(@trs);

sub process {
	my @trs = @_;
	return unless scalar @trs;
	my @all_starts = ();
	my @all_ends = ();
	my @out = ();
	my ($gid, $tid, $gname);
	my (@starts, @ends, $start, $end);
	my (@transcript, @gene);
	my $exon_info_string;

	($gid) = $trs[0]->[0]->[8] =~ /gene_id "([^"]+)"/;
	($gname) = $trs[0]->[0]->[8] =~ /gene_name "([^"]+)"/;
	$gname =~ tr/;/./;

	foreach my $exons (@trs) {
		@starts = sort {$a<=>$b} map { $_->[3] } @$exons;
		@ends = sort {$b<=>$a} map { $_->[4] } @$exons;
		$start = $starts[0];
		$end = $ends[0];

		($tid) = $exons->[0]->[8] =~ /transcript_id "([^"]+)"/;
		$exon_info_string = "Parent=$tid\n";

		@transcript = @{$exons->[0]};
		$transcript[2] = 'mRNA';
		$transcript[3] = $start;
		$transcript[4] = $end;
		$transcript[8] =  "ID=$tid;Name=$gname;Parent=$gid\n";

		push @all_starts, $start;
		push @all_ends, $end;
		foreach (reverse @$exons) {
			$_->[8] = $exon_info_string;
			push @out, join("\t", @$_);
		}
		push @out, join("\t", @transcript);
	}
	@all_starts = sort {$a<=>$b} @all_starts;
	@all_ends = sort {$b<=>$a} @all_ends;
	@gene = @transcript;
	$gene[2] = 'gene';
	$gene[3] = $all_starts[0];
	$gene[4] = $all_ends[0];
	$gene[8] = "ID=$gid;Name=$gname\n";
	push @out, join("\t", @gene);
	print reverse @out;
}

add_utrs.pl

use strict;
use warnings;
use Bio::Tools::GFF;
use Bio::SeqFeature::Generic;

my $loader = Bio::Tools::GFF->new( -fh => \*STDIN,  -gff_version => 3 );
my $writer = Bio::Tools::GFF->new( -fh => \*STDOUT, -gff_version => 3 );

my $current_trns;
my $current_trns_id = "";
my @current_exons = ();
my @current_cdss = ();

while(my $feature = $loader->next_feature())
{
  my $type = $feature->primary_tag();
  if($type eq "gene")
  {
    $writer->write_feature($feature);
  }
  elsif($type eq "CDS")
  {
    push(@current_cdss, $feature);
  }
  elsif($type eq "exon")
  {
    push(@current_exons, $feature);
  }
  elsif($type eq "mRNA")
  {
    if($current_trns_id ne "" and scalar(@current_cdss) > 0)
    {
      process();
    }
    ($current_trns_id) = $feature->get_tag_values("ID");
    $current_trns = $feature;
    @current_exons = ();
    @current_cdss = ();
  }
  else
  {
    die("unknown feature type '$type'");
  }
}
if(scalar(@current_cdss) > 0)
{
  process();
}

sub process
{
  my $cds_start = -1;
  my $cds_end = -1;
  foreach my $cds(@current_cdss)
  {
    if($cds_start == -1 or $cds->location->start < $cds_start)
    {
      $cds_start = $cds->location->start;
    }
    if($cds_end == -1 or $cds->location->end > $cds_end)
    {
      $cds_end = $cds->location->end;
    }
  }
  my @utrs = ();
  my $front_type = "five_prime_UTR";
  my $back_type = "three_prime_UTR";
  if($current_trns->location->strand == -1)
  {
    $front_type = "three_prime_UTR";
    $back_type = "five_prime_UTR";
  }   
  foreach my $exon(@current_exons)
  {   
    if($exon->location->start < $cds_start)
    {   
      my $utr_start = $exon->location->start;
      my $utr_end = $exon->location->end;
      if($exon->location->end >= $cds_start)
      {   
        $utr_end = $cds_start - 1;
      }   
      my $utr = Bio::SeqFeature::Generic->new
      (   
        -seq_id => $exon->seq_id,
        -source_tag => $exon->source_tag,
        -primary => $front_type,
        -start => $utr_start,
        -end => $utr_end,
        -strand => $exon->location->strand,
        -frame => ".",
      );  
      $utr->add_tag_value("Parent", $current_trns_id);
      push(@utrs, $utr);
    }   
    if($exon->location->end > $cds_end)
    {   
      my $utr_start = $exon->location->start;
      my $utr_end = $exon->location->end;
      if($exon->location->start <= $cds_end)
      {   
        $utr_start = $cds_end + 1;
      }
      my $utr = Bio::SeqFeature::Generic->new
      (
        -seq_id => $exon->seq_id,
        -source_tag => $exon->source_tag,
        -primary => $back_type,
        -start => $utr_start,
        -end => $utr_end,
        -strand => $exon->location->strand,
        -frame => ".",
      );
      $utr->add_tag_value("Parent", $current_trns_id);
      push(@utrs, $utr);
    }
  }
  $writer->write_feature($current_trns);
  foreach my $feat((@current_exons, @current_cdss, @utrs))
  {
    $writer->write_feature($feat);
  }
}

Text wrapping with dot (graphviz)

Recently I have been studying a software application that introduces some novel algorithms for correcting errors in next-generation sequencing data. To better understand exactly what one algorithm was doing, I wrote down the logic (flowchart style) on a piece of paper. This was a very helpful process. However, I also wanted to preserve an electronic copy. Since my pseudo-flowchart was pretty much a binary tree, I decided to create a graphic with the graphviz tool.

I learned the bare essentials of graphviz a few years ago, so after looking at some of my old examples I created this .dot file.

digraph
{
  node [color=Blue,shape=box]

  1.1 [label="Frequency of t exceeds upper threshold"]
  2.1 [label="t has d-mutant tiles"]
  2.2 [label="Valid"]
  3.1 [label="Frequency of t exceeds lower threshold"]
  3.2 [label="Frequency of t exceeds lower threshold"]
  4.1 [label="Insufficient evidence"]
  4.2 [label="Valid"]
  4.3 [label="t has only one d-mutant that exceeds lower threshold"]
  4.4 [label="Are there any d-mutant tiles with significantly higher frequencies?"]
  5.1 [label="Insufficient evidence"]
  node [color=Green] 5.2 [label="Correct t to t'"] node [color=Blue]
  5.3 [label="t has a d-mutant tile t' that is closer than all other d-mutant tiles and for which a corrected base has a higher quality score"]
  5.4 [label="Valid"]
  6.1 [label="Insufficient evidence"]
  6.2 [label="t' is unique"]
  7.1 [label="Insufficient evidence"]
  node [color=Green] 7.2 [label="Correct t to t'"] node [color=Blue]

  1.1 -> 2.1 [label="no"]
  1.1 -> 2.2 [label="yes"]
  2.1 -> 3.1 [label="no"]
  2.1 -> 3.2 [label="yes"]
  3.1 -> 4.1 [label="no"]
  3.1 -> 4.2 [label="yes"]
  3.2 -> 4.3 [label="no"]
  3.2 -> 4.4 [label="yes"]
  4.3 -> 5.1 [label="no"]
  4.3 -> 5.2 [label="yes"]
  4.4 -> 5.3 [label="no"]
  4.4 -> 5.4 [label="yes"]
  5.3 -> 6.1 [label="no"]
  5.3 -> 6.2 [label="yes"]
  6.2 -> 7.1 [label="no"]
  6.2 -> 7.2 [label="yes"]
}

The command to create a .png graphic from this file was easy.

dot -Tpng -o test.png test.dot

The result was almost what I was looking for, but not quite (click on the image for full-size version).

Tile error correction: wide

The problem is that the boxes containing a lot of text are too wide. Unfortunately, graphviz does not support automatic text wrapping, so you have to control box width by manually inserting \n characters into the box labels. Anyone who has done something like this before knows how tedious the process can be, and how frustrating it is to redo when you need to change the label text.

Instead, I wrote a simple Perl script that will automatically add line breaks to labels so that the label width does not exceed some given number.

#!/usr/bin/perl
use strict;

my $usage = "setdotlabelwidth [char-width] < [dotfile]";
my $width = shift() or die("Usage: $usage\n");

while(<STDIN>)
{
  if(m/label="(.*?)"/)
  {
    my $labeltext = $1;
    my @words = split(/ /, $labeltext);
    my @newtext = ();
    my $newline = "";
    foreach my $word(@words)
    {
      if( length($newline) > 0 and
          length($newline) + length($word) > $width )
      {
        push(@newtext, $newline);
        $newline = "";
      }
      $newline .= " " if( length($newline) > 0 );
      $newline .= $word;
    }
    push(@newtext, $newline) if( length($newline) > 0 );
    my $newlabel = join("\\n", @newtext);
    s/label=".*?"/label="$newlabel"/;
  }
  print;
}

I saved this program as setdotlabelwidth, ran it, and simply piped the output into graphviz. In this case, I set the width to 35 characters.

./setdotlabelwidth 35 < tile-error-correction.dot | dot -Tpng -o tile-error-correction.png

Voilà, the graphic looks much nicer now! Again, click on the image for the full-size version.

Tile error correction

Perl documentation with POD and Pdoc

I just finished writing a Perl module yesterday–it’s an implementation of the interval tree data structure. For code like this (anything that I may ever want to reuse and/or share), I try to include some documentation. In Perl, that usually means a few comment lines here and there. This is sufficient for small scripts, but for larger programs or modules there are better alternatives (i.e. Perl’s Plain Old Documentation, or POD). I figured this new module would be a great chance for me to try out some new documentation approaches.

Perhaps the biggest reason I wanted to try out POD is that you can process POD-documented modules to produce nice HTML files. I use the BioPerl docs quite often (here is an example), and I wanted to create something like that for my module. I knew little about POD, so I figured I would look at the source code of several familiar BioPerl modules on my system. Sure enough, the files included the POD, so I used it as an example of how to document my module. All in all, it took less than an hour to add some comprehensive documentation. Here is a sample of the code.


=head2 find

 Title   : find
 Usage   : my @overlapping_intervals = $itree->find( 27000, 310000 );
 Function: Finds all of the intervals in the tree that overlap with the given
           interval
 Returns : An array of interval objects
 Args    : Two integers, representing the start and end positions of an interval

=cut
sub find
{
  my($self, $start, $end) = @_;
  my @overlapping = ();

  # Grab overlapping intervals from this node
  foreach my $interval(@{$self->{intervals}})
  {
    push(@overlapping, $interval) if($interval->end() >= $start and $interval->start() <= $end);
  }
  # Grab overlapping intervals from the left subtree
  if( exists($self->{left}) and $start <= $self->{center} )
  {
    @overlapping = (@overlapping, $self->{left}->find($start, $end));
  }
  # Grab overlapping intervals from the right subtree
  if( exists($self->{right}) and $end >= $self->{center} )
  {
    @overlapping = (@overlapping, $self->{right}->find($start, $end));
  }

  return @overlapping;
}

1;

Once I was finished, I ran pod2html on the module source code and got this (source here). Definitely not what I was expecting. I mean, the BioPerl docs aren’t exactly the face of beauty themselves, but this just seemed…elementary to me.

I asked the BioPerl mailing list what tool they used for their documentation, and they referred me to the Pdoc package. Downloading the package from their site and installing it on my MacBook was pretty simple.

cd /usr/local/
sudo mv ~/Downloads/pdoc-1.1.tar.gz .
sudo tar xzf pdoc-1.1.tar.gz
cd pdoc-1.1
sudo perl Makefile.PL
sudo make
sudo make install

Running the tool was pretty simple as well. It has many command-line options for customization, but I used the bare minimum: the -source option for indicating the directory where my module was located and the -target option for indicating where to write the generated output files.

cd
mkdir html_docs_temp
perl /usr/local/pdoc-1.1/scripts/perlmod2www.pl -source ~/interval_tree_temp -target ~/html_docs_temp

This created an entire documentation browser at ~/html_docs_temp/index.html. It definitely looks much better than the previous example, and while the browser idea is nice (I will be using it in the very near future), all I really needed for this simple example was the IntervalTree.html and the perl.css file. Here are the results (source here).

Retrieving split subsequences with BioPerl

Whenever I need to store the positions of genes and other features of a genomic sequence, I typically keep them in a separate GFF3 file. Then, when I need to retrieve the genomic sub-sequence corresponding to a given feature, I simply grab that feature’s start and end positions from the GFF3 file, grab the genomic sequence from a Fasta file, and then pull out the sub-sequence of the feature using Perl’s substr function or BioPerl’s subseq function (from the Bio::PrimarySeq module). Pretty simple.

However, it requires a bit more work to retrieve split features (such as CDS fragments separated by introns). My typical approach here has been to grab each fragment separately, concatenate them, and then look at the feature’s strand to determine whether I needed to compute the reverse complement of the sequence I had just put together. Still not too difficult, but a lot of places where you can make simple silly mistakes.

A few days ago, I discovered BioPerl’s Bio::Location::Split module which really simplifies this process. The module allows you to create an object to store the start and end positions of every fragment belonging to a feature that has been split up across a genomic sequence. Then, instead of grabbing each individual fragment and concatenating them together, you can simply pass the Bio::Location::Split object as the argument in the subseq method to retrieve the sequence.

use Bio::Location::Split;

my $loc = Bio::Location::Split->new();
foreach my $cds_fragment(@cds_features)
{
  $loc->add_sub_Location( $cds_fragment->location );
}
my $cds_sequence = $genomic_sequence->subseq($loc);

This nice discovery has made my code simpler, more readable, and more maintainable.

Random sequence generator

One way to test the probability of any given feature of a genomic sequence is to determine how often that feature occurs in sequences randomly sampled from some given “sequence space.” Determining an appropriate sequence space from which to sample isn’t a trivial task, but a common approach is to match the nucleotide composition of a given sequence (or set of sequences). Alternatively, one can use a model in which the probability of seeing a specific nucleotide at a specific position is dependent on one or more preceding nucleotides. This statistical model is called a Markov model, and it has applications to many areas of research.

I wrote a random sequence generator that uses Markov models to generate any number of random sequences. The program takes a model file (containing initial state probabilities and transition probabilities) as input and generates random sequences (the user can control the number of sequences and the length of each sequence). I also wrote a program to train a kth-order Markov model given one or more sequences (in Fasta format). The output of this program can then be used as input for the random sequence generator. Source code for these programs can be found at https://github.com/standage/Statistics-568/tree/master/sequence_spacer.

In my genome informatics class, we discussed the concept of waiting time–how long you expect to proceed along a sequence until you find a given feature. We derived an analytical solution for the expected waiting time for the feature “ATG” in a sequence where As, Cs, Gs, and Ts are evenly distributed. I then used my random sequence generator to create 5000 random sequences (each of length 1000) whose expected base composition was equal for As, Cs, Gs, and Ts. I then determined the position of the first occurrence of “ATG” in each random sequence. I found that the average waiting time was 63.41, very close to the analytical solution of 64. This is just one example of how computation and simulation can help address an important biological question. This type of approach is especially useful when analytical solutions are not readily available.

Start codon waiting time

Code golf: Morse code translator

I don’t think Morse code has any application in bioinformatics, but I thought I would share this anyway.

The private beta for StackExchange’s Code Golf & Programming Puzzles Q&A site just launched yesterday and, having committed to this forum previously, I was invited to participate (I actually posted the first answer on the site!). The idea behind code golf is to try to solve a programming problem using as little code as possible. I think this is a great way to keep programming and analytical thinking skills sharp, whatever your work or research interests are. I must admit that I’m more interested in solving puzzles than I am about getting the shortest code possible.

Anyway, one of the puzzle challenges posted today was to write code to convert alphanumeric characters to Morse code. Here is my solution in Perl. I’ve added line breaks so it will fit in the page, but it’s intended to be all on one line.

%c=("A"=>".-","B"=>"-...","C"=>"-.-.","D"=>"-..","E"=>".","F"=>"..-.","G"=>"--.",
"H"=>"....","I"=>"..","J"=>".---","K"=>"-.-","L"=>".-..","M"=>"--","N"=>"-.",
"O"=>"---","P"=>".--.","Q"=>"--.-","R"=>".-.","S"=>"...","T"=>"-","U"=>"..-",
"V"=>"...-","W"=>".--","X"=>"-..-","Y"=>"-.--","Z"=>"--..",1=>".----",2=>"..---",
3=>"...--",4=>"....-",5=>".....",6=>"-....",7=>"--...",8=>"---..",9=>"----.",
0=>"-----");while(<>){foreach(split(//)){if(exists($c{$_})){
printf"%s ",$c{$_}}else{print"$_"}}}

This can be executed directly from the command line without having to create any source code files.

$ echo 'HERE IS SOME ENGLISH' | perl -e '%c=("A"=>".-","B"=>"-...","C"=>"-.-.","D"=>"-..","E"=>".","F"=>"..-.","G"=>"--.","H"=>"....","I"=>"..","J"=>".---","K"=>"-.-","L"=>".-..","M"=>"--","N"=>"-.","O"=>"---","P"=>".--.","Q"=>"--.-","R"=>".-.","S"=>"...","T"=>"-","U"=>"..-","V"=>"...-","W"=>".--","X"=>"-..-","Y"=>"-.--","Z"=>"--..",1=>".----",2=>"..---",3=>"...--",4=>"....-",5=>".....",6=>"-....",7=>"--...",8=>"---..",9=>"----.",0=>"-----");while(<>){foreach(split(//)){if(exists($c{$_})){printf"%s ",$c{$_}}else{print"$_"}}}'
.... . .-. .  .. ...  ... --- -- .  . -. --. .-.. .. ... ....