Category: PHP

PHP one-liners

One of the big turning points in the development of my programming and data processing skills was embracing Perl not only as a language for writing scripts but also as a shell tool for basic processing tasks. For a long time I had basic familiarity with common shell tools such as grep, sort, cut, and so on, and had learned to do basic tasks by piping them together. I had also learned the power of writing short bits of Perl code directly on the command line and running it with perl -e or perl -ne. But as I’ve gained more experience with shell tools (and hey, even more experience with Perl), my default approach to data processing has changed. For many line-by-line processing tasks, stitching Perl together with other shell tools is often quicker and more concise than firing up a text editor and writing the whole processing procedure in Perl.

I’m a huge proponent of using the right tool for the right job. Some languages are designed and optimized for certain types of tasks, and of course a programmer’s experience and preference influence what “the right tool” for a particular job is. My first programming experience was with PHP, and I occasionally find myself “thinking in PHP” while I’m actually writing code in other languages, or wishing I could leverage one of PHP’s built-in functions without implementing the entire script in PHP.

Well, I came across this underwhelming post the other day. It looks like php -r is all that has been missing in my life…
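As a rough illustration of the kind of line-by-line task this makes possible, the sketch below applies two PHP built-ins to each line of a stream. The helper name capitalize_lines and the sample input are made up for this example; they are not from the post I linked.

```php
<?php
// Hypothetical helper: read lines from a stream and return each one with its
// words capitalized, using the built-ins strtolower() and ucwords().
function capitalize_lines($stream): array
{
    $out = array();
    while (($line = fgets($stream)) !== false) {
        $out[] = ucwords(strtolower(rtrim($line, "\r\n")));
    }
    return $out;
}

// Demo on an in-memory stream so the example runs without any input file.
$demo = fopen('php://memory', 'r+');
fwrite($demo, "hello WORLD\nphp one-liners\n");
rewind($demo);
echo implode("\n", capitalize_lines($demo)), "\n";
```

On the command line, the same loop could be passed as a string to php -r and fed from STDIN in a pipeline, something along the lines of: php -r 'while (($l = fgets(STDIN)) !== false) echo ucwords(strtolower($l));'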


Converting Entrez XML to GFF3

This last week marks the second time I have tried to access gene annotations for a particular genome of interest, only to quickly lose interest because the data were available only in complicated, extremely verbose, and poorly documented XML or ASN.1 formats. Luckily, the first time this happened (a few months ago), I hopped on NCBI’s FTP site and found annotations in a GFF-like tab-delimited format, which I could easily convert to GFF3. This last week, however, NCBI’s FTP site (and their help desk, for that matter) was no help in finding usable gene annotations in a tab-delimited format.

I finally decided to buck up, bite the bullet, and write a conversion script myself (if these conversion scripts counted toward advancement, I’d have tenure by now). Most of my experience with XML comes from my web development days, when I primarily used PHP’s SimpleXML library for parsing and processing XML data. I’m sure Perl and C (and probably all the other common languages) have XML-processing libraries that are just fine, but I decided to implement this script in PHP so that I could work in a familiar environment, complete the task as quickly as possible, and get back to my research.
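For readers who have not used SimpleXML, here is a minimal sketch of the access patterns the script relies on: XPath queries, brace syntax for hyphenated element names, and attribute lookups. The XML fragment is made up to imitate the Entrez naming style (it is not real NCBI output), and summarize_genes is a hypothetical helper.

```php
<?php
// Made-up fragment that imitates the element naming style of Entrez XML.
$sample = <<<XML
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_gene>
      <Gene-ref><Gene-ref_locus>abc1</Gene-ref_locus></Gene-ref>
    </Entrezgene_gene>
    <Seq-interval>
      <Seq-interval_from>99</Seq-interval_from>
      <Seq-interval_to>199</Seq-interval_to>
      <Seq-interval_strand><Na-strand value="minus"/></Seq-interval_strand>
    </Seq-interval>
  </Entrezgene>
</Entrezgene-Set>
XML;

// Hypothetical helper: return (locus, from, to, strand) for each gene.
function summarize_genes(string $xml): array
{
    $doc = simplexml_load_string($xml);
    $rows = array();
    foreach ($doc->xpath('/Entrezgene-Set/Entrezgene') as $gene) {
        // Hyphenated element names need the ->{'...'} syntax.
        $locus = (string)$gene->{'Entrezgene_gene'}->{'Gene-ref'}->{'Gene-ref_locus'};
        // xpath() on an element is evaluated relative to that element.
        $interval = $gene->xpath('Seq-interval')[0];
        // Attributes are read with array syntax and cast to string.
        $strand = (string)$interval->{'Seq-interval_strand'}->{'Na-strand'}['value'];
        $rows[] = array($locus, (int)$interval->{'Seq-interval_from'},
                        (int)$interval->{'Seq-interval_to'}, $strand);
    }
    return $rows;
}

print_r(summarize_genes($sample));
```

The explicit (string) and (int) casts matter: SimpleXML returns element objects rather than plain values, so comparisons and functions such as min() are only reliable after casting.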

Anyway, below is the latest draft of the conversion script I implemented.

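#!/usr/bin/env php
<?php
/*
This program takes a single argument on the command line: the path of a file containing Entrez XML-formatted gene annotations.
GFF3-formatted annotations are printed to STDOUT.
*/

ini_set("memory_limit", -1);
$strands = array("plus" => "+", "minus" => "-");
// '%' must be escaped first: str_replace applies the pairs in order, so
// escaping it later would re-escape the '%' in '%3B', '%3D', and so on.
$encode_search = array('%', ';', '=', '&', ',');
$encode_replace = array('%25', '%3B', '%3D', '%26', '%2C');

$xmlfile = $argv[1];
$xmldata = simplexml_load_file($xmlfile);
if($xmldata === false)
{
  fprintf(STDERR, "Error: unable to parse XML file '%s'\n", $xmlfile);
  exit(1);
}

$genes = $xmldata->xpath('/Entrezgene-Set/Entrezgene');

// Print a message and exit if the condition does not hold. This uses an
// explicit check rather than assert(), which can be disabled by configuration
// or throw instead of returning false, depending on the PHP version.
function assertordie($condition, $message)
{
  if(!$condition)
  {
    fprintf(STDERR, "Assert error: $message\n");
    exit(1);
  }
}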

foreach($genes as $gene)
{
  // Gene feature data
  $gene_ui   = $gene->{'Entrezgene_gene-source'}->{'Gene-source'}->{'Gene-source_src-int'};
  $gene_acc  = $gene->{'Entrezgene_gene'}->{'Gene-ref'}->{'Gene-ref_locus'};
  $gene_acc  = str_replace($encode_search, $encode_replace, $gene_acc);
  $gene_desc = $gene->{'Entrezgene_gene'}->{'Gene-ref'}->{'Gene-ref_desc'};
  $gene_desc = str_replace($encode_search, $encode_replace, $gene_desc);
  $gene_comm = $gene->xpath('Entrezgene_locus/Gene-commentary');
  $gene_comm_count = 0;
  if(sizeof($gene_comm) > 1)
  {
    fprintf(STDERR, "Warning: assuming that locus '%s (%s)' contains %d genes\n", $gene_acc, $gene_ui, sizeof($gene_comm));
  }

  foreach($gene_comm as $comm)
  {
    $comm_ui = $gene_ui;
    if(sizeof($gene_comm) > 1)
    {
      // Number the genes so that each one at this locus gets a distinct ID
      $comm_ui = sprintf("%s.g%d", $gene_ui, $gene_comm_count);
      $gene_comm_count++;
    }
    $gene_seq = $comm->{'Gene-commentary_accession'};
    $gene_intervals = $comm->xpath('Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval');
    assertordie(sizeof($gene_intervals) == 0 or sizeof($gene_intervals) == 1, sprintf("number of intervals for gene '%s (%s)': expected=%s, actual=%d", $gene_acc, $comm_ui, "[0,1]", sizeof($gene_intervals)));
    if(sizeof($gene_intervals) == 0)
    {
      fprintf(STDERR, "Warning: gene '%s (%s)' contains no genomic intervals, assuming it's a deprecated gene, skipping\n", $gene_acc, $comm_ui);
      continue;
    }

    $gene_products = $comm->xpath('Gene-commentary_products/Gene-commentary[Gene-commentary_type/@value="mRNA"]');
    if(sizeof($gene_products) < 1)
    {
      fprintf(STDERR, "Warning: gene '%s (%s)' contains no mRNA products, assuming it's a duplicate gene, skipping\n", $gene_acc, $comm_ui);
      continue;
    }
    $strand_attributes = $gene_intervals[0]->{"Seq-interval_strand"}->{"Na-strand"}->attributes();
    $gstrand = $strands[(string)$strand_attributes["value"]];
    $gattributes = sprintf('ID=%s;Name=%s;Note="%s"', $comm_ui, $gene_acc, $gene_desc);
    // Coordinates are shifted by 1 to give the 1-based positions GFF3 uses
    $gstart = (int)$gene_intervals[0]->{"Seq-interval_from"} + 1;
    $gend   = (int)$gene_intervals[0]->{"Seq-interval_to"} + 1;
    printf("%s\t%s\tgene\t%d\t%d\t.\t%s\t.\t%s\n", $gene_seq, "Entrez", $gstart, $gend, $gstrand, $gattributes);

    // mRNA feature data
    foreach($gene_products as $mrna)
    {
      // mRNA and exon features
      $tui = $mrna->{"Gene-commentary_accession"};
      $exons = $mrna->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_mix/Seq-loc-mix/Seq-loc/Seq-loc_int/Seq-interval');
      if(sizeof($exons) == 0)
      {
        // Single-exon transcripts store their interval without the Seq-loc_mix wrapper
        $exons = $mrna->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_int/Seq-interval');
        assertordie(sizeof($exons) == 1, sprintf("number of exons for transcript '%s': expected=%d, actual=%d", $tui, 1, sizeof($exons)));
      }
      $tattributes = sprintf('ID=%s;Parent=%s', $tui, $comm_ui);

      // Cast to int so that min() and max() compare numbers, not XML objects
      $tcoords = array();
      foreach($exons as $exon)
      {
        $tcoords[] = (int)$exon->{"Seq-interval_from"};
        $tcoords[] = (int)$exon->{"Seq-interval_to"};
      }
      $tstart = min($tcoords) + 1;
      $tend   = max($tcoords) + 1;

      // protein and CDS features
      $transcript_products = $mrna->xpath('Gene-commentary_products/Gene-commentary[Gene-commentary_type/@value="peptide"]');
      assertordie(sizeof($transcript_products) == 1, sprintf("number of products for transcript '%s': expected=%d, actual=%d", $tui, 1, sizeof($transcript_products)));
      $protein = $transcript_products[0];
      $pui = $protein->{"Gene-commentary_accession"};
      $cds_segments = $protein->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_mix/Seq-loc-mix/Seq-loc/Seq-loc_int/Seq-interval');
      if(sizeof($cds_segments) == 0)
      {
        $cds_segments = $protein->xpath('Gene-commentary_genomic-coords/Seq-loc/Seq-loc_int/Seq-interval');
      }
      $pcoords = array();
      foreach($cds_segments as $cds)
      {
        $pcoords[] = (int)$cds->{"Seq-interval_from"};
        $pcoords[] = (int)$cds->{"Seq-interval_to"};
      }
      $pstart = min($pcoords) + 1;
      $pend   = max($pcoords) + 1;
      $pattributes = sprintf("ID=%s;Parent=%s", $pui, $tui);

      // Print out all features
      printf("%s\t%s\tmRNA\t%d\t%d\t.\t%s\t.\t%s\n", $gene_seq, "Entrez", $tstart, $tend, $gstrand, $tattributes);
      printf("%s\t%s\tprotein\t%d\t%d\t.\t%s\t.\t%s\n", $gene_seq, "Entrez", $pstart, $pend, $gstrand, $pattributes);
      foreach($exons as $exon)
      {
        $estart = (int)$exon->{"Seq-interval_from"} + 1;
        $eend   = (int)$exon->{"Seq-interval_to"} + 1;
        printf("%s\t%s\texon\t%d\t%d\t.\t%s\t.\tParent=%s\n", $gene_seq, "Entrez", $estart, $eend, $gstrand, $tui);
      }
      foreach($cds_segments as $cds)
      {
        $cstart = (int)$cds->{"Seq-interval_from"} + 1;
        $cend   = (int)$cds->{"Seq-interval_to"} + 1;
        printf("%s\t%s\tCDS\t%d\t%d\t.\t%s\t.\tParent=%s\n", $gene_seq, "Entrez", $cstart, $cend, $gstrand, $tui);
      }
    }
  }
}
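
Two details of the script are easy to get wrong: escaping reserved characters in the attribute column, and shifting coordinates by one. The sketch below isolates both. The function names gff3_encode and gff3_line and the sample values are hypothetical, and I use strtr() here as an alternative to str_replace() because it never re-examines its own output.

```php
<?php
// Escape the characters that the script treats as reserved in GFF3
// attribute values (column 9).
function gff3_encode(string $value): string
{
    // strtr() with an array scans the input once, so the '%' introduced by
    // one replacement cannot be escaped a second time.
    return strtr($value, array(
        '%' => '%25',
        ';' => '%3B',
        '=' => '%3D',
        '&' => '%26',
        ',' => '%2C',
    ));
}

// Build one GFF3 line from 0-based, inclusive coordinates (the convention
// the script assumes for Seq-interval_from and Seq-interval_to).
function gff3_line(string $seqid, string $type, int $from, int $to,
                   string $strand, array $attributes): string
{
    $pairs = array();
    foreach ($attributes as $key => $value) {
        $pairs[] = $key . '=' . gff3_encode((string)$value);
    }
    return sprintf("%s\t%s\t%s\t%d\t%d\t.\t%s\t.\t%s",
        $seqid, 'Entrez', $type, $from + 1, $to + 1, $strand,
        implode(';', $pairs));
}

// Made-up sequence ID, gene ID, and description.
echo gff3_line('seq1', 'gene', 99, 199, '-',
    array('ID' => 'g1', 'Name' => 'abc1', 'Note' => 'kinase; 50% identity')), "\n";
```

With str_replace() and array arguments, the pairs are applied one after another, so the same result requires listing '%' before any replacement that produces a '%'.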

Fixing my BioStar flair

Update: As of April 2012, BioStar has recently been migrated and their flair functionality is not yet re-implemented. I have also migrated my work server once and my blog twice since I posted this. This post did not handle these migrations well, but I’m just going to leave it as-is for now.

A few days ago I decided to put some StackExchange flair on my CV. I have nearly as much reputation on BioStar as I do on all the other StackExchange sites combined, but since BioStar was never fully integrated with the StackExchange 2.0 network, I have to link to my BioStar flair separately. Oh well.

I copied the flair HTML from BioStar and pasted it on my site, and I got this. For some reason, the favicon in the flair has been replaced with the text SE, and I think it looks pretty lame.

insert old flair here

I looked at the source code for the flair and found that the SE text was an image ([link missing]). I dug around a bit and found the favicon file I was looking for (conveniently, it was [link missing]). So I went ahead and created a simple PHP script on my server that grabs the BioStar flair, changes the image reference, and prints out the HTML. If I link to my new and improved flair, this is the result.

insert improved flair here


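<?php
// Cast the id to an integer so arbitrary input cannot alter the requested URL.
$id = (int)$_GET['id'];
$html = file_get_contents("". $id .".html?theme=default");
$html = str_replace("theme.favicon.flair.0", "theme.favicon.flair", $html);
echo $html;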
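
Since the script passes a request parameter straight into a URL, it is worth validating it first. Here is a small sketch of that idea, kept separate from the network request so it can be run anywhere. The function names and the sample HTML fragment are made up for illustration.

```php
<?php
// Hypothetical helper: accept only a positive integer user id.
// Returns the id as an int, or null if the input is not acceptable.
function valid_flair_id($raw)
{
    $id = filter_var($raw, FILTER_VALIDATE_INT,
        array('options' => array('min_range' => 1)));
    return $id === false ? null : $id;
}

// Swap the "SE" image reference for the real favicon.
function fix_flair_icon(string $html): string
{
    return str_replace('theme.favicon.flair.0', 'theme.favicon.flair', $html);
}

// Demo with a made-up HTML fragment instead of a live request.
$id = valid_flair_id('123');
$html = '<img src="theme.favicon.flair.0.png" />';
echo $id === null ? "bad id\n" : fix_flair_icon($html) . "\n";
```

In the real script, a null id would be the point to send an error response instead of calling file_get_contents().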


This solved one issue, but unfortunately it created another. Since the images were different sizes, the new icon doesn’t line up well with my name. Using a browser plugin to play with the CSS, I was able to fix things with this change.

/* I replaced this...
.valuable-flair .userInfo .username img{border:none; padding-right:3px;}

...with this */
.valuable-flair .userInfo .username img{border:none; margin-bottom: -4px; padding-right:3px;}

Rather than continuing my current approach of dynamically editing markup directly from BioStar, I decided to make static copies of the HTML and CSS on my server and manually make the necessary edits. Note that I don’t mean static in every sense of the word, since the HTML still uses JSON to populate the flair data from BioStar and this request is dynamically built using PHP. Anyway, here is the result!

insert new flair here

Any BioStar user can get their flair by including the following HTML, replacing my ID with theirs in the URL parameters.

<iframe src=""
        marginwidth="0" marginheight="0" frameborder="0" 
        scrolling="no" width="210px" height="60px"></iframe>

I’ve included the source of the final page below.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">
<html xmlns="">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="stylesheet" type="text/css" href="biostar.css" />
    <div class="valuable-flair">
      <div id="gravatar" class="gravatar"></div>
      <div class="userInfo">
        <span class="username">
          <img src="" />
          <a id="profileurl" class="user-link" target="_blank" title="Visit my profile on">
            <span id="displayname"></span>
          </a>
        </span><br />
        <span id="reputation" class="reputation-score" title="reputation score"></span><br />
        <div id="badges"></div>
      </div>
    </div>
    <script type="text/javascript" src=""></script>
    <script type="text/javascript">
      $().ready(function() {
        $.getJSON("<?php echo (int)$_GET['id']; ?>.json?callback=?", flairCallback);
      });

      function flairCallback(data) {
        $("#profileurl").attr("href", data.profileUrl);