Yesterday’s XKCD comic could not have been more timely. This week I am trying to gather whole-genome annotations for a variety of model organisms. Well, I am in fact gathering two sets of annotations for each organism (for comparison). Downloading the data to my local machine hasn’t been the hard part (although navigating a smorgasbord of genome browsers and FTP sites has been “fun”). My real trouble begins once I have the data in hand.
BED, GTF, GFF2, GFF3, XML: all too loosely defined (or too loosely adhered to) to allow for any kind of reliable conversion utility. So I’m stuck searching for conversion scripts on Google, hoping I find one that works for my particular data set…until I throw my hands up in the air and resign myself to writing yet another Perl script that will take 5 minutes to code and 2 hours to debug.
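To give a flavor of why these "5-minute" scripts eat up an afternoon, here is a minimal sketch (in Python rather than Perl) of the simplest case, turning one GFF3 feature line into a BED6 line. The function name `gff3_to_bed6` is made up for illustration, and the choices for the BED name and score columns are my own assumptions, not anything either format mandates. The one part that is not negotiable is the coordinate shift: GFF3 coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open.

```python
from urllib.parse import unquote


def gff3_to_bed6(line):
    """Convert one GFF3 feature line to a BED6 line.

    Returns None for blank lines and comment/directive lines.
    Hypothetical helper for illustration only.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None

    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, _source, ftype, start, end, score, strand, _phase, attributes = fields

    # Column 9 holds semicolon-separated key=value pairs, URL-escaped.
    attrs = {}
    for pair in attributes.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())

    # Assumption: use the ID attribute as the BED name, else the feature type.
    name = attrs.get("ID", ftype)

    # GFF3 is 1-based inclusive; BED is 0-based half-open.
    # So the start shifts down by one and the end stays the same.
    bed_start = int(start) - 1
    bed_end = int(end)

    # Assumption: a missing GFF3 score (".") becomes 0; anything else is
    # passed through untouched, even though BED expects an integer 0-1000.
    bed_score = "0" if score == "." else score

    # BED only knows "+", "-", and "."; GFF3 also allows "?".
    bed_strand = strand if strand in ("+", "-") else "."

    return "\t".join([seqid, str(bed_start), str(bed_end), name, bed_score, bed_strand])
```

Even this toy version has to make three judgment calls (name, score, strand) that another script might make differently, and it says nothing about multi-line features or parent/child relationships.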
I’m glad I’m not the only one who feels this way. Take a look at the top answer to this thread in a bioinformatics Q&A forum.
At times I’ve felt that, given enough time, I could come up with a solution that suits everybody’s needs. But then that just puts us right back where we started (refer again to the XKCD comic). The problem isn’t formats; the problem is people.