A plug for ontologies

An ontology is a formal (mathematical) representation of knowledge for a particular information domain: the types of objects that exist in that domain, and the relationships between those objects. Formally organizing terms using a controlled vocabulary and taxonomy makes it possible for us to exchange and analyze information in that domain in a way that computers can understand.

So what does this have to do with biology? My writing of this note was prompted by two things: first, I found this abstract in my RSS feed earlier this week, and then I saw this question on a bioinformatics Q&A site where I’m a regular participant. Ontologies seem to be gaining substantial traction in biology and bioinformatics.

Anyone who has not been living under a rock for the last several decades knows that the life sciences are increasingly becoming information- and data-driven. This is certainly the case in such fields as genomics, where working with gigabytes of raw high-throughput sequence data is routine. But even in other branches of biology (traditionally characterized by tedious and complicated assays designed to tease out small, precious nuggets of truth), advances in technology and the accumulation of decades of research are putting huge amounts of information at the fingertips of practically any scientist who cares to use it. There is no way to take advantage of these data without computers, but computers will not be much help unless we can formulate our biological problems in formal (mathematical) terms.

Perhaps the most popular and successful ontologies in the life sciences are the Gene Ontology and the Sequence Ontology. I’ve worked quite a bit with the Sequence Ontology, which originated from a need to intelligently exchange annotations describing biological sequences. You can imagine how difficult it could be to share information between different genome sequencing and annotation projects if each used different conventions for describing the data. For example, what project A calls a “transcript” might be called an “mRNA” by project B and just “RNA” by project C. Of course, any scientist who takes a moment to look at this can figure it out quickly, but we cannot expect a computer to understand this relationship unless we tell it. The Sequence Ontology provides a controlled vocabulary for describing biological sequences, as well as relationships between terms (e.g., a “coding sequence” is associated with an “mRNA”, which is associated with a “gene”).
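
To make the idea concrete, here is a minimal Python sketch of what a controlled vocabulary with relationships buys you. Everything in it is a toy assumption for illustration: the project names, the choice of “mRNA” as the shared term, the generic `ASSOCIATED_WITH` relation, and the helper functions `normalize` and `ancestors` are all hypothetical, and none of it reflects the actual structure or identifiers of the Sequence Ontology.

```python
# Toy controlled vocabulary: each project's local label maps to one shared term.
# The projects, labels, and mapping are illustrative assumptions, not real
# Sequence Ontology content.
LOCAL_TO_SHARED = {
    ("project_a", "transcript"): "mRNA",
    ("project_b", "mRNA"): "mRNA",
    ("project_c", "RNA"): "mRNA",
}

# Toy relationships between shared terms, following the
# "coding sequence -> mRNA -> gene" example in the text.
ASSOCIATED_WITH = {
    "coding sequence": "mRNA",
    "mRNA": "gene",
}


def normalize(project, label):
    """Translate a project-specific label into the shared vocabulary."""
    return LOCAL_TO_SHARED[(project, label)]


def ancestors(term):
    """Follow the associated-with links upward from a term."""
    chain = []
    while term in ASSOCIATED_WITH:
        term = ASSOCIATED_WITH[term]
        chain.append(term)
    return chain


if __name__ == "__main__":
    # Three different local labels resolve to the same shared term.
    print(normalize("project_a", "transcript"))
    print(normalize("project_c", "RNA"))
    # The relationships let a program work out what a term belongs to.
    print(ancestors("coding sequence"))
```

Once every project's labels are translated into the shared terms, a program can compare annotations across projects and reason about how features relate, which is exactly the part a human does implicitly and a computer cannot do without being told.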

Biological ontologies are not limited to describing genes and sequences, however. The OBO Foundry supports a handful of ontologies for various subdomains of biological inquiry, encompassing everything from the finest molecular details to coarse-grained systematics and anatomy. OBO also has a list of dozens of other biological and biomedical ontologies that are either candidates for OBO support or that are externally developed and simply recognized by OBO.

As biology continues to become an information science, ontologies are going to play an increasingly critical role in our representation, exchange, and analysis of biological data.
