In biology, the term locus is pretty loosely defined. In the most general sense, it refers to a specific position on a chromosome or other large sequence of genomic DNA. A locus can refer to a single nucleotide, or it can even refer to a large genomic neighborhood. The most common interpretation of the term locus is the location of a gene on the chromosome.

Recently I’ve been working a lot with gene structure annotation: specifically, comparing one set of annotations to another set. For this work, I needed a much more precise definition for a locus. If one set of annotations has overlapping gene models, should I treat each gene model as a separate locus or should I include them together in an aggregate locus? When comparing one set of annotations to the other, should I use gene models from one set to determine the “true” loci, or should I include gene models from both sets in this process?

Given two sets of gene annotations, the simplest approach to determine loci is probably to treat one set as the reference and to say that each gene in that set corresponds to a distinct locus. However, this approach requires us to compare annotations not only at each locus but also in the “intergenic” space between loci, since gene annotations from the second set will not line up perfectly with the gene annotations from the first (reference) set.

After pondering this issue for a while, I decided it makes much more sense to use both sets of gene models simultaneously to determine the loci. This provides several benefits. First, we make no assumptions as to which set of annotations is higher quality (that is, which is more likely to be representative of the “true” loci). Second, we do not need to analyze the (possibly) large intergenic spaces between loci, since every gene from either set falls within some locus. With these considerations, I currently use the following precise definition of locus when working with gene structure annotations.

Given a set of gene structure annotations, a locus is a maximal region of the genomic sequence containing either a single gene annotation or a set of gene annotations in which every gene in the locus overlaps with at least one other gene in the locus.
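
One way to make this definition concrete is a minimal sketch in Python. It pools the gene models from all annotation sets, sorts them by start position, and merges any gene that overlaps the locus built so far. The function name find_loci, the (start, end) tuple representation, and the example coordinates are illustrative assumptions rather than part of any particular tool. The sketch also assumes 1-based inclusive coordinates on a single sequence and ignores strand.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based inclusive, one sequence.
Interval = Tuple[int, int]


def find_loci(*annotation_sets: List[Interval]) -> List[Interval]:
    """Merge gene intervals from all sets into maximal overlapping regions."""
    # Pool genes from every set; no set is treated as the reference.
    genes = sorted(gene for genes_in_set in annotation_sets for gene in genes_in_set)
    loci: List[Interval] = []
    for start, end in genes:
        if loci and start <= loci[-1][1]:
            # Gene overlaps the current locus: extend the locus if needed.
            loci[-1] = (loci[-1][0], max(loci[-1][1], end))
        else:
            # No overlap with the current locus: this gene starts a new one.
            loci.append((start, end))
    return loci


# Made-up coordinates for illustration only.
reference = [(100, 500), (450, 900), (2000, 2600)]
prediction = [(850, 1200), (3000, 3400)]
print(find_loci(reference, prediction))
# [(100, 1200), (2000, 2600), (3000, 3400)]
```

In this example the first three genes chain together into one aggregate locus even though the first and third do not overlap each other directly, while the two isolated genes each form a single-gene locus. Running the function separately per chromosome (and per strand, if desired) would be a natural extension.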


