Command-line magic for your gene annotations

In the GFF3 specification, Lincoln Stein claims that the primary reason tab-delimited annotation formats have persisted for so many years (despite the existence a variety of worthy alternative formats) is the ease with which tab-delimited data can be edited and processed on the command line. In this post I wanted to see what kinds of information are easily extracted from a GFF3 file using basic command line tools (i.e. no reusable parsers as provided by BioPerl, BioPython, et al).

Determine all annotated sequences

The GFF3 spec includes the ##sequence-region pragma whose purpose is to specify boundary coordinates for each annotated sequence in the file. However, I see this pragma neglected more often than not, both by data files and by tools. Plus, there is no strict requirement that the file must include features corresponding to each ##sequence-region entry. So if you want to determine which sequences for which a given GFF3 really provides annotations, it’s best to look instead at the first column of each feature.

Here is a command that will print out all of the sequences annotated by a given GFF3 file. A detailed explanation is provided below.

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq -c | sort -rn | head
  • The first cut command will attempt to extract and print the 1st and 9th column of each line in the file. We ultimately don’t need the 9th column, but it’s helpful at this point since it allows us to distinguish our entries of interest (features) from other entries (directives, pragmas, comments, and sequence data). This command will output two fields for each feature, while it will only output a single field for all the other entry types. This enables filtering out of non-features with subsequent commands.
  • The grep command will check each line of the previous command to see whether it contains a tab character (separating two fields). Only lines that contain a tab are printed. The output of this command is all of the features from the GFF3 file.
  • The second cut command will cut out the first field from each feature and ignore the other, since we weren’t really interested in it in the first place. The output of this command is the sequence corresponding to each feature.
  • The first sort command will (surprise!) sort the output of the previous command.
  • The uniq command will collapse identical adjacent lines and print out the number of times each line is seen. The output of this command is what we’re looking for and really we could stop here. The next two commands are simply for convenience.
  • The second sort command will sort the sequence IDs according to the number of corresponding annotated features.
  • The head command will simply ensure that you only see the first 10 lines of output (useful if your GFF3 file includes thousands of sequences, such as scaffolds from a draft assembly).

Determine annotation types

Perhaps the most common use for the GFF3 format is encoding protein-coding gene structures. However, the format is very flexible and can leverage the Sequence Ontology to provide annotations for just about anything you could imagine related to biological sequences. So when working with annotations from an external source, it’s always a good idea to verify your assumptions about precisely what type of annotations are provided.

Here is a command that will print out all of the feature types contained in a given GFF3 file, along with a detailed explanation.

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq -c | sort -rn | head
  • We’re interested only in features, so the grep command will print out all lines that do not start with a pound symbol.
  • The cut command will cut out the third column of each feature and print it. This command will automatically ignore any unwanted data that may not have been filtered out by the previous command (such as sequence data), since those data do not contain tabs.
  • The purpose of the remaining commands is identical to the previous example: sort the output, collapse adjacent lines that are identical, and print out each line and the number of times it occurs. Again the last two command are provided only for convenience so that your terminal is not flooded with data.

Determine the number of genes

The previous command gives the number of occurrences of each feature type in the GFF3 file. However, if you are only interested in the number of occurrences of a single feature type (say, genes), then there is a much simpler solution.

grep -c $'\tgene\t' yourannots.gff3

Replace gene with your desired feature type and that’s that!

Extract all gene IDs

The previous example can easily be extended to pull out all the IDs for a give feature type. Here is a command that will print out all gene IDS for a given GFF3 file, with a corresponding explanation. This command is easily extended to other feature types.

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)' | head
  • As in the previous example, the grep command will print only lines that contain gene features.
  • The Perl one-liner uses a regular expression to identify the ID attribute and print it.
  • The head command is only for convenience and simply makes sure you don’t flood your terminal with IDs.

Print length of each gene

The previous examples can also be extended quite easily to print out the length of all features of a given type. This is very helpful if you want to analyze and/or visualize a distribution of feature lengths. Here is a command that will print out the length of each gene in a given GFF3 file. Again, this is easily adapted to other feature types.

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | /
                                   perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)' | /
                                   sort -rn | head
  • As in the previous example, the grep command will print only lines that contain gene features.
  • The cut command will extract and print the 4th and 5th field of each feature, which corresponds to its start and end coordinates.
  • The Perl one-liner will print out the length of each feature given the coordinates.
  • As in previous examples, the remaining commands are solely for aesthetic convenience.

This command would typically be entered on a single line in your terminal. I only broke it up into multiple lines for readability.

Example

I downloaded gene annotations from TAIR to verify that these commands work as expected. Below is my terminal output.

standage@iMint ~ $ wget ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
--2012-08-10 15:16:45--  ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
           => `TAIR10_GFF3_genes.gff'
Resolving ftp.arabidopsis.org (ftp.arabidopsis.org)... 171.66.71.56
Connecting to ftp.arabidopsis.org (ftp.arabidopsis.org)|171.66.71.56|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Genes/TAIR10_genome_release/TAIR10_gff3 ... done.
==> SIZE TAIR10_GFF3_genes.gff ... 44139005
==> PASV ... done.    ==> RETR TAIR10_GFF3_genes.gff ... done.
Length: 44139005 (42M) (unauthoritative)

100%[=====================================================================================================================>] 44,139,005  6.67M/s   in 7.2s    

2012-08-10 15:16:54 (5.87 MB/s) - `TAIR10_GFF3_genes.gff' saved [44139005]

standage@iMint ~ $ cut -s -f 1,9 TAIR10_GFF3_genes.gff | grep $'\t' | cut -f 1 | sort | uniq -c | sort -rn | head
 157712 Chr1
 135017 Chr5
 113968 Chr3
  91857 Chr2
  90371 Chr4
    723 ChrM
    616 ChrC
standage@iMint ~ $ grep -v '^#' TAIR10_GFF3_genes.gff | cut -s -f 3 | sort | uniq -c | sort -rn | head
 215909 exon
 197160 CDS
  35386 protein
  35386 mRNA
  34621 five_prime_UTR
  30634 three_prime_UTR
  28775 gene
   3911 mRNA_TE_gene
   3903 transposable_element_gene
   1274 pseudogenic_exon
standage@iMint ~ $ grep -c $'\tgene\t' TAIR10_GFF3_genes.gff
28775
standage@iMint ~ $ grep $'\tgene\t' TAIR10_GFF3_genes.gff | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)' | head
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01046
AT1G01050
AT1G01060
AT1G01070
AT1G01073
AT1G01080
standage@iMint ~ $ grep $'\tgene\t' TAIR10_GFF3_genes.gff | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)' | sort -rn | head
31258
26435
25965
23544
19753
19352
18492
18184
17943
17555
standage@iMint ~ $
Advertisements

2 comments

  1. Pingback: stephenturner/oneliners · GitHub | GenomicsNX - Next Generation Sequencing Knowledge-Based
  2. trepanger

    Here is a much simpler way to print gene lengths:

    awk ‘$3 == “gene” {print $5 – $4}’ a.gff3 | sort -rn | head

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s