Genome sequences in FASTA format are strings of As, Cs, Gs, and Ts representing the sequence of nitrogenous bases along the chromosome. I’ve been working on a similar format recently for encoding gene structure along the chromosomes. A ‘G’ represents a nucleotide in an intergenic region, a ‘C’ represents a nucleotide in a coding region within a gene, an ‘I’ represents a nucleotide in an intron, and ‘F’ and ‘T’ represent nucleotides in the 5′ and 3′ UTRs of a gene (respectively). Encoding a genomic sequence in this way doesn’t tell you anything about the nitrogenous bases at each position, but if you’re only interested in investigating gene structure, then this format can be quite handy. I’m calling these strings “model vectors” (vectors representing gene models) to differentiate them from standard FASTA sequences, but standard tools from any bioinformatics library (BioPerl, Biopython, etc.) shouldn’t have any problem processing data in this format.
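To make the encoding concrete, here is a minimal Python sketch that tallies how many nucleotides of a model vector fall into each structural class. The names `FEATURE_CLASS` and `composition`, as well as the sample vector, are made up for illustration and are not part of any existing tool.

```python
from collections import Counter

# Symbol-to-class mapping as described above.
FEATURE_CLASS = {
    "G": "intergenic",
    "C": "coding",
    "I": "intron",
    "F": "5' UTR",
    "T": "3' UTR",
}

def composition(vector):
    """Count the nucleotides belonging to each structural class.

    Hypothetical helper: raises ValueError on symbols outside the alphabet.
    """
    counts = Counter(vector)
    unknown = set(counts) - set(FEATURE_CLASS)
    if unknown:
        raise ValueError("unexpected symbols: " + "".join(sorted(unknown)))
    # Report only the classes that actually occur in the vector.
    return {FEATURE_CLASS[s]: counts[s] for s in FEATURE_CLASS if counts[s]}

# Hypothetical vector: a small forward-strand gene flanked by intergenic sequence.
print(composition("GGGFFCCCIICCTTTGG"))
```

A plain dictionary keeps the alphabet in one place, so the same table can be reused if the set of symbols is ever extended.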
For the sake of simplicity, I will offer a small, and consequently unrealistic, example. The model vector
TTTTTCCIICCCCIIICFFFFFFFF would be annotated something like this in GFF3 format (the gene is on the reverse strand, which is why the 3′ UTR comes first).
##gff-version 3
chr	vim	gene	1	25	.	-	.	ID=g1
chr	vim	mRNA	1	25	.	-	.	ID=g1.t1;Parent=g1
chr	vim	three_prime_UTR	1	5	.	-	.	ID=g1.t1.utr1;Parent=g1.t1
chr	vim	CDS	6	7	.	-	.	ID=g1.t1.cds1;Parent=g1.t1
chr	vim	CDS	10	13	.	-	.	ID=g1.t1.cds2;Parent=g1.t1
chr	vim	CDS	17	17	.	-	.	ID=g1.t1.cds3;Parent=g1.t1
chr	vim	five_prime_UTR	18	25	.	-	.	ID=g1.t1.utr2;Parent=g1.t1
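The correspondence between the model vector and the GFF3 coordinates is essentially a run-length decoding: each run of identical symbols becomes one interval. Here is a rough Python sketch of that step, assuming 1-based inclusive coordinates as in GFF3. The function name `segments` is hypothetical, and the sketch deliberately stops short of inferring strand or building the gene and mRNA parent features.

```python
from itertools import groupby

def segments(vector):
    """Return (symbol, start, end) for each run of identical symbols.

    Coordinates are 1-based and inclusive, matching GFF3 conventions.
    """
    runs = []
    pos = 1
    for symbol, group in groupby(vector):
        length = sum(1 for _ in group)
        runs.append((symbol, pos, pos + length - 1))
        pos += length
    return runs

for symbol, start, end in segments("TTTTTCCIICCCCIIICFFFFFFFF"):
    print(symbol, start, end)
```

The ‘T’, ‘C’, and ‘F’ runs line up with the three_prime_UTR, CDS, and five_prime_UTR features in the annotation, while the ‘I’ runs are the introns, which the GFF3 leaves implicit as the gaps between CDS features.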