# Random sequence generator

One way to test the probability of any given feature of a genomic sequence is to determine how often that feature occurs in sequences randomly sampled from some given “sequence space.” Determining an appropriate sequence space from which to sample isn’t a trivial task, but a common approach is to match the nucleotide composition of a given sequence (or set of sequences). Alternatively, one can use a model in which the probability of seeing a specific nucleotide at a specific position is dependent on one or more preceding nucleotides. This statistical model is called a Markov model, and it has applications to many areas of research.

I wrote a random sequence generator that uses Markov models to generate any number of random sequences. The program takes a model file (containing initial state probabilities and transition probabilities) as input and generates random sequences (the user can control the number of sequences and the length of each sequence). I also wrote a program to train a kth-order Markov model given one or more sequences (in Fasta format). The output of this program can then be used as input for the random sequence generator. Source code for these programs can be found at https://github.com/standage/Statistics-568/tree/master/sequence_spacer.

In my genome informatics class, we discussed the concept of waiting time–how long you expect to proceed along a sequence until you find a given feature. We derived an analytical solution for the expected waiting time for the feature “ATG” in a sequence where As, Cs, Gs, and Ts are evenly distributed. I then used my random sequence generator to create 5000 random sequences (each of length 1000) whose expected base composition was equal for As, Cs, Gs, and Ts. I then determined the position of the first occurrence of “ATG” in each random sequence. I found that the average waiting time was 63.41, very close to the analytical solution of 64. This is just one example of how computation and simulation can help address an important biological question. This type of approach is especially useful when analytical solutions are not readily available.