Distributing points uniformly on a sphere

Earlier this semester I had a project where I had to calculate the solvent-accessible surface area of a protein given its 3-dimensional structure (in a PDB file). The suggested method of implementation was to simulate the surface of each atom by distributing 500 points (roughly) evenly across the surface defined by its van der Waals radius. Although that sounds straightforward, it actually took me a while to figure out how to randomly sample points from a sphere defined by a point and a radius.

I did some extensive searching on Google, but most of the methods I found came with the caveat that there would be a greater concentration of points near the poles than at the equator. I finally came across this page, which offered a clear and concise method for sampling points from a sphere.

• Set up a 2-dimensional coordinate system: the z dimension (running from -R to R, where R is the van der Waals radius) and $\phi$ (longitude, running from 0 to $2\pi$ radians)
• Generate random coordinates: $z_i \sim \text{Uniform}(-R, R)$ and $\phi_i \sim \text{Uniform}(0, 2\pi)$
• Use the relation $z = R \sin{\theta}$ to calculate the latitude $\theta$: $\theta = \sin^{-1}(\frac{z}{R})$
• Finally, convert to Cartesian (xyz) coordinates
• $x = R \cos{\theta}\cos{\phi}$
• $y = R \cos{\theta}\sin{\phi}$
• $z = R \sin{\theta} = \text{(*gasp!*)} z$

Repeat this process 500 times for each atom to simulate its surface!
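The recipe above is short enough to sketch directly in code. Here is a minimal Python version (the function name and the point-tuple layout are my own choices, not part of the assignment):

```python
import math
import random

def sample_sphere_surface(center, radius, n=500):
    """Sample n points uniformly from the surface of a sphere.

    Draws z and the longitude phi independently and uniformly,
    then recovers the latitude theta from z = R*sin(theta).
    """
    cx, cy, cz = center
    points = []
    for _ in range(n):
        z = random.uniform(-radius, radius)    # uniform in z, from -R to R
        phi = random.uniform(0, 2 * math.pi)   # uniform longitude
        theta = math.asin(z / radius)          # latitude from z = R*sin(theta)
        x = radius * math.cos(theta) * math.cos(phi)
        y = radius * math.cos(theta) * math.sin(phi)
        points.append((cx + x, cy + y, cz + z))
    return points
```

Every sampled point lands exactly on the sphere of the given radius, and the uniform draw in $z$ is what avoids the clustering at the poles mentioned above.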

Rodrigues’ rotation formula

I’m taking a structural informatics course this semester, and this is my first in-depth exposure to working with biomolecular structures. The class’s first assignment was to download a specific PDB file, parse the atom coordinates, and determine all sorts of bond lengths, bond angles, and torsion angles. This required me to review concepts I hadn’t used much since linear algebra and multivariable calculus, but all in all this part wasn’t too hard.

The last part of the assignment was a kicker, though. I had to select the 30th residue in the protein, set two of its dihedral angles to 0° (effectively rotating the remaining portion of the structure along bonds between backbone atoms), and recompute the new atomic coordinates. I had no idea where to start!

One of the other students in the class mentioned Rodrigues’ rotation formula, and after looking into things, it seemed to be the answer to our question. In a general sense, if you want to rotate a vector $\vec{v}$ by an angle $\theta$ about an axis of rotation defined by the unit vector $\vec{k}$, then the new rotated vector can be computed as follows. $\vec{v}_{\text{rot}} = \vec{v} \cos{\theta} + (\vec{k} \times \vec{v})\sin{\theta} + \vec{k}(\vec{k} \cdot \vec{v})(1 - \cos{\theta})$

In the context of the homework assignment, $\vec{k}$ is the bond around which I’m rotating (N-CA for $\phi$, CA-C for $\psi$), $\theta$ is the angle I need to set to 0 ( $\phi$ or $\psi$), and the rotation is applied to the coordinate vector $\vec{v}$ for each subsequent atom affected by the rotation.
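As a sanity check on the formula, here is a minimal Python sketch (the function name is mine; note that it normalizes $\vec{k}$ first, since the formula assumes a unit-length axis):

```python
import math

def rodrigues_rotate(v, k, theta):
    """Rotate vector v by angle theta (radians) about the axis k.

    k is normalized to a unit vector first, as the formula requires.
    """
    norm = math.sqrt(sum(c * c for c in k))
    kx, ky, kz = (c / norm for c in k)
    vx, vy, vz = v
    # cross product k x v
    cx = ky * vz - kz * vy
    cy = kz * vx - kx * vz
    cz = kx * vy - ky * vx
    dot = kx * vx + ky * vy + kz * vz  # k . v
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (
        vx * cos_t + cx * sin_t + kx * dot * (1 - cos_t),
        vy * cos_t + cy * sin_t + ky * dot * (1 - cos_t),
        vz * cos_t + cz * sin_t + kz * dot * (1 - cos_t),
    )
```

For the homework setting, $\vec{k}$ would be the direction of the backbone bond and $\vec{v}$ the coordinates of each downstream atom (translated so the rotation axis passes through the origin).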

F1 score and gene annotation comparisons

Burset and Guigó published a foundational paper on evaluating gene annotations in 1996. A lot of my work as a graduate student has involved writing software for comparing multiple sets of gene structure annotations against each other, and I’ve used the statistics described in this paper (matching coefficient, correlation coefficient, sensitivity, specificity) as the basis for my comparisons. However, there is another statistic (called the F1 score) that is (apparently) used commonly for analyzing gene annotations, and someone recently recommended that I should include this statistic in my comparisons. Never having heard of it, I decided to investigate.

I found several papers that referenced the F1 score (none of them were related to gene structure annotation, by the way) and was eventually able to track down the origin of the statistic. It was introduced by van Rijsbergen in 1979 in a text on information retrieval and has since found application in a variety of fields. The F1 score combines two other commonly used statistics: the precision $P$ (defined as the ratio of true positives to all predicted positives) and the recall $R$ (defined as the ratio of true positives to all actual positives).

With precision defined as $P = \frac{TP}{TP + FP}$

and recall defined as $R = \frac{TP}{TP + FN}$

we now define the F1 score as their harmonic mean. $F1 = \frac{2PR}{P + R}$
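Translating the formulas into code takes only a few lines. A sketch (the function name and the raw-count interface are my own choices):

```python
def f1_score(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```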

After 15 minutes of searching, 10 minutes of reading, and 5 minutes of coding, my software now has a new feature!

Notation for HMM algorithms

I was exposed briefly to Hidden Markov Models (HMMs) as an undergraduate, but now as a graduate student I have revisited HMMs in much more detail (I need to know it “forward” and “backward”, if you catch my drift). Our instructor taught us with very simple examples (two hidden states, very short sequences), and the notation he used was specific to each example. This made it easy to grasp the concepts initially, but I’m not sure I would have been able to implement more complicated HMMs with only that notation to help me. In preparing for a recent exam, however, I’ve finally come across some notation that is more generalizable. I’ll use it here to describe the forward, backward, and Viterbi algorithms.

Terminology

First let us provide some formal definitions. We have an observed sequence $O = (O_1 O_2 ... O_N)$ where $O_i \in A = \{ a_1, a_2, ..., a_r \}$ for all $i \in \{1, 2, ..., N\}$. For each $O_i$, there is an unobserved state $Q_i \in S = \{ S_1, S_2, ..., S_s \}$ forming the corresponding hidden state sequence $Q = (Q_1 Q_2 ... Q_N)$. Additionally we define initial state probabilities $\pi$ (where $\pi_{S_i}$ is the probability of starting a sequence in state $S_i$), transition probabilities $\tau$ (where $\tau_{S_i S_j}$ is the probability of moving from state $S_i$ to state $S_j$), and emission (or output) probabilities $\theta$ (where $\theta_{a_i | S_j}$ is the probability of observing the symbol $a_i$ given that we know the corresponding hidden state is $S_j$).

Forward algorithm

The forward algorithm provides the most straightforward way to determine the probability of an observed sequence given the HMM (i.e. $Prob\{O_1 O_2 ... O_N\}$). This algorithm introduces a probability $F_{i,j} = Prob\{O_1 O_2 ... O_i, Q_i = S_j\}$, which we can conceptualize as an $s \times N$ matrix. We initialize the first column of this matrix like so. $F_{1,j} = \pi_{S_j}\theta_{O_1 | S_j}$

We can then continue to fill out the rest of the matrix using the following recurrence. $F_{i,j} = \sum_{k=1}^{s} F_{i-1,k} \tau_{S_k S_j} \theta_{O_i | S_j}$

Once the matrix is filled out, we can get our desired probability $Prob\{O\} = Prob\{O_1 O_2 ... O_N\}$ by summing over the values in the last column of the matrix. $Prob\{O\} = \sum_{k=1}^{s} F_{N, k}$
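The initialization, recurrence, and final sum translate almost directly into code. A minimal Python sketch, assuming the model is passed as plain dictionaries (`pi[s]`, `tau[s][t]`, `theta[s][a]` — this layout is my own convention, not standard):

```python
def forward(obs, states, pi, tau, theta):
    """Forward algorithm: returns Prob{O} for the observed sequence obs.

    pi[s]        -- initial probability of state s
    tau[s][t]    -- transition probability s -> t
    theta[s][a]  -- probability of emitting symbol a in state s
    """
    # F[i][s] = Prob{O_1..O_i, Q_i = s}; initialize the first column
    F = [{s: pi[s] * theta[s][obs[0]] for s in states}]
    for i in range(1, len(obs)):
        F.append({
            s: sum(F[i - 1][t] * tau[t][s] for t in states) * theta[s][obs[i]]
            for s in states
        })
    return sum(F[-1][s] for s in states)  # Prob{O}: sum over the last column
```

(For long sequences a real implementation would work in log space or rescale each column to avoid underflow, but that is beside the point here.)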

Backward algorithm

The backward algorithm can also be used to find the probability of an observed sequence, but it is most useful in combination with the forward algorithm to determine the probability of a state at a specific position given an output sequence.

The backward algorithm introduces a probability $B_{i,j} = Prob\{O_{i+1} O_{i+2} ...O_N | Q_i = S_j\}$, which we can also conceptualize as an $s \times N$ matrix. We first initialize the last column of the matrix like so. $B_{N, j} = 1$

Then, as the algorithm’s name suggests, we work backwards to fill out the rest of the matrix using the following recurrence. $B_{i,j} = \sum_{k=1}^{s} B_{i+1,k}\tau_{S_j S_k}\theta_{O_{i+1}|S_k}$
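A sketch of this in Python, again assuming the model is passed as plain dictionaries (`pi[s]`, `tau[s][t]`, `theta[s][a]` — a convention of my own, not standard). Summing $\pi_{S_j}\theta_{O_1|S_j}B_{1,j}$ over states recovers $Prob\{O\}$, which gives a handy consistency check against the forward algorithm:

```python
def backward(obs, states, pi, tau, theta):
    """Backward algorithm: returns Prob{O} for the observed sequence obs.

    B[i][s] = Prob{O_{i+1}..O_N | Q_i = s}, filled from right to left.
    """
    N = len(obs)
    B = [dict() for _ in range(N)]
    B[N - 1] = {s: 1.0 for s in states}  # last column is all 1s
    for i in range(N - 2, -1, -1):
        B[i] = {
            s: sum(B[i + 1][t] * tau[s][t] * theta[t][obs[i + 1]] for t in states)
            for s in states
        }
    # Prob{O} = sum over states of pi * emission of O_1 * B_1
    return sum(pi[s] * theta[s][obs[0]] * B[0][s] for s in states)
```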

Viterbi algorithm

The Viterbi algorithm is a slight variation of the forward algorithm that answers a slightly different question: what is the most probable hidden state sequence given an observed sequence (in formal terms, which hidden state sequence $Q$ maximizes the probability $Prob\{Q | O\}$)?

Here we introduce another probability which we can conceptualize as an $s \times N$ matrix. $V_{i,j} = \max_{Q_1...Q_i}Prob\{O_1 ... O_i, Q_1 ... Q_{i-1}, Q_i = S_j\}$

We initialize this matrix exactly the same way we do in the forward algorithm. $V_{1,j} = \pi_{S_j}\theta_{O_1 | S_j}$

We then continue to fill out the rest of the matrix using the following recurrence. $V_{i,j} = \theta_{O_i | S_j} \max_{k=\{1 ... s\}}V_{i-1,k}\tau_{S_k S_j}$

We also maintain a matrix of pointers $V'$, where $V'_{i,j}$ records the value of $k$ that maximizes $V_{i-1,k}\tau_{S_k S_j}$ (used for backtracing). To find the most probable hidden state sequence, identify the highest value in the last column of the matrix (or $\max_{j=\{1 ... s\}} V_{N,j}$) and use the matrix of pointers $V'$ to trace backward and identify the sequence of states that produced the maximal probability.
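Putting the recurrence and the backtrace together, here is a minimal Python sketch, using the same kind of plain-dictionary model layout as in my other notes (`pi[s]`, `tau[s][t]`, `theta[s][a]` — my own convention):

```python
def viterbi(obs, states, pi, tau, theta):
    """Viterbi algorithm: most probable hidden state sequence for obs."""
    # V[i][s] corresponds to V_{i,s}; initialize like the forward algorithm
    V = [{s: pi[s] * theta[s][obs[0]] for s in states}]
    back = []  # back[i][s] = predecessor state that maximized V[i][s]
    for i in range(1, len(obs)):
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda t: V[i - 1][t] * tau[t][s])
            col[s] = theta[s][obs[i]] * V[i - 1][best] * tau[best][s]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The only differences from the forward sketch are the `max` in place of the `sum` and the pointer matrix used for the backtrace.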