The world's simplest HMM-based genefinder

The base composition of most genome sequences is not homogeneous. In particular, GC composition can vary regionally. In the human genome, for instance, there are "CpG islands" which show a very strong statistical signal (even stronger at the dinucleotide composition level than the mononucleotide composition level) and which tend to mark the 5' end of many genes. Therefore the problem of objectively segmenting a genome sequence into regions of different compositions arises naturally. One way to segment a genome is with HMM algorithms.

Let's assume that a genome can be modeled as a two-state HMM, with two states A and B. Assume that the parameters of this HMM are as follows. State A has an AT-biased emission distribution {0.35, 0.15, 0.15, 0.35} for the probabilities of {A,C,G,T}. State B has a GC-biased emission distribution {0.15, 0.35, 0.35, 0.15}. State A switches to state B with probability 0.001. State B switches back to state A with probability 0.01. Assume that the initial distribution for the HMM is uniform; 0.5, 0.5 for the two states. (The file example.hmm contains these parameters; a C module for a simple, general HMM structure and for reading this file format is provided in hmm.c and hmm.h.)

The file example.fa contains a simulated 1 Mb genome sequence, generated by this HMM. Lower case residues were generated by state A; upper case residues by state B.

Viterbi parsing

Implement a Viterbi parser for the HMM above, including the initialization, matrix fill, and traceback stages. Your program should output the alignment as a series of segments, e.g.:
    1    153    state A
    154  252    state B
    253  1651   state A
    (... etc...)
(If you do the traceback backwards and your segments are ordered from the end of the genome to the start, you can pipe the output through UNIX "sort" to get the right order; don't worry about resorting them inside your C program.)

Run your program on the 1 Mb simulated sequence in example.fa. How many segments of the genome are in state B? (The true state path is in the file example.positions.)

Posterior decoding

Implement a posterior decoding parser for the HMM above, including forward, backward, and the forward/backward calculation to obtain the probability that each residue i in the sequence is in state A versus state B. Find all segments of length >=50 for which the posterior probability of state B is >= 0.5 for each residue in the segment, and print them out.

Expectation/maximization

Using your forward and backward implementations (above), implement an expectation maximization algorithm to estimate the parameters of an HMM from randomized starting parameters. (You can assume that the initial distribution \pi is still uniform - you only have one starting point, so it's not really useful to try to estimate the initial distribution from the data. Just estimate the transition and emission probabilities of the two-state HMM.) Run your EM algorithm on the simulated genome in example.fa. Does your algorithm find the same parameters that actually generated the genome sequence?

Beware: the EM algorithm can be computationally intensive. My version of the code (unoptimized) takes about 5-10 minutes to run this problem, and consumes 25 MB of RAM. You might want to watch your program with the "top" command to make sure it doesn't accidentally consume too much memory.

A real application: identifying structural RNA genes in Methanococcus

The Methanococcus jannaschii genome has an interesting property. On average, the base composition of the genome is strongly AT-biased. Some of its genes, though -- the structural RNA genes -- are strongly GC-biased. It turns out that the GC composition of bulk genomes appears to vary freely, but the GC composition of structural RNA genes is strongly correlated with the normal growth temperature of the organism (we think this is because structural RNAs have to maintain a base paired secondary structure; thus, at high temperature, they drive towards high GC because folded high GC sequences have higher thermostability).

The parameters above (and in the example.hmm) are pretty close to what you get from estimating a two-state HMM on M. jannaschii bulk genome and structural RNA genes, with state A as the bulk genome and state B as the RNA genes.

A slightly modified copy of the complete M. jannaschii genome is [here]. (I modified it by changing all the ambiguous nucleotides to A's, so you wouldn't have to deal with them in your programs.)

Material and files that might be of use

Feel free to use any of these C modules to save yourself some work: