The world's simplest HMM-based genefinder
The base composition of most genome sequences is not homogeneous. In
particular, GC composition can vary regionally. In the human genome,
for instance, there are "CpG islands" which show a very strong
statistical signal (even stronger at the dinucleotide composition
level than the mononucleotide composition level) and which tend to
mark the 5' end of many genes. Therefore the problem of objectively
segmenting a genome sequence into regions of different compositions
arises naturally. One way to segment a genome is with HMM algorithms.
Let's assume that a genome can be modeled as a two-state HMM, with two
states A and B. Assume that the parameters of this HMM are as
follows. State A has an AT-biased emission distribution {0.35, 0.15,
0.15, 0.35} for the probabilities of {A,C,G,T}. State B has a GC-biased
emission distribution {0.15, 0.35, 0.35, 0.15}. State A switches to state
B with probability 0.001. State B switches back to state A with
probability 0.01. Assume that the initial distribution for the HMM is
uniform; 0.5, 0.5 for the two states. (The file example.hmm contains these parameters; a C
module for a simple, general HMM structure and for reading this file
format is provided in hmm.c and hmm.h.)
The file example.fa contains a simulated 1 Mb
genome sequence, generated by this HMM. Lower case residues were
generated by state A; upper case residues by state B.
Viterbi parsing
Implement a Viterbi parser for the HMM above, including the
initialization, matrix fill, and traceback stages. Your program should
output the alignment as a series of segments, e.g.:
1 153 state A
154 252 state B
253 1651 state A
(... etc...)
(If you do the traceback backwards and your segments are ordered from
the end of the genome to the start, you can pipe the output through
UNIX "sort" to get the right order; don't worry about resorting them inside
your C program.)
Run your program on the 1 Mb simulated sequence in example.fa. How many segments of the genome are
in state B? (The true state path is in the file example.positions.)
Posterior decoding
Implement a posterior decoding parser for the HMM above, including
forward, backward, and the forward/backward calculation to obtain the
probability that each residue i in the sequence is in state A versus
state B. Find all segments of length >=50 for which the posterior
probability of state B is >= 0.5 for each residue in the segment, and
print them out.
Expectation/maximization
Using your forward and backward implementations (above), implement an
expectation maximization algorithm to estimate the parameters of an
HMM from randomized starting parameters. (You can assume that the
initial distribution \pi is still uniform - you only have one starting
point, so it's not really useful to try to estimate the initial
distribution from the data. Just estimate the transition and emission
probabilities of the two-state HMM.) Run your EM algorithm on the
simulated genome in example.fa. Does your
algorithm find the same parameters that actually generated the genome
sequence?
Beware: the EM algorithm can be computationally intensive. My
version of the code (unoptimized) takes about 5-10 minutes to run this
problem, and consumes 25 MB of RAM. You might want to watch your
program with the "top" command to make sure it doesn't accidentally
consume too much memory.
A real application: identifying structural RNA genes in Methanococcus
The Methanococcus jannaschii genome has an interesting
property. On average, the base composition of the genome is strongly
AT-biased. Some of its genes, though -- the structural RNA genes --
are strongly GC-biased. It turns out that the GC composition of bulk
genomes appears to vary freely, but the GC composition of structural
RNA genes is strongly correlated with the normal growth temperature of
the organism (we think this is because structural RNAs have to
maintain a base paired secondary structure; thus, at high temperature,
they drive towards high GC because folded high GC sequences have
higher thermostability).
The parameters above (and in the example.hmm) are pretty close to what you get
from estimating a two-state HMM on M. jannaschii bulk genome and
structural RNA genes, with state A as the bulk genome and state B as
the RNA genes.
A slightly modified copy of the complete M. jannaschii genome
is [here]. (I modified it by changing all the
ambiguous nucleotides to A's, so you wouldn't have to deal with them
in your programs.)
- Run your viterbi and posterior decoding parsers
on the real M. jannaschii genome.
How many RNA regions do you detect?
A file containing the positions of known transfer RNA genes
is [here]. How many of these
are detected by the Viterbi parse?
- Run your EM algorithm on the M. jannaschii
genome. Do you estimate about same parameters as above, or
does your EM algorithm find a different kind of segmentation
pattern in the genome?
Material and files that might be of use
Feel free to use any of these C modules to save yourself some work:
- hmm.c, hmm.h:
support for a general HMM structure, including allocation,
initialization, and reading an HMM from a file.
- example.hmm: the HMM parameters
given above, in the format parsed by the ReadHMMFile() routine
in hmm.c.
- fasta.c, fasta.h:
a C module for FASTA sequence file input.
- vectorops.c, vectorops.h:
Routines for dealing with probability vectors.
- sre_random.c, sre_random.h:
My portable random number generator, and various distribution-sampling routines.
- overlap, a Perl script for calculating
sensitivity and "specificity" (really positive predictive value, PPV)
given a file of predicted segment coords and a file of
true segment coords.