Homework 3 (designed by Dr. Sean Eddy for Bio5495 at WashU)

Parameter estimation by expectation-maximization

I give you a heavy bag containing lots of dice. There are two types of dice in the bag; call them type A and type B dice.

There are not equal proportions of type A and type B dice in the bag. The fraction of type A dice in the bag is q_A. The fraction of type B dice is q_B.

I have the superpower of being able to produce biased dice, with any probability of rolling 1 thru 6 that I want. The probability that a type A die rolls a 1 is p_A1, the probability that a type B die rolls a 5 is p_B5, etc.

Thus, a complete statistical model of the bag of dice consists of 14 probability parameters: q_A, p_A1..p_A6, q_B, and qB1..qB6.

Your problem is to infer all fourteen of those probability parameters, only by pulling one die at a time from the bag, rolling it some number of times, and repeating that process for some number of dice -- even when you don't for sure what type of die You are rolling.

A. Inferring parameters when I tell you the die type.

One at a time, I pull 100 dice from the bag, and (for now) I kindly tell you whether each die is type A or type B. Then I roll it 10 times. The observations you get from this are in the file counts1. Each line of this file represents the observed rolls of one die, so there are 100 lines in the file; the first field on each line tells you whether that die was a type A or type B die.

Write a program that estimates the 14 probability parameters by maximum likelihood, given the data in the counts1 file. Your submitted answer will consist of both the C source code and the program's output (six parameter estimates).

To make this easier, you may use emfuncs.h and emfuncs.c. These provide some functions that will save you some work - in particular, a ReadDatafile() function for reading the input format. Read the comments in emfuncs.c; it has a little testdriver program at the end of the file, which shows you how the API to these functions works.

Hint: If you want to check your program against other data files, the program that I used to generate the counts1 file is here. You can compile it and use it to generate test data sets. The params2 file used in the next section gives an example of the input parameter file format that generate.c uses.

B. Inferring the type of die when I tell you the parameters.

I pull out a new bag of dice. Now, I graciously tell you the parameters of this bag are:
 
      q     p_1   p_2   p_3   p_4   p_5   p_6
      --   ----- ----- ----- ----- ----- -----
  A: .25   .3    .14   .14   .14   .14   .14
  B: .75   .1666 .1666 .1666 .1666 .1666 .1666
These parameters are also in the file params2. One at a time, I pull 20 dice from this bag and roll each one 80 times. The results are in the file counts2.

Write a C program that reads this file and, for each line (e.g. each die) estimates P(A | data) -- the probability that a type A die generated this line of observations -- and P(B | data) -- the probability that a type B die generated this line of observations. Output (for each line) the best (maximum posterior) guess of which die type was rolled to generate this line of observations, and what the posterior probability is.

Your submitted answer will consist of both the C program and its output -- 20 lines, each with the best classification and the posterior probability of that classificiation.

Hint: to check that your program works, use the provided generate.c program to generate test data sets for which you know the correct answer. Pipe the output through sed to strip off the labels; e.g., the file was produced using the command:

   % ./generate params2 20 80 > counts2.labelled
   % sed -e 's/^[AB]//' counts2.labelled > counts2
Vary not only the parameters, but also the number of dice and the number of rolls, and compare the posterior guesses to the true answers (in the labelled data file). Obviously, the more data you see, and the more distinct the dice parameters are, the more likely it is that you will guess right.

C. Inferring everything, just from observed rolls: Expectation/Maximization

I pull out a new bag of dice. Now I smile evilly, and I tell you nothing about the parameters or the dice types -- as, one at a time, I pull 600 dice from the bag, and roll each one of them 80 times as you watch. The results are in the file counts3.

Write a C program that reads this file and estimates all 14 probability parameters, using expectation maximization. Your submitted answer will consist of both the C program and its output (fourteen parameter estimates).

Hint: Don't worry about being too fancy about testing for EM convergence. In the answer key program, I just run 20 iterations and check by eye for whether it converges (20 is more than enough, for this problem). A more robust solution would be to look for the summed difference between the new and old parameter set to be less than some number.

Further hint: Again, test your program using the generate.c program to generate data sets from known parameters, and test whether your EM program does a reasonable job of reproducing those parameters. (Am I doing enough to convince you to always test your new cool programs on simulated data sets of known properties, before applying them to unknown data sets?)