There are not equal proportions of type A and type B dice in the bag. The fraction of type A dice in the bag is q_A. The fraction of type B dice is q_B.
I have the superpower of being able to produce biased dice, with any probability of rolling 1 thru 6 that I want. The probability that a type A die rolls a 1 is p_A1, the probability that a type B die rolls a 5 is p_B5, etc.
Thus, a complete statistical model of the bag of dice consists of 14 probability parameters: q_A, p_A1..p_A6, q_B, and qB1..qB6.
Your problem is to infer all fourteen of those probability parameters, only by pulling one die at a time from the bag, rolling it some number of times, and repeating that process for some number of dice -- even when you don't for sure what type of die You are rolling.
Write a program that estimates the 14 probability parameters by maximum likelihood, given the data in the counts1 file. Your submitted answer will consist of both the C source code and the program's output (six parameter estimates).
To make this easier, you may use emfuncs.h and emfuncs.c. These provide some functions that will save you some work - in particular, a ReadDatafile() function for reading the input format. Read the comments in emfuncs.c; it has a little testdriver program at the end of the file, which shows you how the API to these functions works.
Hint: If you want to check your program against other data files, the program that I used to generate the counts1 file is here. You can compile it and use it to generate test data sets. The params2 file used in the next section gives an example of the input parameter file format that generate.c uses.
q p_1 p_2 p_3 p_4 p_5 p_6
-- ----- ----- ----- ----- ----- -----
A: .25 .3 .14 .14 .14 .14 .14
B: .75 .1666 .1666 .1666 .1666 .1666 .1666
These parameters are also in the file params2.
One at a time, I pull 20 dice from this bag and roll each one 80
times. The results are in the file counts2.
Write a C program that reads this file and, for each line (e.g. each die) estimates P(A | data) -- the probability that a type A die generated this line of observations -- and P(B | data) -- the probability that a type B die generated this line of observations. Output (for each line) the best (maximum posterior) guess of which die type was rolled to generate this line of observations, and what the posterior probability is.
Your submitted answer will consist of both the C program and its output -- 20 lines, each with the best classification and the posterior probability of that classificiation.
Hint: to check that your program works, use the provided generate.c program to generate test data sets for which you know the correct answer. Pipe the output through sed to strip off the labels; e.g., the file was produced using the command:
% ./generate params2 20 80 > counts2.labelled % sed -e 's/^[AB]//' counts2.labelled > counts2Vary not only the parameters, but also the number of dice and the number of rolls, and compare the posterior guesses to the true answers (in the labelled data file). Obviously, the more data you see, and the more distinct the dice parameters are, the more likely it is that you will guess right.
Write a C program that reads this file and estimates all 14 probability parameters, using expectation maximization. Your submitted answer will consist of both the C program and its output (fourteen parameter estimates).
Hint: Don't worry about being too fancy about testing for EM convergence. In the answer key program, I just run 20 iterations and check by eye for whether it converges (20 is more than enough, for this problem). A more robust solution would be to look for the summed difference between the new and old parameter set to be less than some number.
Further hint: Again, test your program using the