Monday, January 21, 2008

Predicting a Primary Winner before CNN Does

It is primary season, and several primaries are being held. After each one, the networks show the returns and eventually call a winner. Sometimes they call the winner immediately after the event closes, as with the Nevada Republican caucuses on 2008 January 19, when they called Romney the winner. That night, however, they took a long time to call the winner of the South Carolina Republican primary and of the Nevada Democratic caucuses, which were held the same day.

Nevertheless, I was able to call the winner in the South Carolina primary at 7:16 pm, as described on my Beyond Opinion site. How did I do this?

It turns out that CNN provides election totals for the candidates after the polls or caucuses close. These don't help with close contests because early leads can easily reverse. CNN does carry out exit or entrance polls, however. These break down the vote by various characteristics of the electorate, including male/female, church attendance, party affiliation, feelings on immigration and so forth. Here is an example of one of the exit polls in the Republican South Carolina primary, after extracting to Microsoft Excel:

| Feelings About Bush Administration | Share of Electorate | Huckabee | McCain | Romney | Thompson |
|---|---|---|---|---|---|
| Enthusiastic | -0.17 | 0.28 | 0.34 | 0.18 | 0.18 |
| Satisfied | -0.52 | 0.35 | 0.3 | 0.14 | 0.17 |
| Dissatisfied | -0.25 | 0.29 | 0.38 | 0.14 | 0.13 |
| Angry | -0.05 | 0.15 | 0.44 | 0.22 | 0.12 |

The second column shows the share of the electorate in each of four categories: Enthusiastic, Satisfied, Dissatisfied, and Angry. All values are written as fractions, so 0.17 means 17%. CNN lists the category shares in parentheses, which Excel interprets as negative numbers, and so it stuck a minus sign in front of each entry. But they are really positive, and they give the distribution of the electorate across these four categories. Note that I have deleted the minor candidates to avoid distorting the web page with a wide table. This means that the candidate entries in each row will not sum to 1; the difference is the total of the minor candidates.
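
The sign problem can also be handled before the numbers ever reach a spreadsheet. Here is a minimal Python sketch, assuming the share arrives as text in the form "(17%)"; that exact format and the function names `parse_share` and `fix_sign` are my assumptions for illustration, not anything CNN documents.

```python
def parse_share(cell: str) -> float:
    """Convert a parenthesized share such as "(17%)" to the fraction 0.17.

    The "(17%)" format is an assumption about how the exit poll page
    displays the category share.
    """
    text = cell.strip().strip("()").rstrip("%")
    return float(text) / 100.0


def fix_sign(value: float) -> float:
    """Undo Excel's reading of a parenthesized number as negative."""
    return abs(value)


print(parse_share("(17%)"))
print(fix_sign(-0.52))
```

Either route gives positive shares, which removes the need for a compensating minus sign later on.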

For each category, the remaining columns show how the voters in that category split among the candidates. So those who were Satisfied with Bush voted 2% for Giuliani, 35% for Huckabee, 0% for Hunter, and so forth. The 35%, or 0.35, is then a conditional probability: the probability that a voter voted for Huckabee given that the voter was satisfied with Bush. The formula for a conditional probability is:

p(A|B) = p(A & B)/p(B)

where A & B means both A and B.

Therefore:

p(Huckabee|satisfied) = p(Huckabee & satisfied)/p(satisfied)

Rearranging gives the joint probability as a product of two numbers that both appear in the table:

p(Huckabee & satisfied) = p(Huckabee|satisfied) × p(satisfied)

Now if one sums these products over all the categories, one gets:

p(Huckabee|enthusiastic) × p(enthusiastic) + p(Huckabee|satisfied) × p(satisfied) + p(Huckabee|dissatisfied) × p(dissatisfied) + p(Huckabee|angry) × p(angry)

But this is the same as

p(Huckabee & (enthusiastic or satisfied or dissatisfied or angry))

This is simply p(Huckabee), the share of the vote that went to Huckabee, assuming every person polled falls into exactly one category, that is, could not say "none of the above". So this gives us a means of finding each candidate's share of the vote.
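
Weighting each conditional probability by its category's share of the electorate and adding the results can be checked by hand. Here is the arithmetic for Huckabee, using the figures in the table; the symbol C_i is just my shorthand for the four categories.

```latex
\begin{align*}
p(\text{Huckabee}) &= \sum_{i} p(\text{Huckabee} \mid C_i)\, p(C_i) \\
&= 0.28 \times 0.17 + 0.35 \times 0.52 + 0.29 \times 0.25 + 0.15 \times 0.05 \\
&= 0.0476 + 0.1820 + 0.0725 + 0.0075 \\
&= 0.3096
\end{align*}
```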

To do this, one must multiply two columns of this array entry by entry and add the products together. It turns out that Excel has a function, namely SUMPRODUCT, that does this. So one could enter in the cell below the Giuliani column:

-SUMPRODUCT($B2:$B5,C2:C5)

This gives the Giuliani vote. Then simply copy the formula across to the other candidates. I put a minus sign in front to counteract Excel's unwarranted assumption that the category shares are negative. The dollar signs tell Excel to keep the column fixed at B; that is, always use the category shares rather than shifting across the spreadsheet as the formula is copied. The result is:
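
| Feelings About Bush Administration | Share of Electorate | Huckabee | McCain | Romney | Thompson |
|---|---|---|---|---|---|
| Enthusiastic | -0.17 | 0.28 | 0.34 | 0.18 | 0.18 |
| Satisfied | -0.52 | 0.35 | 0.3 | 0.14 | 0.17 |
| Dissatisfied | -0.25 | 0.29 | 0.38 | 0.14 | 0.13 |
| Angry | -0.05 | 0.15 | 0.44 | 0.22 | 0.12 |
| Estimated vote | | 0.3096 | 0.3308 | 0.1444 | 0.1545 |
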
This shows that McCain won with 33% of the vote, with Huckabee at 31%, Thompson at 15%, and Romney at 14%. These numbers, and hence my spreadsheet analysis, were available 30-45 minutes before CNN called the race for McCain.
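
For readers without Excel, here is a minimal Python sketch of the same SUMPRODUCT calculation. The names `shares`, `support`, and `estimate_vote` are my own choices for illustration. It uses only the entries displayed in the table, with the shares entered as positive numbers, so the figures for Romney and Thompson may not match the spreadsheet row exactly.

```python
# Share of the electorate in each category (entered as positive fractions;
# the minus signs in the spreadsheet were an import artifact).
shares = {"Enthusiastic": 0.17, "Satisfied": 0.52, "Dissatisfied": 0.25, "Angry": 0.05}

# p(candidate | category), copied from the displayed table.
support = {
    "Huckabee": {"Enthusiastic": 0.28, "Satisfied": 0.35, "Dissatisfied": 0.29, "Angry": 0.15},
    "McCain":   {"Enthusiastic": 0.34, "Satisfied": 0.30, "Dissatisfied": 0.38, "Angry": 0.44},
    "Romney":   {"Enthusiastic": 0.18, "Satisfied": 0.14, "Dissatisfied": 0.14, "Angry": 0.22},
    "Thompson": {"Enthusiastic": 0.18, "Satisfied": 0.17, "Dissatisfied": 0.13, "Angry": 0.12},
}


def estimate_vote(shares, support):
    """Sum p(candidate | category) * p(category) over categories, per candidate."""
    return {
        candidate: sum(shares[cat] * by_category[cat] for cat in shares)
        for candidate, by_category in support.items()
    }


estimates = estimate_vote(shares, support)
leader = max(estimates, key=estimates.get)

# Print candidates from highest to lowest estimated share.
for candidate, value in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{candidate}: {value:.4f}")
print("Projected winner:", leader)
```

The dictionary lookup by category plays the role of the dollar signs in the Excel formula: every candidate is weighted by the same column of category shares.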

I did the same with the Democratic caucuses in Nevada and concluded that Hillary Clinton won. This technique will be useful for all the following primaries, provided CNN publishes an exit or entrance poll immediately after the polls close and does not call a winner right away. It is good to double check by repeating the calculation with two or three different breakdowns of the electorate, to be sure that about the same results are obtained each time.
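
The double check can be made mechanical. Here is a rough Python sketch, assuming each breakdown has already been reduced to a dictionary of estimated vote shares per candidate; the function names and that input format are hypothetical, chosen only for illustration.

```python
def max_spread(estimates_by_breakdown):
    """Return, per candidate, the gap between the highest and lowest estimate.

    estimates_by_breakdown: list of dicts mapping candidate name to estimated
    vote share, one dict per exit-poll breakdown (hypothetical input format).
    """
    candidates = set().union(*estimates_by_breakdown)
    spread = {}
    for candidate in candidates:
        values = [est[candidate] for est in estimates_by_breakdown if candidate in est]
        spread[candidate] = max(values) - min(values)
    return spread


def same_leader(estimates_by_breakdown):
    """True if every breakdown puts the same candidate in first place."""
    leaders = {max(est, key=est.get) for est in estimates_by_breakdown}
    return len(leaders) == 1
```

If `same_leader` comes back false, or the spread for the top two candidates is larger than the gap between them, the race is probably too close to call from the exit poll alone.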

How good is this technique? Only as good as the exit polling of CNN (and other networks). Exit polls are much more reliable than early election returns, since they cover the entire state, whereas returns come in unevenly, for example first the urban results and then the rural ones that require hand-counting. There has been only one major error of an exit poll that I know about, namely the call of Florida for Gore in 2000.