Sunday, October 19, 2008

Poll-Driven, Part II

I've mentioned before that I always lose some productivity in the weeks running up to an election because I waste too much time checking the polls. Now I want to raise a mathematical question about the websites that aggregate polling data, particularly the two that I check most: Real Clear Politics and It seems to me that their analysis is off statistically (although both are terrific sites). I don't actually know much about statistics, but I do have an undergraduate degree in mathematics, so I think I at least have a valid question, although I'm not sure of the ultimate answer.

Here's the thing: these websites are gathering up polling data and averaging the data for each race. They weight polls equally. So if one poll shows Obama ahead by 6 in a state and another poll shows Obama ahead by 4, the websites average the two polls and say that Obama is ahead by 5 in that state.

That seems too simplistic to me. At the very least, I think the polls need to be weighted to reflect the number of voters polled in each.

Let’s do a simple example to show why. Suppose one poll surveys 1000 likely voters in a state, and 600 say they’re voting for Obama and 400 say they’re voting for McCain. This poll reports that Obama is ahead 60-40. (I'm making the simplifying assumption that the pollster just reports the raw numbers, which is not what most pollsters actually do, but the issue would be the same regardless.)

Another poll in the same state surveys 500 likely voters, who are evenly split. So this poll reports that the race is 50-50.

Now, the aggregating websites would simply average these two polls and report that Obama is ahead 55-45. But that’s not what the numbers show! Altogether, the two polls surveyed 1500 people. Of those 1500, 850 said they were voting for Obama (600 in the first poll and 250 in the second) and 650 said they were voting for McCain (400 in the first poll and 250 in the second). And 850/1500 = 57% (actually 56.67%), not 55%. So the correct reporting of the two polls combined should be 57-43, not 55-45. That’s a noticeable difference.
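For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two ways of combining the polls. The poll figures are the made-up ones from the example above, and the function names `simple_average` and `pooled_average` are just labels I chose for illustration. This is not how any of the aggregating sites actually compute their numbers.

```python
# Each poll is (sample_size, obama_share). These are the hypothetical
# figures from the example above, not real polling data.
polls = [
    (1000, 0.60),  # first poll: 600 of 1000 for Obama
    (500, 0.50),   # second poll: 250 of 500 for Obama
]


def simple_average(polls):
    """Average Obama's share across polls, treating every poll equally."""
    return sum(share for _, share in polls) / len(polls)


def pooled_average(polls):
    """Average Obama's share, weighting each poll by its sample size."""
    total_respondents = sum(n for n, _ in polls)
    return sum(n * share for n, share in polls) / total_respondents


print(f"simple average: {simple_average(polls):.2%}")
print(f"pooled average: {pooled_average(polls):.2%}")
```

The pooled version is the same as adding up all the respondents across both polls and dividing, so the larger poll counts for twice as much as the smaller one.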

So there’s something wrong with just averaging poll numbers equally.

That’s not even considering the fact that the polls might be taken on different days and use different methodologies. I don’t see any easy way to correct for that, but there is an easy way to correct for the different number of people surveyed in each poll. I think the aggregating websites should take this into account., another poll aggregator, does assign weights based on sample size. But they also assign weights based on other factors that seem pretty subjective.

So the bottom line is that polling, and taking "polls of polls," are more complicated than they appear. The votemaster has a good run-down of polling issues.

Update (10/20): I sent my question to the Votemaster, who kindly sent me a reply, in which he said that my point was correct, but that there were so many other issues regarding polls (such as the order of the questions, or whether party identification is given) that the sample size is "down there in the noise."
