ibbly.com

Testing predicted probabilities

July 2009

Often we would like to predict not the actual outcome of an event ("It will rain tomorrow", "Liverpool will beat Derby") but instead the probability of something happening ("There's a 20% chance of rain", "Liverpool have a 75% chance of beating Derby"). We'd like to be able to come up with lots of different prediction techniques and see which one is the best.

So what's a good way to test which prediction method works best? We can't tell based on just one prediction, but we should be able to compare methods over a large number of trials (rainfall over a year, football results over a season).

If we're dealing with a repeatable event (like tossing a coin) then we can get a good estimate of the answer by repeating it (toss it a large number of times). But our examples, the weather and a football match, aren't repeatable: tomorrow's chance of rain will be different from today's because the initial atmospheric conditions are different, and the two football teams won't play each other at the same venue until next season, by which time some of the players will have changed.

We can think of a world where each event has a true probability t of happening. There's no way for us to observe t, but it's useful to consider when assessing our method for testing our predictions. If p is our predicted probability then we're trying to get p to be close to t.

So some conditions we want our prediction-scoring method to satisfy are:
(i) It can't use t (since we can't observe it). It can only use the predictions p and the outcomes (whether the event happened or not).
(ii) It should give the optimal score when p is exactly equal to t.

Condition (i) says that we want a score that looks something like:
Score(p) = A(p) if the event happens
Score(p) = B(p) if the event doesn't happen

The expected score is then E=tA(p)+(1-t)B(p).

Condition (ii) says that E should be optimal when p=t. So dE/dp = tdA/dp + (1-t)dB/dp should be zero when evaluated at p=t.

Solution 1

One solution to this is to set dA/dp=-2(1-p) and dB/dp=2p. (The 2s aren't necessary but make the next step neater.)

Integrating, A=-2p+p^2+a and B=p^2+b, where a and b are constants. A convenient choice is a=1, b=0 giving A=(p-1)^2 and B=p^2

We can summarise this as saying: the score is the square of the probability we gave to the outcome that didn't happen.

To check the value of p that gives the optimal score we'll write p=t+e (where e is an error term). The expected score is t(1-p)^2+(1-t)p^2 and when we substitute in p=t+e the expected score reduces to t(1-t)+e^2.

This confirms that we've got a sensible scoring method (it gives the optimal value when the error e is nil) and also tells us that a low score is optimal.
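
As a quick numerical check of the algebra above, here is a minimal Python sketch. The function names statistic1 and expected_statistic1, the example value t = 0.3 and the grid of candidate predictions are illustrative choices of ours, not part of the derivation.

```python
def statistic1(p, happened):
    """Score for one prediction: the square of the probability
    given to the outcome that did not occur (lower is better)."""
    return (1 - p) ** 2 if happened else p ** 2


def expected_statistic1(p, t):
    """Expected score of prediction p when the true probability is t."""
    return t * statistic1(p, True) + (1 - t) * statistic1(p, False)


t = 0.3  # assumed true probability, chosen only for illustration
candidates = [i / 100 for i in range(101)]  # predictions 0.00, 0.01, ..., 1.00

# Find the candidate prediction with the lowest expected score.
best_p = min(candidates, key=lambda p: expected_statistic1(p, t))
print("best prediction:", best_p)

# The expected score should match t(1-t) + e^2 for an error e = p - t.
e = 0.1
assert abs(expected_statistic1(t + e, t) - (t * (1 - t) + e ** 2)) < 1e-12
```

On this grid the lowest expected score should land on the candidate equal to t, in line with the t(1-t)+e^2 result.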

We can extend this to a three-way prediction (eg a team can win, draw, or lose a football match) in a natural way. Effectively we're coming up with two predictions: probability p1 of a home win and probability p2 of an away win (and probability 1-p1-p2 of a draw) so we can set the score to be the sum of the scores of each of these predictions.

Doing this gives E = t1*(1-p1)^2 + (1-t1)*p1^2 + t2*(1-p2)^2 + (1-t2)*p2^2

Then putting p1=t1+e1, p2=t2+e2 gives E = t1(1-t1) + t2(1-t2) + e1^2 + e2^2

So this is minimised when e1 and e2 are both zero, i.e. when p1=t1 and p2=t2.
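
The three-way score can be sketched in Python along the same lines. The function names, the outcome labels "home", "away" and "draw", and the example probabilities are assumptions made for illustration only.

```python
def three_way_score(p_home, p_away, outcome):
    """Sum of the two binary scores (home win, away win).
    outcome is one of "home", "away" or "draw"; lower is better."""
    home_score = (1 - p_home) ** 2 if outcome == "home" else p_home ** 2
    away_score = (1 - p_away) ** 2 if outcome == "away" else p_away ** 2
    return home_score + away_score


def expected_three_way(p_home, p_away, t_home, t_away):
    """Expected score when the true probabilities are t_home and t_away
    (the draw takes whatever probability is left over)."""
    t_draw = 1 - t_home - t_away
    return (t_home * three_way_score(p_home, p_away, "home")
            + t_away * three_way_score(p_home, p_away, "away")
            + t_draw * three_way_score(p_home, p_away, "draw"))


# Illustrative values: true probabilities and the errors in our predictions.
t1, t2 = 0.5, 0.2
e1, e2 = 0.1, -0.05

expected = expected_three_way(t1 + e1, t2 + e2, t1, t2)
formula = t1 * (1 - t1) + t2 * (1 - t2) + e1 ** 2 + e2 ** 2
print(expected, formula)
assert abs(expected - formula) < 1e-12
```

Averaging the per-outcome scores over the three outcomes reproduces the expression for E, because the probability of "not a home win" is 1-t1 and of "not an away win" is 1-t2.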

Name of "Statistic 1"

I'd be surprised if this hasn't been derived before. But I'm struggling to find out what it's called. If you know, please tell me.

Solution 2

Another solution to 0 = dE/dp = tdA/dp + (1-t)dB/dp when p=t is dA/dp=1/p and dB/dp=-1/(1-p).

Integrating this gives A=log(p) and B=log(1-p), so the score is given by adding up the logs of the probabilities of the events that occur. This is just the traditional log-likelihood.
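
Here is a similar hedged sketch for the log score. Note that for this score a high (less negative) value is optimal, since the second derivative of the expected score with respect to p is negative. The function names and the value t = 0.3 are again illustrative assumptions.

```python
import math


def log_score(p, happened):
    """Log of the probability given to the outcome that occurred
    (higher is better). Requires 0 < p < 1."""
    return math.log(p) if happened else math.log(1 - p)


def expected_log_score(p, t):
    """Expected log score of prediction p when the true probability is t."""
    return t * log_score(p, True) + (1 - t) * log_score(p, False)


t = 0.3  # assumed true probability, chosen only for illustration

# Skip p = 0 and p = 1, where the log of zero is undefined.
candidates = [i / 100 for i in range(1, 100)]

# Higher is better for this score, so we look for the maximum.
best_p = max(candidates, key=lambda p: expected_log_score(p, t))
print("best prediction:", best_p)
```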

Which to use?

Suppose two people are making predictions about how a coin will land when tossed. The predictions, and the actual outcomes from a million tosses, are as follows:

A) Heads=25% Tails=74% Edge=1%
B) Heads=50% Tails=50%
Outcome Heads=500,000 Tails=500,000

Both log-likelihood and "statistic 1" say that B's predictions are better than A's.

Now suppose the predictions were the same but the outcome was slightly different: Heads=500,000 Tails=499,999 Edge=1

"Statistic 1" again says that B's predictions are better than A's. But because an event happened which B said had a probability of zero (the coin landing on an edge), log-likelihood gives B a score of (informally) "minus infinity".

I would tend to say that B's prediction better represents the actual outcomes. But the log-likelihood statistic favours A under the second outcome. Even though A gets the balance between heads and tails badly wrong, A is rewarded for giving the remote edge event a non-zero probability, despite A's probability for the edge differing from the empirical frequency by a factor of 10,000.

Under log-likelihood, assigning a zero probability to something that then happens is a heinous error and is punished (infinitely) more than any other type of mistake. While I can see the theoretical justification, it seems harsh for some practical purposes.
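
The coin comparison can be reproduced with a short Python sketch. Two assumptions here are ours: "Statistic 1" is extended to the coin by summing the binary scores for Heads and Tails and treating Edge as the leftover outcome (mirroring the draw in the football case), and a zero probability for an outcome that occurs is reported as minus infinity rather than raising an error. All names are illustrative.

```python
import math


def log_likelihood(probs, counts):
    """Sum of log-probabilities of the observed outcomes (higher is better).
    Returns -inf if an outcome with zero predicted probability occurred."""
    total = 0.0
    for outcome, n in counts.items():
        if n == 0:
            continue  # outcomes that never occurred contribute nothing
        if probs[outcome] == 0:
            return float("-inf")
        total += n * math.log(probs[outcome])
    return total


def statistic1_total(probs, counts):
    """Total of the Heads and Tails binary scores over all tosses
    (lower is better). Edge is treated as the leftover outcome."""
    total = 0.0
    for outcome, n in counts.items():
        for scored in ("heads", "tails"):
            p = probs[scored]
            total += n * ((1 - p) ** 2 if outcome == scored else p ** 2)
    return total


pred_a = {"heads": 0.25, "tails": 0.74, "edge": 0.01}
pred_b = {"heads": 0.50, "tails": 0.50, "edge": 0.00}
outcome_1 = {"heads": 500_000, "tails": 500_000, "edge": 0}
outcome_2 = {"heads": 500_000, "tails": 499_999, "edge": 1}

for label, counts in (("first outcome", outcome_1), ("second outcome", outcome_2)):
    print(label)
    print("  statistic 1:    A =", statistic1_total(pred_a, counts),
          " B =", statistic1_total(pred_b, counts))
    print("  log-likelihood: A =", log_likelihood(pred_a, counts),
          " B =", log_likelihood(pred_b, counts))
```

Under these assumptions B has the lower "Statistic 1" total for both outcomes, while the single edge landing is enough to send B's log-likelihood to minus infinity.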

We'll use "Statistic 1" (I really need a better name) when playing around with football rankings.

