Question about use of Poisson distribution

NoSix · Dec 30, 2003

Let us say that San Jose were going to play DC in DC next week, and I would like to predict the result of that match.

If for simplicity's sake I am willing to assume that: 1) goals are Poisson distributed, and 2) any strength of schedule differences are small enough to be ignored, then it seems to me that if I know that San Jose averaged 1.53 GF and 1.07 GA on the road last season, while DC averaged 1.33 GF and 0.93 GA at home, then I ought to have enough information to make a statistical prediction of the match result, but I'll be damned if I can figure out how to do so.

Beineke, Voros, anyone?

beineke · Jan 1, 2004

Before you can plug any numbers into the Poisson distribution, you still have one more decision to make -- you need to decide how the two teams interact. To estimate a distribution for San Jose's goalscoring in DC, you need to combine your info about San Jose's offense (1.53 gpg) with your info about DC's defense (0.93 gpg). There are many options for doing this, but the simplest is just to take the arithmetic mean (1.23 gpg).

Then you can plug 1.23 into your Poisson distribution and get:
Pr(SJ scores 0) = 29%
Pr(SJ scores 1) = 36%
Pr(SJ scores 2) = 22%
Pr(SJ scores 3) = 9%
Pr(SJ scores 4) = 3%
Pr(SJ scores 5 or more) = 1%

Then you can do the same thing for DC's offense paired with SJ's defense.
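
A minimal Python sketch of this calculation, for anyone who wants to reproduce the table. The names poisson_pmf and sj_mean are illustrative only; the inputs are the averages quoted in this thread.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals are Poisson with the given mean."""
    return exp(-mean) * mean**k / factorial(k)

# San Jose's road scoring (1.53) averaged with DC's home goals allowed (0.93)
sj_mean = (1.53 + 0.93) / 2

probs = [poisson_pmf(k, sj_mean) for k in range(5)]
for k, p in enumerate(probs):
    print(f"Pr(SJ scores {k}) = {p:.0%}")
# Whatever probability is left over belongs to "5 or more"
print(f"Pr(SJ scores 5 or more) = {1 - sum(probs):.0%}")
```

Swapping in (1.33 + 1.07) / 2 for the mean gives the matching table for DC.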

NoSix · Jan 1, 2004

Re: Re: Question about use of Poisson distribution

Yes, thanks, I actually dug out my old college prob and stats text yesterday and hit on the same idea.
Adding up the probability of all outcomes, I got a prediction of 37.0% SJ win/27.5% Draw/35.5% DC win. Even though a draw is the least likely outcome, the most likely single result is 1-1, with a probability of 13%. Interesting!
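
Here is a rough Python sketch of that "add up all the outcomes" step. It assumes the two teams' goal counts are independent Poisson variables, and it cuts the score grid off at 15 goals per side (an arbitrary cap, chosen only because the leftover probability is negligible). The function names are made up for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson mean."""
    return exp(-mean) * mean**k / factorial(k)

def match_probs(away_mean, home_mean, max_goals=15):
    """Return (away win, draw, home win), treating the two scores as independent."""
    away_win = draw = home_win = 0.0
    for a in range(max_goals + 1):
        for h in range(max_goals + 1):
            p = poisson_pmf(a, away_mean) * poisson_pmf(h, home_mean)
            if a > h:
                away_win += p
            elif a == h:
                draw += p
            else:
                home_win += p
    return away_win, draw, home_win

sj_mean = (1.53 + 0.93) / 2  # SJ offense vs DC defense
dc_mean = (1.33 + 1.07) / 2  # DC offense vs SJ defense

sj_win, draw, dc_win = match_probs(sj_mean, dc_mean)
print(f"SJ win {sj_win:.1%} / Draw {draw:.1%} / DC win {dc_win:.1%}")
print(f"Pr(1-1) = {poisson_pmf(1, sj_mean) * poisson_pmf(1, dc_mean):.1%}")
```

The output should land close to the split quoted above, with 1-1 as the single most likely scoreline.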

Thanks again for taking time to respond.

Serie Zed · Jan 1, 2004

I can't believe I missed this forum for this long.

Your method is probably pretty close to how the betting houses post odds, with the exception that they've got a lot more information about how two teams might interact.

beineke · Jan 2, 2004

Re: Re: Re: Question about use of Poisson distribution

Originally posted by NoSix
Even though a draw is the least likely outcome, the most likely single result is 1-1, with a probability of 13%. Interesting!

Last season, 24 of 161 games ended in 1-1 draws -- that's 15%, definitely in the right ballpark. In many cases, the Poisson approximation is pretty good. Then again, we should probably note that the 2003 Quakes were not a very Poisson-like team. They played a total of 34 games, in which 100 goals were scored.

Using the Poisson model, we'd conclude that they only have a 7.8% chance of having 5 or more goals in a game. Over the course of the season, we would expect 2.65 games where that many goals were scored. Instead, it happened to the Quakes 8 times.

Another way to put this is that 50% of the goals (50 of 100) were scored in only 23.5% of their games (8 of 34). When goals came, they came in bunches.

mpruitt · Jan 2, 2004

Originally posted by Serie Zed
I can't believe I missed this forum for this long.

Yeah, I try to pimp this forum as much as possible, but if you want to go into the New Forums thread of Suggestions and ask them to put us on the front page, then that'd be great.

Would someone mind giving us an idiot's guide definition of what a Poisson distribution is?

Serie Zed · Jan 2, 2004

Geek that I am, I thought about this a bit more and think a better approach would look something like...

Generate a mean and standard deviation for goals scored and surrendered by both the home and visiting team.

Here you might just average DC's goals surrendered with San Jose's goals scored (and vice versa). But you could probably find a few simple tweaks that give you a bit more insight into how the two teams might interact.

For example, if DC has the worst defense and San Jose the best offense, you might find (using past data) that you actually move the "average" towards San Jose slightly. Or increase the standard deviation. Or something.

Then you just plug the estimated average goals and estimated standard deviations into Crystal Ball and run trials to see what the range of results is.
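
For readers without Crystal Ball, the "run trials" idea can be sketched in plain Python. Note the swap: this version draws goals from a Poisson distribution, so it only takes a mean per team and does not use a separate standard deviation. The sampler is Knuth's simple multiplication method, and every name here is hypothetical.

```python
import random
from math import exp

def sample_poisson(mean, rng):
    """Draw one Poisson-distributed goal count (Knuth's multiplication method)."""
    limit = exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def simulate(away_mean, home_mean, trials=100_000, seed=1):
    """Play the match many times and return the share of each result."""
    rng = random.Random(seed)
    tally = {"away win": 0, "draw": 0, "home win": 0}
    for _ in range(trials):
        away = sample_poisson(away_mean, rng)
        home = sample_poisson(home_mean, rng)
        if away > home:
            tally["away win"] += 1
        elif away == home:
            tally["draw"] += 1
        else:
            tally["home win"] += 1
    return {result: count / trials for result, count in tally.items()}

# Means taken from the simple averaging approach earlier in the thread
print(simulate(away_mean=1.23, home_mean=1.20))
```

With enough trials the shares should settle near the values you get by summing the Poisson probabilities directly.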

NoSix · Jan 2, 2004

My understanding is that the Poisson distribution is a one parameter distribution, with the variance equal to the mean. As a practical matter, this means it is already "built in to the distribution" that teams that score more goals will also have more variability in the number of goals scored. Of course, I'm not a statistician, but maybe one of them on here can give you an expert opinion on your idea.

NoSix · Jan 2, 2004

Re: Re: Re: Re: Question about use of Poisson distribution

Originally posted by beineke
Last season, 24 of 161 games ended in 1-1 draws -- that's 15%, definitely in the right ballpark. In many cases, the Poisson approximation is pretty good. Then again, we should probably note that the 2003 Quakes were not a very Poisson-like team. They played a total of 34 games, in which 100 goals were scored.

Using the Poisson model, we'd conclude that they only have a 7.8% chance of having 5 or more goals in a game. Over the course of the season, we would expect 2.65 games where that many goals were scored. Instead, it happened to the Quakes 8 times.

Another way to put this is that 50% of the goals (50 of 100) were scored in only 23.5% of their games (8 of 34). When goals came, they came in bunches.

If you use MLS regular season average home and away goals (rather than just SJ and DC) then the predicted probability of a 1-1 draw is 11.4%, though your point remains valid.

I wonder what the probability is of seeing 8 vs. the expected number of 3 games with 5 or more goals. If you go back to JG's post at the end of the season, the Poisson distribution still predicted SJ's points pretty accurately:

Team      GF  GA  Pts  PrPts  Diff
San Jose  45  35  51   48.08  +2.92

As unlikely as they may seem, perhaps the 5 goal outbursts are still just random variation?
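
One rough way to put a number on that question is a binomial tail. This is only a sketch under a simplifying assumption: each of the 34 games is an independent trial with the same chance of being a 5+ goal game. The function name binom_tail is made up for illustration.

```python
from math import comb

def binom_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

games = 34
p_high_scoring = 0.078  # per-game chance of a high-scoring game, as quoted above

# Chance of seeing 8 or more such games in a 34-game season
print(f"{binom_tail(games, p_high_scoring, 8):.4f}")
```

The answer is very sensitive to the per-game probability, so it is worth rerunning with any other value of p_high_scoring you think is more appropriate.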

NoSix · Jan 2, 2004

Re: Re: Re: Re: Question about use of Poisson distribution

Originally posted by beineke
Then again, we should probably note that the 2003 Quakes were not a very Poisson-like team. They played a total of 34 games, in which 100 goals were scored.

Using the Poisson model, we'd conclude that they only have a 7.8% chance of having 5 or more goals in a game. Over the course of the season, we would expect 2.65 games where that many goals were scored. Instead, it happened to the Quakes 8 times.

Another way to put this is that 50% of the goals (50 of 100) were scored in only 23.5% of their games (8 of 34). When goals came, they came in bunches.

By my calculation, if 100 goals were scored in 34 Quakes games, then the probability of 5 or more goals being scored in any one game is 17.48% and the expected number of 5 or more goal games is 6, not so different from 8, or am I screwing up something?

beineke · Jan 3, 2004

Re: Re: Re: Re: Re: Question about use of Poisson distribution

Originally posted by NoSix
By my calculation, if 100 goals were scored in 34 Quakes games, then the probability of 5 or more goals being scored in any one game is 17.48% and the expected number of 5 or more goal games is 6, not so different from 8, or am I screwing up something?

Nice catch ... I had mistakenly used the number for Pr(> 5 goals) instead of Pr(>= 5). So, I take it back -- nothing too surprising about those numbers after all ...
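
For the record, the two tail probabilities can be checked with a few lines of Python. This is a sketch that assumes total goals per Quakes game are Poisson with mean 100/34; the variable names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson mean."""
    return exp(-mean) * mean**k / factorial(k)

games = 34
mean_goals = 100 / games  # total goals (both teams) per Quakes game

# Pr(>= 5) drops the outcomes 0..4; Pr(> 5) also drops exactly 5
p_at_least_5 = 1 - sum(poisson_pmf(k, mean_goals) for k in range(5))
p_more_than_5 = 1 - sum(poisson_pmf(k, mean_goals) for k in range(6))

print(f"Pr(>= 5 goals) = {p_at_least_5:.1%}, expected games = {games * p_at_least_5:.2f}")
print(f"Pr(>  5 goals) = {p_more_than_5:.1%}, expected games = {games * p_more_than_5:.2f}")
```

The first line should come out at roughly 17.5% and about 6 games, the second at roughly 7.8% and about 2.65 games, which is exactly the mix-up described above.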

Serie Zed · Jan 3, 2004

Maybe this is a PM, but...

Could someone who knows explain why you would or wouldn't use Poisson instead of using the mean AND std deviation suggested in the data?

Thanks!

microbrew · Jan 4, 2004

Hmm, a layman's description of the Poisson distribution. I'll take a stab, and I'll leave out the equations. This will sort of answer Serie Zed's post too.

The Poisson distribution comes from a Poisson process. Basically, we're counting events happening in time: when goals occur in a soccer game, when customers arrive at a service station, when packets arrive at a router, or when accidents occur at a particular intersection.

To generate the probability distribution function (pdf) for a Poisson distribution, only one key parameter needs to be known: the average number of events in the given time interval.

From this pdf, you can get a mean, standard deviation, variance, etc. Then, compare them to the ones computed from the data.

There are other distributions. A possibly more familiar one is the normal or Gaussian distribution, what people normally associate with a bell-shaped curve.

More gory details:

A Poisson process has certain criteria: 1) at zero time, zero events have occurred, 2) each event is independently and identically distributed, 3) over a small enough time difference, the probability of two events occurring is nil, and 4) the events occur at a constant average rate.

A soccer game doesn't really fulfill those conditions, but it's close enough. An earlier thread notes that more goals are usually scored towards the ends of games, for most teams and leagues.

I have some links to some papers using the Poisson distribution in sports, but it will take time to dig them out. One link I referred to is from the makers of Mathematica: http://mathworld.wolfram.com/PoissonDistribution.html

NoSix · Jan 4, 2004

Originally posted by Serie Zed
Maybe this is a PM, but...

Could someone who knows explain why you would or wouldn't use Poisson instead of using the mean AND std deviation suggested in the data?

Thanks!

Microbrew explained the Poisson distribution much better than I could have; hopefully that was helpful to you.

Let me try to make my response more explicit. To uniquely specify a Normal distribution requires two independent parameters, the mean and standard deviation. So for a normal distribution it makes sense to fix the mean at one value and look at the impact of different standard deviations on, say, the shape of the PDF.

To uniquely specify a Poisson distribution requires only one independent parameter. The standard deviation of a Poisson distribution is dependent on the mean. Since for any given value of the mean, the standard deviation is a fixed value (which you can calculate), the standard deviation of a Poisson distribution doesn't provide any additional information beyond what the mean provides.
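
To make that concrete: for a Poisson distribution the variance equals the mean, so the standard deviation is just the square root of the mean. A small numerical check in Python (a sketch; the helper name moments and the cutoff of 60 goals are arbitrary choices):

```python
from math import exp, factorial, sqrt

def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson mean."""
    return exp(-mean) * mean**k / factorial(k)

def moments(mean, max_k=60):
    """Mean and variance computed directly from the Poisson probabilities."""
    pmf = [poisson_pmf(k, mean) for k in range(max_k + 1)]
    m = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - m) ** 2 * p for k, p in enumerate(pmf))
    return m, var

for mean in (1.20, 1.23, 2.94):
    m, var = moments(mean)
    print(f"mean {m:.3f}  variance {var:.3f}  std dev {sqrt(var):.3f}  sqrt(mean) {sqrt(mean):.3f}")
```

In each row the variance should match the mean, and the last two columns should agree.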

Serie Zed · Jan 4, 2004

Originally posted by NoSix
To uniquely specify a Poisson distribution requires only one independent parameter. The standard deviation of a Poisson distribution is dependent on the mean. Since for any given value of the mean, the standard deviation is a fixed value (which you can calculate), the standard deviation of a Poisson distribution doesn't provide any additional information beyond what the mean provides.

It's possible that I'm missing the boat entirely here, but I THINK I've got it.

And my (refined) question is...

IF the standard deviation in the actual data is not at least approximately the same as the mean, can/should you still use the Poisson?

In other words, if the Poisson assumes the std dev to be the same as the mean, and they aren't in reality the same, is it a mistake to use Poisson?

beineke · Jan 4, 2004

Originally posted by Serie Zed
In other words, if the Poisson assumes the std dev to be the same as the mean, and they aren't in reality the same, is it a mistake to use Poisson?

Answer: not necessarily.

Somebody once defined statistics along the following lines: "the art of drawing correct conclusions from wrong assumptions." Depending on what you're modeling, you can often get away with using crude approximations to reality.

That's the simplest answer. A more complex answer is that this particular phenomenon -- that the observed variance doesn't match what it "should" be -- is very common in applied statistics. It's known as overdispersion, and there are techniques for adjusting for it. If somebody's really interested, they should be able to find a good reference ... IIRC, there are a couple pages of discussion in McCullagh and Nelder's book on Generalized Linear Models.
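
A quick way to eyeball overdispersion is the ratio of sample variance to sample mean, which should sit near 1 for Poisson-like data. The sketch below uses a made-up list of goal counts purely for illustration (it is NOT real MLS data), and dispersion_index is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)

# Made-up goals-per-game numbers, for illustration only
goals_per_game = [0, 1, 1, 2, 0, 3, 1, 5, 2, 0, 1, 6, 2, 1, 0, 4]

index = dispersion_index(goals_per_game)
print(f"variance / mean = {index:.2f}")
print("looks overdispersed" if index > 1 else "no sign of overdispersion")
```

With only a season's worth of games the ratio is noisy, so a value somewhat above 1 is a hint rather than proof.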

NoSix · Jan 4, 2004

Strength of Schedule

Anyway, back to using the Poisson distribution to predict match results...

I did one semi-neat thing over the weekend, which was to incorporate match-specific predictors which account for strength of schedule differences in the match history of the teams. I got the idea from one of the papers in Microbrew's reading material thread (in this forum). The authors proposed eight different predictors, none of which I really liked, so I made up my own. Basically, the predictor defines the interaction between the two teams, so the algorithm outputs one set (two values, one for each team) of Poisson parameters for each team in the league, home and away versus every other team in the league. The parameters predicted including strength of schedule adjustments are significantly different in some cases from those using the averages of home GF and away GA, and home GA and away GF.

NoSix · Jan 4, 2004

By the way, the reference was posted by Microbrew, but in the Sabremetrics thread, page 4, not the reading material thread. The title is "A Simulation Model for Football Championships".

voros · Jan 4, 2004

Originally posted by beineke
Before you can plug any numbers into the Poisson distribution, you still have one more decision to make -- you need to decide how the two teams interact. To estimate a distribution for San Jose's goalscoring in DC, you need to combine your info about San Jose's offense (1.53 gpg) with your info about DC's defense (0.93 gpg). There are many options for doing this, but the simplest is just to take the arithmetic mean (1.23 gpg).

This method can cause problems.

Let's say Team A plays Team B. Team A scores an average of 2 goals a game. Team B allows an average of 2 goals a game. Let's say the average team in the league allows 1.25 goals a game.

Using the process you showed above, team A would have an expected goals of 2 goals, which is what they normally do. But they average 2 goals against an _average_ team. Team B allows more goals than the average team, so therefore Team A should score more goals than they normally do, not the same amount.

Here's an alternative:

Express goals as a percentage of half-minutes played (i.e., a 90-minute game would be 180 half-minutes). You don't have to be exact; using games * 180 is fine for simplicity's sake. (Using 90 one-minute intervals will work just as well. It's just that, theoretically, if you divided the game into 90 distinct intervals, two goals could be scored in one interval, which is a no-no for Poisson. It still works fine, but a stat geek like myself might blow a gasket.) So, using a real-world example:

San Jose 45 GF, 35 GA, 30 Games, 5,400 half-minutes
Dallas 35 GF, 64 GA, 30 Games, 5,400 half-minutes
League 433 GF, 433 GA, 300 Games, 54,000 half-minutes (only 150 games were actually played but this is used for obvious reasons)

Calculate the rate at which each scored and allowed per half minute

San Jose .0083 GFR, .0065 GAR
Dallas .0065 GFR, .0119 GAR
League GFR and GAR = .008

Now, we'll call San Jose's Goals For rate 'a', Dallas' goals allowed rate 'b' and the league rate 'c'. If we want to know the rate at which San Jose would score on Dallas, which we'll call 'd', we use the following formula:

d = ((a*b)/c)/(((a*b)/c)+(((1-a)*(1-b))/(1-c)))

Or in words: the numerator is San Jose's rate times Dallas' rate, divided by the league rate. The denominator is the numerator plus (one minus San Jose's rate) times (one minus Dallas' rate), divided by (one minus the league rate):

d = ((.0083*.0119)/.008)/(((.0083*.0119)/.008)+(((1-.0083)*(1-.0119))/(1-.008))) = .0123

Since there's 180 half-minutes in a 90 minute game, you multiply .0123 * 180 and get 2.214 goals a game.
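
The same calculation as a short Python sketch, using the season totals listed above instead of the rounded rates. The function name matchup_rate and the constant INTERVALS are illustrative, not part of the original method.

```python
def matchup_rate(a, b, c):
    """Per-interval scoring rate for an offense (a) against a defense (b) in a league (c)."""
    hit = (a * b) / c
    miss = ((1 - a) * (1 - b)) / (1 - c)
    return hit / (hit + miss)

INTERVALS = 180  # half-minutes in a 90-minute game

sj_goals_for_rate = 45 / 5400        # San Jose goals scored per half-minute
dal_goals_allowed_rate = 64 / 5400   # Dallas goals allowed per half-minute
league_rate = 433 / 54000            # league-wide goals per half-minute

d = matchup_rate(sj_goals_for_rate, dal_goals_allowed_rate, league_rate)
print(f"rate per half-minute: {d:.4f}")
print(f"expected goals per game: {d * INTERVALS:.2f}")
```

A handy sanity check on the formula: if the opposing defense is exactly league average (b equal to c), it hands back the offense's own rate unchanged.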

Then you can plug that into the Poisson distribution, or else for consistency's sake you could use a binomial distribution with the number of trials set at 180 (but be creative in how you calculate this or you'll break your computing device). Use Poisson, as the two are going to be real damn close anyway (all Poisson is, is a binomial with those 180 half-minutes divided up into infinitely small pieces).
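
To see how close the two really are, here is a small side-by-side in Python. It is a sketch that assumes 180 independent half-minute trials with the per-interval rate of .0123 from the example; the helper names are made up.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Probability of exactly k goals in n independent intervals."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson mean."""
    return exp(-mean) * mean**k / factorial(k)

n, p = 180, 0.0123  # half-minutes per game, goals per half-minute

print("goals  binomial  poisson")
for k in range(7):
    print(f"{k:>5}  {binom_pmf(k, n, p):.4f}    {poisson_pmf(k, n * p):.4f}")
```

Python's integers do not overflow, so math.comb handles the big binomial coefficients without any special tricks.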

See, the key is that Dallas should give up MORE goals against San Jose than they normally do, since San Jose scores slightly more often than the average team. The method of simply calculating the mean actually has them allowing fewer goals.

NoSix · Jan 4, 2004

Originally posted by voros

Express goals as a percentage of half-minutes played (i.e., a 90-minute game would be 180 half-minutes).

<snip>

Now, we'll call San Jose's Goals For rate 'a', Dallas' goals allowed rate 'b' and the league rate 'c'. If we want to know the rate at which San Jose would score on Dallas, which we'll call 'd', we use the following formula:

d = ((a*b)/c)/(((a*b)/c)+(((1-a)*(1-b))/(1-c)))

Well, if you add up the time it takes to 1) rip your shirt off, 2) make love to the corner flag, 3) jump into the stands, 4) fall out of the stands, 5) get your shirt back on, and 6) get back onto your side of the pitch, my personal opinion as a non-statistician is that you would be safe sticking with 1-minute intervals. ;-)

I like your method. I agree it is a better way of making a result prediction based only on the information in my original post (plus the league averages, obviously). Of course, your method doesn't account for strength of schedule differences, but from a standpoint of simplicity with reasonable accuracy it looks like a good approach.

voros · Jan 4, 2004

Originally posted by NoSix
Well, if you add up the time it takes to 1) rip your shirt off, 2) make love to the corner flag, 3) jump into the stands, 4) fall out of the stands, 5) get your shirt back on, and 6) get back onto your side of the pitch, my personal opinion as a non-statistician is that you would be safe sticking with 1-minute intervals. ;-)

I like your method. I agree it is a better way of making a result prediction based only on the information in my original post (plus the league averages, obviously). Of course, your method doesn't account for strength of schedule differences,

To do that correctly would require busting out one of those power ratings techniques like KRACH modified to do what you like.

Those are some pretty dense creatures (based on recursions, i.e., the computer making lots of guesses until it's right), and all that's going to affect is the accuracy of your estimates of a team's goal scoring (and preventing) ability. It shouldn't have any bearing on the methods used after that.

Anyway, the above formula can be reversed so that if the quakes averaged .0083 goals per half minute, the league average was .008 and the average Quakes opponent during the year had a rate of .0087 allowed (that's just a random number I picked)...

...this time we designate the quakes rate as 'd', the average opponents rate as 'b' the league average remains 'c', and we do some algebra to solve for 'a' we get:

a = ((b*c*d)-(c*d))/((b*d)-(c*d)-b+(b*c))

or

a = ((.0087*.008*.0083)-(.008*.0083))/((.0087*.0083)-(.008*.0083)-.0087+(.0087*.008)) = .0076

What this means is that a team that scores an average of .0083 goals per half minute against a team that allows .0087 per half minute in a league that averages .008 per half minute, will score .0076 against an "average" team.

IOW, you could use that formula to adjust for competition level if you'd like.
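
A short Python sketch of the reversed formula, with a round-trip check against the forward one. The function names neutral_rate and matchup_rate are hypothetical, and the .0087 opponent rate is the same arbitrary number used in the example above.

```python
def matchup_rate(a, b, c):
    """Forward formula: rate for offense a against defense b in a league with rate c."""
    hit = (a * b) / c
    miss = ((1 - a) * (1 - b)) / (1 - c)
    return hit / (hit + miss)

def neutral_rate(d, b, c):
    """Reversed formula: the rate against an average team, given observed rate d versus defenses with rate b."""
    return ((b * c * d) - (c * d)) / ((b * d) - (c * d) - b + (b * c))

observed = 0.0083           # Quakes goals per half-minute
avg_opp_allowed = 0.0087    # arbitrary example value from the post
league = 0.008

a = neutral_rate(observed, avg_opp_allowed, league)
print(f"adjusted rate: {a:.4f}")

# Feeding the adjusted rate back through the forward formula should recover the observed rate
print(f"round trip:    {matchup_rate(a, avg_opp_allowed, league):.4f}")
```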

mpruitt · Jan 4, 2004

As one of the people who really pushed to get this forum up, this is a proud moment. I knew there'd be a day when there'd be a discussion in a thread that would have me completely lost. This is just that. Kudos guys.

NoSix · Jan 4, 2004

Originally posted by maxim-1
As one of the people who really pushed to get this forum up, this is a proud moment. I knew there'd be a day when there'd be a discussion in a thread that would have me completely lost. This is just that. Kudos guys.

Where did you get lost - was it at ripping the shirt off, or falling out of the stands... ;-)

NoSix · Jan 4, 2004

Originally posted by voros
To do that correctly would require busting out one of those power ratings techniques like KRACH modified to do what you like.

Those are some pretty dense creatures (based on recursions, i.e., the computer making lots of guesses until it's right), and all that's going to affect is the accuracy of your estimates of a team's goal scoring (and preventing) ability. It shouldn't have any bearing on the methods used after that.

Well, as I indicated in my earlier post, the match-specific estimator approach accounts for strength of schedule and I was able to implement it in an Excel spreadsheet.

beineke · Jan 5, 2004

Originally posted by voros
Here's an alternative:

Or in words: the numerator is San Jose's rate times Dallas' rate, divided by the league rate. The denominator is the numerator plus (one minus San Jose's rate) times (one minus Dallas' rate), divided by (one minus the league rate):

In effect, you're proposing a multiplicative model instead of an additive one. That seems reasonable.

But it's worth pointing out that there is a simple way to plug the league-wide average L into the original additive model. Instead of the average of the team rates [(A + B)/ 2], you can use the sum of the rates minus the league average [A + B - L].

IMO, the additive model is easier to work with, but the multiplicative model is probably a little more realistic.
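
To compare the three options side by side, here is a rough Python sketch using the Team A / Team B example from earlier in the thread (2 goals scored per game, 2 allowed, league average 1.25). The function names are made up, and the multiplicative version assumes the 180 half-minute convention from voros's post.

```python
def mean_model(a, b):
    """Simple average of the offense's goals for and the defense's goals allowed."""
    return (a + b) / 2

def additive_model(a, b, league):
    """Sum of the two rates minus the league average."""
    return a + b - league

def multiplicative_model(a, b, league, intervals=180):
    """voros's formula, applied to per-game averages converted to per-interval rates."""
    ra, rb, rc = a / intervals, b / intervals, league / intervals
    hit = (ra * rb) / rc
    miss = ((1 - ra) * (1 - rb)) / (1 - rc)
    return intervals * hit / (hit + miss)

team_a_scores, team_b_allows, league_avg = 2.0, 2.0, 1.25

print(f"mean:           {mean_model(team_a_scores, team_b_allows):.2f}")
print(f"additive:       {additive_model(team_a_scores, team_b_allows, league_avg):.2f}")
print(f"multiplicative: {multiplicative_model(team_a_scores, team_b_allows, league_avg):.2f}")
```

Both league-aware versions return Team A's own average when the opposing defense is exactly league average, which the plain mean does not.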