Hi All, Was having one of those moments, when I just thought can the Pythagorean Expectation (http://en.wikipedia.org/wiki/Pythagorean_expectation) work for a sport where there are a lot of draws (e.g. I know it works fine for baseball and basketball) without going in to, a lot of complex maths (I did A-Levels in maths but that was a while ago). In short I had EPL data from 2003 to 2011 (grabbed from various places including wikipedia for convenience) and so thought I would give it go (see attached pdf for what happened next). I decided to try and work it out based on points as I figured the win% route wasn't the way to go (I tried it and didn't like the answers). Basically I assumed the exponent route would provide an optimal result as opposed to squaring everything (e.g. like Daryl Morey did for Basketball), set up a spreadsheet in Excel with the formula for predicted points referring to the Exponent cell (I just assumed 0.8 to start with), and set it up like a least squares problem and asked it minimise the error with the exponent cell as the variable. I didn't like the result this gave (average error was in excess of 5), so I added a second variable into the equation. Ran the solver routine and got a result where the average error was 0 (to a few dp). Out of 160 results between the 2003/04 season and the 2010/11 season 65% fall within 5 points of the predicted result (based on an exponent of 1.01965121131284, and then multiplying the whole equation by 0.869567219205756 - I don't know if this implied 13.1% of games over those season were draws?). I think the values could be optimised if I was looking at less than 8 season but I was looking to get a general feel. With 65% of the results falling within 5 points I was quite happy, yet I was more interested in the 35% that didn't. Under achieving teams by 5-15 points are highlighted in red, over achieving teams by 5-15 points are show in yellow, overachieving teams by over 15 points are shown in green. What interested me was that there was roughly an equal number of teams that slightly over achieved and under achieved (e.g by 5-15 points 26 over and 27 teams under) but only 3 teams that really overachieved Arsenal in 04/05, and Arsenal and Chelsea in 09/10. In short has anyone else looked into this (bizarrely I was keen to try and apply this to Handball or Australian Rules), without getting too deep into the maths has anyone got a more accurate way of running this, does anyone have any thoughts on this (e.g. I can have 09/10 Chelsea being a team that overachieved as based on least squares ratings I have them as the EPL team of the last decade, but what about some of the others?). I'm happy to bounce something around or back and forth. Thanks,
Dr. Howard Hamilton, who makes http://www.soccermetrics.net, uses an exponent of 1.7 for soccer. He has posted tables with an estimated number of wins, draws, and losses (not just points). Without knowing as much math as he does, I've done the following in a spreadsheet: 1. Enter every club's goals scored, goals allowed, and points for a completed season. 2. Had a spreadsheet calculate goals scored^1.7, goals allowed^1.7, and goals scored^1.7/(goals scored^1.7 + goals allowed^1.7) 3. For each club, I took their value of goals scored^1.7/(goals scored^1.7 + goals allowed^1.7) and divided by the sum of that for the whole league. 4. I multiplied the result of Step 3 by the league's total points for the season so that the sum of the estimates would equal the actual points. Dr. Hamilton's estimates yield totals close to, but not exactly the same as, the points earned by all the clubs combined. You asked if 13.1% of games are draws and it's much more than that.
Thanks for the info, that's really interesting. I had also found the same site (I only really started looking at this last Sunday and it has been a bit of a busy week). I took one look at the equation on there and thought - no thanks (I am fairly okay mathematically I usually like to see a half worked example, or how it could be done in Excel). Perhaps I should have looked at it in a touch more depth though as what you've suggested sounds workable. Dependent on the way you set the formula up in the sheet (and how you set the sheet up), I am just taking a guess that you could have the exponent value of 1.7, set up as a variable and then use the solver function (I'm assuming that you are using Excel) to get the optimal value based on the goals scored and goals conceeded data?
I don't know what the solver function is. I could manually adjust the exponent from 1.7 to nearby numbers like 1.68 and see which correlates best with actual point totals, but that will vary slightly by season and league so I'm fine with using what Howard Hamilton says. Something else I think would be interesting if it was included in tables is each club's goal differential in games they won and in games they lost. If a club has the fourth most points and a negative goal differential in a 20 club league like Everton in 2004-2005, they must have had their average margin of defeat in losses be greater than their average margin of victory in wins, but there's no way to calculate these based on the standings tables. You have to look at the scores of each game to do that. Edit: I have many files on my computer about soccer statistics with titles such as "English Premier League 2008-2009 Goals and Point Percentages- Exponents and Correlations" that I can e-mail you if you tell me your e-mail address.
Solver is essentially an optimization algorithm feature within Excel: http://office.microsoft.com/en-us/excel-help/about-solver-HP005198368.aspx?pid=CH010004571033 In the sheet I had I set all of the formulas in the predicted points column, all referring to the same exp value cell in the top left (my initial guess at an exp value was 0.8). I then worked out the error of the prediction and squared it (to remove any negative values). I then set solver the task of minimizing the sum of the squared errors (the 5755 value up the top) which would effectively reduce the error of each prediction by only allowing it to vary the exp value - hence I got the optimum exp value that would minimize any errors (if that all makes sense). It's essentially least squares regression (something which I never understood when I was in school as I could never see where it was going). I'm totally happy to switch files, I have a load of regression based analysis (basic stuff) which I'd be happy to bounce back and forth. Really looking forward to this Opta data being released (which I mentioned in my other post) as well, should be fun to work with.
Yeah I think this is do-able as well. If you had the EPL season match data: http://www.football-data.co.uk/englandm.php You could set this up in Excel with a COUNTIF formula and a few other things around it to ensure you got what you wanted (I have a sheet somewhere that would do this I think - or I know another forum where I would get pointed in the right direction if I made a start). I'm happy to bounce something around?
Evan - I did it your way and had a classic result; it provided me with the same exponent value to about 5 decimal places as I had originally (I ran it through solver to optimise it) with my way above - but the different steps that you did removed the need for the second variable (the 0.86 figure) - I haven't followed the maths through but it gives me virtually the same set of values in every case to 3 to 4 dp in terms of predicted points. The interesting things is that this leaves me with the same questions as I had in my OP (e.g. did the 09/10 Chelsea team really overachieve by that much?). For some reason I couldn't attach a pdf when I posted the reply but I have a pdf of the results to back this up.
Some discussion of this in the sticky at the top of the forum, mainly in the earlier posts. Also see Pythag in soccer, another look.
Thanks. I had spotted the Voros thread, but hadn't checked the sticky at the top. The Voros method looked more involved than I was looking for, but looking at it in a different way now - I think I have seen a way of making work cleanly and simply within a spreadsheet. Still can't post a pdf of the sheet I have doing things Evan's way for some reason...........
Did the Voros method using the same data as above (EPL 2003 - 2011), and ran it through solver to get optimised exponent values (I don't know which league he was working off and over how many years so I may not be being fair). Instead of 0.3235 and 0.3948 I got 0.090692838 and 0.037077542. Now I'd just need to figure how to work the winning percentage (on those terms with a draw equaling 0.5 wins) backwards, to start figuring out wins, draws and losses (which I think is do-able). It's a different way of doing things and may be a cleaner way forward. Still can't post a pdf of this data for some reason but am happy to share.