voros
12 Jul 2006, 08:24 PM
The discussion on the US youth board here:
http://www.bigsoccer.com/forum/showthread.php?t=379969
was on the verge of becoming a technical one, so I decided to move that part of the discussion here.
Numerista's concern with my system (as I understand it) was that the strength of schedule adjustments for results in 2005 are affected by the results those teams had in 1999 and 2001. So a strong 2005 Mexican U17 team would count as a weaker strength of schedule factor than the opposition its opponents actually faced in 2005. And of course there is the inverse problem for teams who faced them in previous years, i.e. playing Mexico in 2005 was more difficult than playing them in 2001.
This is easy enough to solve. All you have to do is calculate separate ratings for each cycle and then average them up, or use a weighted average favoring more recent results. That way Mexico's 2003 quality has no effect on the strength of schedule factor for their opponents in 2005 (and vice-versa).
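A minimal sketch of that per-cycle approach in Python, for illustration only. The ratings and weights below are made-up numbers, and `combine_cycles` is a hypothetical helper, not part of the actual rating system:

```python
def combine_cycles(cycle_ratings, weights):
    """Weighted average of one team's per-cycle ratings."""
    total_weight = sum(weights[cycle] for cycle in cycle_ratings)
    weighted_sum = sum(cycle_ratings[cycle] * weights[cycle] for cycle in cycle_ratings)
    return weighted_sum / total_weight

# Hypothetical ratings for one team, each computed from that cycle's matches only,
# so no cycle's results leak into another cycle's strength of schedule.
mexico = {2001: 80.0, 2003: 100.0, 2005: 140.0}

# Hypothetical weights favoring more recent cycles.
recent_weights = {2001: 1.0, 2003: 2.0, 2005: 3.0}
equal_weights = {2001: 1.0, 2003: 1.0, 2005: 1.0}

combined_recent = combine_cycles(mexico, recent_weights)
combined_equal = combine_cycles(mexico, equal_weights)
print(round(combined_recent, 1), round(combined_equal, 1))
```

With equal weights this is the plain average; with the recency weights the 2005 cycle counts three times as much as 2001.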
My counter-argument to this is that any set of rankings, even one built on a very large sample, has some inherent error range. The rankings represent our best guess as to each team's strength, but we assume that this guess is somewhat unreliable. If we do the ratings based on a single cycle, and the error in each team's rating is essentially random, numerista's suggestion still works. That is, if the information in those rankings is all we really have to go on, then despite the inherent inaccuracies, we've constructed things as best we can.
The problem of course is that results from the 2003, 2001 and 1999 cycles likely do give us clues as to the direction and size of the error in the 2005 cycle ratings. Here's a good recent example. It is very easy for me to use my Poisson win% formula to calculate win rankings for national teams (they are about as good as doing two separate rankings, though less informative). As such, it is also easy for me to construct the rating system around the 64 World Cup matches that just took place. Here are the final rankings:
1. Italy - 1054
2. France - 1054
3. Switzerland - 857
4. Brazil - 847
5. Spain - 463
6. South Korea - 319
7. Ghana - 251
8. Ukraine - 216
9. Australia - 175
10. Czech Republic - 157
11. Croatia - 128
12. U.S.A. - 109
13. Germany - 108
14. Portugal - 96
15. Togo - 94
16. England - 87
17. Argentina - 86
18. Japan - 74
19. Tunisia - 52
20. Netherlands - 47
21. Saudi Arabia - 43
22. Ecuador - 37
23. Sweden - 36
24. Mexico - 34
25. Paraguay - 24
26. Ivory Coast - 22
27. Angola - 20
28. Iran - 11
29. Poland - 11
30. T & T - 10
31. Serbia and Montenegro - 8
32. Costa Rica - 6
As you can see, many of these ratings don't match the public perception of how well various teams did. The US is ahead of Germany, Ukraine is ahead of Argentina, etc. Why? The short answer is sample size. But why did the small sample size have this result?
Because, to put it simply, we have additional information at our disposal with which to gauge team strength, information that goes beyond their World Cup results. We believe Argentina is a stronger side than Ukraine, based largely on information gleaned prior to the tournament. So to the extent that a team's World Cup results do not reflect its overall strength as a side, this in turn affects the ratings of the opponents it faced in the cup. Mexico tying Argentina is a much better result than the Swiss tying Ukraine, but the above ratings don't see it that way, and so Mexico and Argentina both suffer relative to Switzerland and Ukraine.
IOW, while how well a team played in 2004 has no direct bearing on how they played in 2006, given our limited sample, it does give us hints as to the overall strength of various teams in 2006 that go beyond their 2006 results. Therefore, if there tends to be much consistency in the strength of national teams from year to year (there is), knowing what teams did in 2005, 2004, 2003, 2002 and 2001 does give us more information with which to interpret the results in 2006.
And this holds true for youth teams as well. If Spain or Argentina fails to make it out of the first round of a youth tournament, the teams that beat them probably deserve credit for beating a youth team that has consistently been excellent in those tournaments, rather than simply dismissing the Argentina team as being unusually subpar.
If in a youth cycle Spain rates a 90 and Australia rates a 120, we know those ratings have some error range compared to the actual strength of those teams. My contention is that the error in those ratings is not random, but rather that previous youth tournament results tell us something about the extent and the direction to which those ratings 'miss'. If the errors are not random, and we assume Spain has a much better chance than Australia of having been better than its rating, and a much lower chance of having been worse, then of course the ratings themselves should be changed to accommodate this info. Once the ratings for those teams change, everybody's rating changes.
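One simple way to picture that adjustment is to pull each single-cycle rating toward a prior built from earlier cycles, with the pull getting weaker as the cycle's sample grows. The sketch below is an assumption for illustration, not the method the ratings actually use: the prior values, the match count and the constant `k` are all hypothetical, and only the 90 and 120 come from the example above.

```python
def shrink(current, prior, n_matches, k=10.0):
    """Blend a single-cycle rating with a prior from earlier cycles.

    The weight on the current rating grows with the number of matches
    behind it. k is a hypothetical tuning constant: larger k means the
    prior is trusted more.
    """
    w = n_matches / (n_matches + k)
    return w * current + (1.0 - w) * prior

# Single-cycle ratings from the example in the text.
current = {"Spain": 90.0, "Australia": 120.0}

# Hypothetical priors from previous youth cycles (made-up numbers).
prior = {"Spain": 150.0, "Australia": 80.0}

n = 5  # hypothetical number of matches per team in the cycle

adjusted = {team: shrink(current[team], prior[team], n) for team in current}
for team, rating in adjusted.items():
    print(team, round(rating, 1))
```

With so few matches the prior dominates, so Spain ends up rated above Australia despite the lower single-cycle number. The sketch stops there; in a full system the adjusted ratings would feed back into every opponent's strength of schedule, which is why everybody's rating changes.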
And so through that prism, it makes sense to use results from 2003 and 2001 to gauge the results from 2005. Without large team strength independence from cycle to cycle or large samples per cycle (and we are without both), how we evaluate what happened in 2005 has much to do with what happened in previous cycles.