Alternative Rating Systems

Discussion in 'Women's College' started by cpthomas, Nov 25, 2014.

  1. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #1 cpthomas, Nov 25, 2014
    Last edited: Nov 25, 2014
    Rather than using the 2014 NCAA Tournament Bracket thread to discuss Elo and other systems, let's use this one. That way, anyone wanting in the future to look back at the Tournament Bracket discussions can go there and anyone wanting to discuss alternative systems can come here.
     
  2. Nacional Tijuana

    Nacional Tijuana St. Louis City

    St. Louis City SC
    May 6, 2003
    San Diego, Calif.
    Club:
    Seattle Sounders
    Nat'l Team:
    United States
    Very cool! I never did stats in high school or college, and compared to some BSers and tweeps, even though I am far and away the soccer nerd of my family, I watch far less soccer than most of you. But I've always wanted to create my own system. I just never got around to it because I didn't know where to start. Don't want it too complicated, but not too basic, either. And fair (whatever that even means. lol)
     
  3. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Kolabear, here's a piece of info for what you've been posting about using Massey as a starting point. I just ran some numbers based on five years of Massey ratings that I now have in hand, 2010 through 2014. In calculating home field advantage based on his numbers, it's worth +0.037 to the home team in his numbering system and -0.037 to the away team. So, the total home field advantage rating spread is 0.074 on the Massey scale. This is not based on all teams; rather, it is based on all regular season games (including conference tournament games) played over the last five years by top 60 teams, so the number should be pretty reliable.
     
  4. kolabear

    kolabear Member+

    Nov 10, 2006
    los angeles
    Nat'l Team:
    United States
    Huh. Really?! If I'm plugging the numbers in correctly (and my conversion to an Albyn Jones scale is admittedly a bastardization), that's only 17 or 18 rating points in the Albyn Jones scale.

    Between equally rated teams, that gives the team with homefield about a (.530) expected win percentage instead of (.500). That seems very low to me...
     
  5. kolabear

    kolabear Member+

    Nov 10, 2006
    los angeles
    Nat'l Team:
    United States
    #5 kolabear, Nov 26, 2014
    Last edited: Nov 26, 2014
    For those interested in an Elo-type rating system, I'm going to try to illustrate a couple of aspects of them, as I understand them, in a very simplified way. I'll use, as some of you have noticed I usually do, Prof. Albyn Jones' rating system as an example. Prof. Jones, for several years, compiled and published ratings for college women's soccer.

    1) An Elo rating takes the form of a number, that team's rating, which can be thought of as a statistical measure of the team's strength when compared with the ratings of other teams. In Prof. Albyn Jones' ratings, a team's rating was a 3- or 4-digit integer. The median was typically around 1350, and the ratings ranged from a high usually around 2100 to a low somewhere around 500 or 600.

    2) The numbers in and of themselves don't mean anything; it's the difference between one team's rating and another's that matters, and that depends on the scale chosen by whoever designed the ratings. Different rating systems can choose different scales. Prof. Jones used a scale where a 100 rating point difference represented a (.667) expected win probability for the team with the higher rating. 200 points represented a (.800) expected win probability, 300 points a (.889) expected win probability, and so on.

    So (in Prof. Jones' scale) if Team A is rated 1600 and Team B is rated 1500, Team A is expected to win approximately (.667) of the time (for simplicity, I'm ignoring homefield). If Team B were rated 1400, Team A would be expected to win approximately (.800) of the time.
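    Here's a little Python sketch for anyone who wants to play with these numbers. I don't know Prof. Jones' exact formula, but a simple curve where the odds double for every 100 rating points reproduces the probabilities above, so that's what I'm assuming here:

    ```python
    # Assumed scale (not necessarily Prof. Jones' exact formula): the odds of
    # the higher-rated team winning double for every 100 rating points.
    def expected_win_probability(rating_a, rating_b, scale=100.0):
        """Expected win probability for Team A against Team B (ties ignored)."""
        diff = rating_a - rating_b
        return 1.0 / (1.0 + 2.0 ** (-diff / scale))

    print(round(expected_win_probability(1600, 1500), 3))  # ~0.667
    print(round(expected_win_probability(1600, 1400), 3))  # ~0.800
    ```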

    For someone like cpthomas, who's interested in exploring Elo systems further and perhaps even implementing one, I would like to point out a couple of aspects that I find important, keeping in mind that I'm not a mathematician or statistician - I'm not an expert.

    A) If you have all the teams and their ratings, and you look at the schedule of a team and the ratings of the teams they played, you should be able to calculate or verify, at least approximately, their rating in a fairly simple way.

    Example: if Team A played 18 games and went 12-6, and the average rating of their opponents was 1500, their rating should be 1600 or close to it. Their record of 12-6 (.667) corresponds to a 100-point rating difference in the Albyn Jones scale, which I'm using as an example. So the average rating of their opponents (1500) + the rating differential given by their win percentage (100) = 1600.

    This is obviously very simplified. In particular, I'm ignoring any weighting of games for being played more recently or earlier in the season. But in a simplified way, this "back of the envelope" calculation should hold true. If it didn't, something would be wrong. I'm quite sure others with far greater math and stats backgrounds have said this has to be true because it's the condition the rating algorithm is trying to solve for - to produce a list of ratings such that teams rated 100 points higher than the teams below them win (.667) of the time, teams rated 200 points higher win (.800) of the time, etc. (Again, for the purpose of illustration, I'm using Prof. Albyn Jones' scale.)

    B) I think there's an important principle, though not one that every Elo system follows absolutely: generally speaking, you don't lose rating points just for playing a much weaker team, as long as you win (although you certainly do lose points by playing a much weaker team and losing!). Conversely, you don't gain rating points just for playing a much stronger team if you lose.

    In chess ratings, this was part of the formula, I believe, although I don't know if it's changed somewhat over the years. By playing much weaker opponents, you wouldn't gain many rating points by winning but on the other hand you wouldn't lose any just because they were bringing down the average of your opponents' ratings.

    Again, this may not be an absolute principle but I believe it has to be in use to a significant degree. It's only common sense that you can't have a high rating just because you played a few really strong opponents nor can you have a low rating just because you played a few really weak opponents (although naturally, if you mainly played weak opponents, of course that will tend to limit how high a rating you can have).

    This principle is what provides the exception to calculating, or verifying, a team's rating in the simple manner I mentioned above -- finding the average rating of their opponents, taking the win percentage, and applying the corresponding rating differential per the Albyn Jones scale.

    In the past, when I would do these "re-calculations", I would "throw out" results at the extremes. Specifically, I would "throw out" wins against very low rated teams that would otherwise result in a lower rating AND "throw out" losses against much higher-rated teams that would otherwise result in a higher rating. In general, this worked well and resulted in "re-verified" rating estimates that were closer to the actual ratings that Prof. Albyn Jones published.


    ***
    simplified example:
    Team A plays 15 games against teams all rated 1600 and goes 10-5 against them. In addition, Team A plays 5 really weak opponents rated 1000, winning all 5.

    If you used all the games to do a simple "back of the envelope" rating calculation you'd get the following average for the opponents' ratings:

    15 x 1600 = 24,000
    5 x 1000 = 5000
    24,000 + 5,000 = 29,000
    average = 29,000 divided by 20 = 1450
    Team A's win percentage is 15/20 = .750 which corresponds to a rating differential of about 160 points (Prof Jones scale again). So rating = 1450 + 160 = 1610

    However, that rating estimate is obviously being dragged down by the games against the opponents rated 1000. If you "throw out" those games you get just the games against the teams rated 1600:
    average of opponents = 1600
    win percentage = (.667) which corresponds to a rating differential of 100 points
    Rating estimate = 1600 + 100 = 1700
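    If anyone wants to tinker with this, here's a simple Python sketch of the "back of the envelope" check, including the "throwing out" step. It assumes the odds-double-per-100-points version of the scale from above (which may not be exactly what Prof. Jones used), and the 400-point cutoff for what counts as a mismatch is just a number I picked for illustration:

    ```python
    import math

    def rating_diff_from_win_pct(p, scale=100.0):
        """Invert the assumed scale: win percentage -> rating differential."""
        return scale * math.log2(p / (1.0 - p))

    def back_of_envelope_rating(results, cutoff=400):
        """results: list of (opponent_rating, points), points = 1 win / 0.5 tie / 0 loss.
        First pass gives a rough rating; second pass throws out wins against
        much weaker teams and losses against much stronger teams."""
        def estimate(games):
            avg_opp = sum(r for r, _ in games) / len(games)
            win_pct = sum(p for _, p in games) / len(games)
            return avg_opp + rating_diff_from_win_pct(win_pct)

        rough = estimate(results)
        kept = [(r, p) for r, p in results
                if not (p == 1 and rough - r > cutoff)    # drop easy wins
                and not (p == 0 and r - rough > cutoff)]  # drop expected losses
        return estimate(kept)

    # Team A: 10-5 against teams rated 1600, plus five wins over teams rated 1000
    games = [(1600, 1)] * 10 + [(1600, 0)] * 5 + [(1000, 1)] * 5
    print(round(back_of_envelope_rating(games)))  # ~1700 once the weak wins are thrown out
    ```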

    "throwing out" those results to the low rated teams makes a significant difference obviously and it only makes sense to do so to get a true sense of Team A's playing strength.
     
  6. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    What my system for calculating home field advantage does is test different rating adjustments for games between closely rated teams. It considers the 600 most closely rated games and the 1,200 most closely rated games. If the adjustment is too small, the home team wins more games than its game-site-adjusted rating says it should. If the adjustment is too large, the away team wins more games than it should. What I look for is the adjustment in the middle, where home and away teams both perform as their game-site-adjusted ratings say they should be expected to perform. I don't think there's anything wrong with my system; in fact, I have a lot of confidence in it.
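    For anyone curious about the mechanics, here is a rough Python sketch of the idea. The data layout is made up and my actual process has more steps, but this is the gist of the search:

    ```python
    # games: the most closely rated games only, each as
    # (home_rating, away_rating, home_goals, away_goals). Hypothetical layout.
    def excess_home_results(games, adj):
        """Games the home team won despite a lower game-site-adjusted rating,
        minus games it lost despite a higher one. Positive suggests the
        adjustment is too small; negative suggests it is too large."""
        excess = 0
        for home_r, away_r, hg, ag in games:
            home_favored = (home_r + adj) > (away_r - adj)
            if hg > ag and not home_favored:
                excess += 1
            elif ag > hg and home_favored:
                excess -= 1
        return excess

    def find_home_adjustment(games, candidates):
        """Candidate adjustment whose excess is closest to zero, i.e. where home
        and away teams perform as their adjusted ratings say they should."""
        return min(candidates, key=lambda a: abs(excess_home_results(games, a)))

    # e.g. find_home_adjustment(closest_600_games, [x / 1000 for x in range(100)])
    ```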

    My calculations are based only on games involving the top 60 teams (since those are the ones that matter for NCAA Tournament purposes). If the expected win percentage for the adjustment my system comes up with is lower than you expected, perhaps it confirms something I have felt for a long time, based both on personal observation and on looking at the numbers: when high level teams play each other, home field advantage, at least in Division I women's soccer, is not that great. Yes, it matters, but all the talk about the disadvantage of having to travel (for example, among Big 10 teams) most likely overstates the disadvantage.
     
  7. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Kolabear, regarding your last post on this thread, based on the formulas I've seen for Elo systems, your point B always is the case. In other words, your rating does not change if you beat teams you're supposed to beat and lose to teams you're supposed to lose to. The only time your rating changes is when there's a difference between the result your current rating says you should get and the result you actually get. In this respect, an Elo system is completely different from the RPI, although it's generally true even for the RPI that if you play a much stronger team than you are and lose, or play a much weaker team than you are and win, your rating change will be fairly small.
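    To illustrate the point, here is the standard game-by-game Elo update in Python. As I understand it, Jones' system is not literally a game-by-game update like this, and the k value here is an arbitrary illustration, but the principle is the same. I'm using the odds-double-per-100-points scale kolabear described above:

    ```python
    def elo_update(rating, opp_rating, actual, k=32, scale=100.0):
        """actual: 1 for a win, 0.5 for a tie, 0 for a loss. The rating moves
        only to the extent the actual result differs from the expected result."""
        expected = 1.0 / (1.0 + 2.0 ** ((opp_rating - rating) / scale))
        return rating + k * (actual - expected)

    # Beat a team you were already expected to beat about .889 of the time:
    print(round(elo_update(1700, 1400, 1), 1))  # ~1703.6 -- barely moves
    # Lose that same game:
    print(round(elo_update(1700, 1400, 0), 1))  # ~1671.6 -- a big drop
    ```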
     
    kolabear repped this.
  8. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    I went through my system for determining home field advantage in terms of rating points and used a little better group of closely rated teams (enough to give a reasonable sample at the most closely rated opponents end of the scale) and produced a slightly different number for Massey: on average, the home team's rating should be 0.041 higher than its otherwise calculated rating and the away team's rating should be 0.041 lower, for a total rating advantage from home field of 0.082.

    I then looked at the spreads in Massey's ratings (the difference between the high rating and the low rating) for the five years from 2010 to 2014 and determined his average spread, which is 3.762. I did the same for Jones for the three years from 2007 to 2009 and determined his average spread, which is 2,197. (For Jones, I also have 2006 ratings, but they are from a special run he did for me that, instead of starting from last year's ratings, started all teams with a common prior of 1400, and although the spread was in the range of the other three years, I was a little reluctant to use it because of its use of a common prior.)

    I then determined what Massey's 0.082 total rating from home field advantage would mean on Jones' scale. It comes out to 47.8 Jones points, which is just about what Jones says is the rating point value of having home field.
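    For those who want to check the arithmetic, the conversion is just a proportion based on the two systems' average spreads. With the rounded figures shown here it comes out to about 47.9; the small difference from the 47.8 above is presumably just rounding:

    ```python
    # Converting Massey's home field advantage to the Jones scale, assuming the
    # two scales are related by the ratio of their average rating spreads.
    massey_hfa_total = 0.082   # total home field advantage on Massey's scale
    massey_spread = 3.762      # average high-to-low spread, Massey, 2010-2014
    jones_spread = 2197        # average high-to-low spread, Jones, 2007-2009

    print(round(massey_hfa_total * jones_spread / massey_spread, 1))  # ~47.9
    ```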

    Actually, Jones over time identified three different values for home field. In his system description, he said:

    "The average home field advantage in collegiate soccer is around 50 rating points for both men and women. Thus if the home team is rated 50 points higher than the visitor, then the effective rating difference is about 100 points."
    On the other hand, in his regularly published ratings in 2007, he stated that "the average home field advantage for women's teams is 60 rating points." And in 2009 in his regularly published ratings, he stated that "the average home field advantage for women's teams is 55 rating points." It's not surprising to me that the number he assigned to home field advantage changed within the range of these numbers. It's probably due to how the data came in from season to season.

    In this context, the rating value my system identified for Massey, 0.082, which converts to 47.8 for Jones, is right about where it should be.
     
  9. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #9 cpthomas, Dec 1, 2014
    Last edited: Dec 1, 2014
    I'm going to use this post to archive some of Professor Jones' statements about his Elo system. Some of the information is in response to a discussion on the 2014 NCAA Tournament Bracket thread, but I think this is a better location for this more generic information.

    First, from Jones' "Rating System Description," I already included, in the prior post, some information about the value of home field advantage in his system and how to apply that value in a game between two teams to determine their "effective" rating difference. Here are a couple more bits of significant information from the Description:

    "In the rating tables, the column labeled SE (standard error) is an estimate of the uncertainty in the rating. You should think of the rating as (R+-SE)."

    "The ratings ... depend on what is known as a prior distribution, based on the previous year end-of-season ratings. Early in the season this will tend to produce ratings close to the prior or seed ratings, but as the results accumulate during the season, the rating will be more dependent on the current results and less dependent on the seed values. Most soccer programs have considerable year-to-year continuity, and this helps produce accurate ratings as early as possible in the season. For teams that undergo radical change between years, the ratings will be able to respond to the accumulation of new results during the season, producing accurate ratings for all teams by the end of the season."
    I'll add a comment here related to the immediately preceding paragraph. In the 2006 ratings Jones ran for me with a common prior of 1400, the ratings had Navy ranked at #5 with a record of 21-1-0 and a rating of 1918 but an SE of 141. In a private email to me, Jones addressed the Navy rank as follows:

    "I ran [the 2006 data] giving all teams the common seed rating of 1400. The most notable difference [from his published ratings] at the top of the table is the presence of Navy. Note the unusually large standard error for Navy: there is a lot of uncertainty. Navy is a good example of the advantage of informative seeding. No way is Navy the number 5 team in the country. They played a weak schedule, with an overabundance of home games. They have one quality win (Penn State), and a loss to Bucknell. .... There is, by the way, serious statistical theory supporting my faith in the seeded ratings."
    Apart from the importance of starting with an "informative seeding," one of the things I get from this is that the starting seeding continues to have an effect on ratings at the end of the season. That conforms to the kinds of formulas that Elo systems use. In fact, although the effects of past seasons diminish towards zero over time, from a purely mathematical point of view they never actually reach zero. The following, from a private email, confirms this:

    "The seeding doesn't totally wash out during the season, but at the end it matters only in the low order of digits -- I have done validation studies starting all teams at the same seed for comparison. Thus the seeding really helps in the early part of the season, and doesn't matter much when there is more data."​

    And more regarding SE, from a private email:

    "You are right, SE represents the standard error of the estimated rating, which is indeed just another name for the standard deviation of the sampling distribution of the estimate. The interval R +/- SE should contain the correct rating 68% of the time; R +/- 2*SE should be right about 96% of the time. Actually, both statements depend on the rating having roughly a normal (bell-shaped curve) distribution. That will be less accurate for teams with extreme records (all wins or all losses)."
    To give an idea of the size of the SEs, here they are for the top 10 teams at the end of the 2007 season, in the format rating/SE:

    1. UCLA 2044/84
    2. North Carolina 2025/86
    3. Portland 1977/85
    4. Penn State 1901/72
    5. Southern California 1893/73
    6. Texas A&M 1892/75
    7. Stanford 1880/68
    8. Notre Dame 1879/75
    9. San Diego 1861/71
    10. Florida State 1855/69
    As you can see, at the outer edge, the SEs are larger. Inside the top few, however, the SEs generally run in the same range. Thus the high and low SEs for the teams ranked from 51 through 60 were 73 and 64 respectively.
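    As an illustration of what the SEs let you do (this is my own illustration, not something from Jones): if you treat two teams' rating estimates as roughly normal and independent, you can estimate the chance that the higher rated team really is the stronger one.

    ```python
    import math

    def prob_a_stronger(rating_a, se_a, rating_b, se_b):
        """Probability that Team A's true rating exceeds Team B's, assuming the
        rating estimates are roughly normal and independent (my assumption)."""
        diff = rating_a - rating_b
        se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
        return 0.5 * (1.0 + math.erf(diff / se_diff / math.sqrt(2.0)))  # normal CDF

    # 2007: UCLA 2044 +/- 84 vs. North Carolina 2025 +/- 86
    print(round(prob_a_stronger(2044, 84, 2025, 86), 2))  # ~0.56 -- essentially a toss-up
    ```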

    And, regarding the difficulty of rating teams at the extreme ends of the ratings, from a private email:

    "The best and the worst teams are the hardest to rate accurately: if you win or lose all your games, it is not easy to tell how much better or worse than everyone else you are!"​

    To illustrate:

    In 2008, #1 Notre Dame's 23-0-0 record produced a rating of 2147 with an SE of 108; #2 UCLA's 20-0-2 record a rating of 2127 with an SE of 95; and #3 Portland's 19-1-0 record a rating of 2116 with an SE of 99.

    In 2009, #1 Stanford's 18-0-0 record produced a rating of 2272 with an SE of 114; and #2 Portland's 16-1-0 record a rating of 2105 with an SE of 112.

    Also from private emails:

    "Small differences between teams, say 10 or 20 rating points, mean that the teams are indistinguishable statistically, regardless of the difference in rank."

    "[This year we have three teams] in positions 3, 4, and 5. The ratings differ by 10 points or less, which means that statistically speaking, we don't have any evidence that one is better than the other. The rank order is completely random. .... [T]he difference between ratings and rankings is more striking in mid-table. [The following] teams are roughly equal based on their ratings: [He then lists the teams ranked 70 through 86, with the best ranked team having a rating of 1559 and the poorest a rating of 1532, for a total top to bottom rating gap of 27 points]."
    And, from another private email, in response to some information I had provided to him about the problem his and other systems had properly rating teams from the different geographic playing pools within a single system:

    "1) since most teams play most of their games within the region, there is always less information to maintain calibration across regions than within regions.

    "2) .... I haven't checked the calibration recently, but last time I did, it was fine. On the other hand, I didn't check for between-region variation, so there could be something there."​
     
  10. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    One further piece of information about Jones' Elo system:

    When he ran his 2006 "common prior" numbers for me, he assigned each team a common prior of 1400. This appears to be where one would start, in his system, if the system were to be "unseeded." This is supported by the median rating of his system for each of the 2006 through 2009 years:

    2006: 1420
    2007: 1386
    2008: 1355
    2009: 1357​
     
  11. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Just a teaser --

    Coming tomorrow or Wednesday, a comparison of which of these systems have the best correlations between ratings and actual game results and which do the best at rating conferences and regions within a single national system:

    Massey
    Jones
    Iteration 5 URPI
    Improved ARPI
    NCAA's ARPI
     
  12. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #12 cpthomas, Dec 9, 2014
    Last edited: Dec 10, 2014
    OK, for all of you hordes (maybe all three of us) who are really interested in how the different rating systems stack up, here's how they compare -- well, here's how Massey, Jones, Iteration 5 URPI, CPT's Improved ARPI, and the NCAA's ARPI stack up. You'll notice that I'm not including Bennett -- that's because Bennett publishes rankings but not ratings and it's impossible to do a legitimate comparison of systems unless I have ratings.

    First, some preliminary business:

    For the three RPI-related systems, I have eight years' data (2007 through 2014), meaning a data base of just under 25,000 games. This makes the RPI-related numbers the most reliable.

    For Massey, I have five years' data (2010 through 2014), meaning a little over 15,000 games.

    For Jones, I have three years' data (2007 through 2009), meaning just under 9,000 games. (Jones' data is missing some games since he used SIS, and during that time period there were some teams playing Division I that were not in SIS's Division I data base. I don't think this matters for comparison purposes.) Jones' numbers are the least reliable due to the smaller data base, and "yes," even with 9,000 games it can be difficult to do really good system comparisons. Nevertheless, I think the information I'll provide below is pretty reliable.
    Second, all I'm interested in, and therefore all I'll cover, is how well each system's end-of-regular-season (including conference tournament) ratings correlate with regular season (including conference tournament) game results. I don't care how well they do at predicting future game results (including results of the NCAA Tournament).

    Third, the correlations I'm going to report are after taking game locations into account. The way my correlation system runs, I identify the average adjustment to the home team's and the away team's ratings that is needed to incorporate the effect of the game's location into the ratings as applied to that game's result. For example, for the Adjusted RPI, the appropriate rating adjustment is 0.006, meaning that for a game I adjust the home team's ARPI upward by 0.006 and the visiting team's ARPI downward by 0.006. Or, put differently, for the ARPI home field advantage is worth a rating difference of 0.012. For the five systems I'm comparing, here are the adjustments:

    Massey: 0.041 (HFA worth 0.082)
    Jones: 34 (HFA worth 68)
    ARPI: 0.006 (HFA worth 0.012)
    Iteration 5 URPI: 0.006 (HFA worth 0.012)
    CPT's Improved ARPI: 0.007 (HFA worth 0.014)
    Kolabear probably will notice a couple of things. First, he'll notice that my tests say that HFA for Division I women's soccer for 2007-2009 using Jones' system was 68 rather than the number Jones gave, which variously was 60, 55, or 50. My 68 is not that far off from his 60, which he gave for 2007. The differences between his numbers and mine probably, more than anything, indicate that the measurement of home field advantage is not an exact science. For example, it depends, when looking at closely rated games, on how close you want the games to be, which in turn affects the size of the data set you are looking at. Second, he may notice that 0.041 on Massey's scale using his system is not the same as 34 on Jones' scale using Kolabear's Massey/Jones Elo-attempt system. What this suggests to me is that something is not right in Kolabear's method for converting Massey's numbers into a Jones-like Elo system (or that Massey's system is not an Elo-like system, if that would make a difference). In any event, although these subtleties may matter to me and maybe to Kolabear, for purposes of the comparisons in this post I don't think they make much difference.

    With that said, here we go:

    Correlation with Game Results

    For this comparison, I look at how well a system's ratings correlated with game results. In other words, did the higher rated team win, tie, or lose? I do an overall look, and I also break the numbers down for each system based on the closeness of ratings between opponents. For example, what is the system's correlation for the 1,500 most closely rated games? As a general rule, it is the correlation rate for closely rated games that is considered the best measure of how well a system performs.
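    For those interested in the mechanics, here is a simplified Python sketch of the correlation check. The data layout is hypothetical, and my actual process handles some details (such as ties in adjusted ratings) differently:

    ```python
    def correlation_breakdown(games, adj):
        """games: (home_rating, away_rating, home_goals, away_goals).
        Returns the shares of games in which the team with the higher
        game-site-adjusted rating won ("correct"), tied, or lost."""
        correct = tie = loss = 0
        for hr, ar, hg, ag in games:
            home_favored = (hr + adj) > (ar - adj)
            if hg == ag:
                tie += 1
            elif home_favored == (hg > ag):   # the favored team won
                correct += 1
            else:
                loss += 1
        n = len(games)
        return correct / n, tie / n, loss / n

    def closest_fraction(games, adj, fraction=0.10):
        """The same breakdown, restricted to the most closely rated games."""
        ranked = sorted(games, key=lambda g: abs((g[0] + adj) - (g[1] - adj)))
        return correlation_breakdown(ranked[: int(len(ranked) * fraction)], adj)
    ```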

    Overall Correlation

    Looking first at all games, here are the systems' correlations:

    Iteration 5 URPI: 73.2% correct, 10.7% incorrect tie, 16.1% incorrect loss

    Jones: 73.0% correct, 10.9% incorrect tie, 16.1% incorrect loss

    CPT's Improved ARPI: 72.8% correct, 10.7% incorrect tie, 16.5% incorrect loss

    Massey: 72.7% correct, 10.6% incorrect tie, 16.6% incorrect loss

    NCAA's ARPI: 72.5% correct, 10.7% incorrect tie, 16.8% incorrect loss
    To give perspective to these numbers, a season comprises roughly 3,000 games. So, for a season, a difference of 0.1% (0.001) is roughly 3 games. Thus, comparing Iteration 5 and Jones, whose "correct" correlation rates differ by 0.2%, on average Iteration 5 gets 6 more game results out of 3,000 games "correct" than Jones does. The bottom line, to me, is that when looking at all games, there is very little difference among the rating systems in terms of how well ratings correlate with game results.

    Correlation for Most Closely Rated Games

    If I look only at the most closely rated games, the comparison gets a little trickier. This is due to the ARPI-related systems having a 25,000 game data base, Massey having a 15,000 game base, and Jones having a 9,000 game base. I've compared the systems' correlation scores for their closest 5% and 10% of games. These are different numbers of games for each system, but it's the fairest comparison. As indicated above, this means the ARPI-related systems' correlation scores are the most reliable and Jones' the least reliable. Nevertheless, I think the comparison is pretty fair:

    Most Closely Rated 10% of Games

    Iteration 5 URPI: 44.6% correct, 16.3% incorrect tie, 39.1% incorrect loss

    Jones: 43.3% correct, 17.9% incorrect tie, 38.8% incorrect loss

    CPT's Improved ARPI: 46.3% correct, 15.6% incorrect tie, 38.1% incorrect loss

    Massey: 44.3% correct, 16.3% incorrect tie, 39.3% incorrect loss

    NCAA's ARPI: 46.1% correct, 15.8% incorrect tie, 38.1% incorrect loss

    Most Closely Rated 5% of Games

    Iteration 5 URPI: 43.2% correct, 15.8% incorrect tie, 41.0% incorrect loss

    Jones: 43.7% correct, 19.2% incorrect tie, 37.1% incorrect loss

    CPT's Improved ARPI: 45.1% correct, 16.5% incorrect tie, 38.4% incorrect loss

    Massey: 42.8% correct, 16.0% incorrect tie, 41.2% incorrect loss

    NCAA's ARPI: 42.9% correct, 17.6% incorrect tie, 39.5% incorrect loss
    Looking at these numbers, I'd say that I don't see a big difference between Massey and Jones -- Jones does better for the closest 5% of games, but Massey does better for the closest 10%. The three RPI-related systems appear to do a little better than Massey and Jones, with CPT's Improved ARPI doing the best.

    I've always had a question, however, whether the correlation results would be different for highly ranked teams than for other teams. To answer this question, my correlator also looks at correlation results for the top 60 teams (in each system's ratings). Here are the results:

    Iteration 5 URPI: 77.9% correct; 9.7% incorrect, tie; 12.4% incorrect, loss

    Jones: 77.8% correct; 10.1% incorrect, tie; 12.1% incorrect, loss

    CPT's Improved ARPI: 77.9% correct; 9.8% incorrect, tie; 12.3% incorrect, loss

    Massey: 78.1% correct; 9.1% incorrect, tie; 12.8% incorrect, loss

    NCAA's ARPI: 78.0% correct; 9.6% incorrect, tie; 12.4% incorrect, loss
    For the top 60 teams in this list, a difference of 0.1% represents approximately 1.2 games out of 1,200 per year. Thus, looking at all games played by top 60 teams, there is almost no difference among these rating systems. Unfortunately, due to the limited number of games, it is not possible at this point to do a reasonable test for the most closely rated games among top 60 teams. Of interest is that the correlation rates of all systems are significantly better for games among the top 60 teams than for games involving all teams. This most likely is because teams' actual strengths are dispersed in a bell-curve fashion, with wider disparities at both the highly rated and lowly rated edges of the curve.

    So, to summarize thus far as to correlations of the different systems' ratings with game results, it certainly is not demonstrable that either Massey's or Jones' ratings correlate better with game results than the RPI-related systems, either for all games or for closely rated games. At best, Massey's and Jones' correlations are about the same as the RPI-related systems', and the RPI-related systems' correlations -- based on games among closely rated teams -- actually may be better.

    That, however, is not anywhere near the end of the discussion, as my next post will address how well the systems perform at rating the different conferences in relation to each other and how well they rate the different regional playing pools in relation to each other.​
     
  13. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    For anyone who read the preceding post comparing the different systems' rates of correlation with game results, I bit the bullet and did calculations for each system's closest 5% and 10% of games so that I could do an apples-to-apples comparison. I revised the preceding post to show those results. The conclusion is about the same, but the comparison has more validity.
     
  14. Cliveworshipper

    Cliveworshipper Member+

    Dec 3, 2006
    Make that 4 of us.

    I do have a couple questions.

    I'm still trying to digest a couple things...

    1) Does HFA correlate at all with individual teams' historical HFAs? I'll note that some teams have much larger HFAs than others.

    2) I'm still wondering how GAA did compared to other criteria in predicting wins. I know you have the data :)
     
  15. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #15 cpthomas, Dec 10, 2014
    Last edited: Dec 10, 2014
    The Home Field Advantage I apply for teams is an average for all teams. I know that Massey computes HFA by team. Jones did it as an average for all teams. To me, trying to compute HFA by team sounds statistically unrealistic -- you have 20 data points in a year, 40 in two years, etc. And, in a major portion of those games, the opponents' strengths are different enough that HFA won't affect outcomes, so that the number of meaningful data points is significantly fewer than 20 or 40, etc. I'm conservative about what I think is enough data, and single team numbers aren't enough to satisfy me.

    Goals Against Average does not work when you're looking at all teams. There are too many differences in teams' strengths of schedule. To illustrate, based on GAA, Northeastern this year was the #2 team in the country.

    When you're down to the final four, or maybe final eight, however, GAA appears to be an indicator. In terms of predicting the College Cup winner, it looks like a good predictor -- not that the team with the best GAA will win the cup but rather that a team with a GAA over 0.8 will not win the Cup. I suggested going into the quarter-finals that neither Florida nor Texas A&M were good choices to win the Cup since their GAAs were over 0.8. My suggestion proved correct.
     
  16. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #16 cpthomas, Dec 13, 2014
    Last edited: Dec 17, 2014
    For the five rating systems I described above in terms of how well their ratings correlate with game results, how do those systems do in terms of being able to rate the different conferences' teams within a single national system? Do they do it fairly or is there an identifiable pattern of discrimination?

    I'll start with the NCAA's 2012 ARPI, the system currently in effect. The following chart is based on data for the eight years from 2007 through 2014. It shows how well each conference's teams perform in relation to their ratings, with the conferences arranged by average ARPI, the highest average ARPI conference on the left and the poorest on the right. I'm not going to go into a lot of detail here, as I've done that elsewhere, but it hopefully will be sufficient to say that if the yellow trend line (closest 3,000 games) in particular, but also the pink trend line (all games), runs horizontally across the chart at the 100% line, then there is not a pattern of discrimination based on conference average rating. If the lines are on a slant, there is a pattern of discrimination.
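    For anyone who wants to see roughly what is behind these charts, here is a simplified Python sketch. The performance measure and the trend-line fit are simplifications of what I actually do, and the data layout is hypothetical, so treat it as an illustration only:

    ```python
    from collections import defaultdict

    def conference_performance_index(games):
        """games: (conference, team_rating, opp_rating, points), points = 1/0.5/0.
        For each conference, the share of games in which its teams did as well as
        or better than the ratings said they should, scaled so 100% = average."""
        met = defaultdict(int)
        total = defaultdict(int)
        for conf, team_r, opp_r, points in games:
            expected_win = team_r > opp_r
            met[conf] += (points == 1) if expected_win else (points > 0)
            total[conf] += 1
        overall = sum(met.values()) / sum(total.values())
        return {c: 100.0 * (met[c] / total[c]) / overall for c in total}

    def trend_slope(avg_rating_by_conf, index):
        """Least-squares slope of the performance index against conference average
        rating; a slope near zero means no pattern of discrimination."""
        xs = [avg_rating_by_conf[c] for c in index]
        ys = [index[c] for c in index]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sum((x - mx) ** 2 for x in xs)
        return num / den
    ```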

    With that said, here is the chart for the NCAA's 2012 ARPI:

    [chart image]

    As the chart shows, the NCAA's 2012 ARPI systematically discriminates against stronger conferences and in favor of weaker conferences. In other words, stronger conferences' teams perform better, on average, than their ratings say they should and weaker conferences' teams perform more poorly than their ratings say they should.

    Here is the same chart for CPT's Improved ARPI:

    [chart image]

    As this chart shows, CPT's Improved ARPI still discriminates against stronger conferences and in favor of weaker ones, but the discrimination is quite a bit less than for the NCAA's 2012 ARPI.

    Here is the chart for the Iteration 5 URPI:

    [chart image]

    As the chart shows, the Iteration 5 URPI discriminates less than CPT's Improved ARPI and quite a bit less than the NCAA's 2012 ARPI. In fact, the yellow trend line indicates that the Iteration 5 URPI may discriminate very, very slightly in favor of the stronger conferences.

    And, here is the chart for Jones. It is based on three seasons' games, unlike the above three charts, which are based on eight years' games. Therefore, rather than using the closest 3,000 games as a test, it uses the closest 1,500, which is roughly the same proportion of games for Jones as 3,000 is for the RPI variations. Here's the chart:

    [chart image]

    As the chart shows, Jones shows a slight discrimination, when looking at closely rated games, against stronger conferences and in favor of weaker ones, about on a par with CPT's Improved ARPI and significantly less discrimination than the NCAA's 2012 ARPI. On the other hand, when looking at all games it very, very slightly discriminates in favor of stronger conferences. Jones is not quite as good on this issue as the Iteration 5 URPI.

    And, here is the chart for Massey. It is based on five years' games. Therefore rather than using the closest 3,000 games as a test, it uses the closest 2,100, which is roughly the same proportion of games for Massey as 3,000 is for the RPI variations and 1,500 is for Jones. Here's the chart:

    [chart image]

    As this chart shows, Massey has virtually no pattern of discrimination based on conference strength, particularly when looking at games between closely ranked teams. It thus is the superior system in terms of avoiding discrimination based on conference average strength.

    My next post will be similar to this one, except that it will address the extent, if any, to which the systems discriminate based on the average strength of the regional playing pools.
     
    kolabear repped this.
  17. kolabear

    kolabear Member+

    Nov 10, 2006
    los angeles
    Nat'l Team:
    United States
    Just by the way, I'm posting some thoughts on the use of FIFA ratings as an Elo rating in the International forum (FIFA rankings). Obviously somewhere where (unfortunately for cpthomas!) the RPI doesn't apply. My main conclusion is we can use the FIFA ratings as a quasi-Elo system for calculating rough win probabilities, estimating relative team playing strength and things like strength-of-schedule.
     
  18. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Maybe you could explain the FIFA system here?

    There is someone who does Elo ratings for national teams: http://www.eloratings.net/
     
  19. kolabear

    kolabear Member+

    Nov 10, 2006
    los angeles
    Nat'l Team:
    United States
    They don't do ratings for the women, though, at eloratings.net -- I noticed that a couple weeks ago and that was one of the reasons why I went through the exercise I did (in the International Forum) to see if the FIFA ratings could be used as a quasi-Elo rating for women. I'm satisfied it works reasonably well from a fan standpoint.
     
  20. Cliveworshipper

    Cliveworshipper Member+

    Dec 3, 2006
    There is a good discussion of the differences between Elo and the FIFA Women's ranking here:

    http://opisthokonta.net/?p=448
     
  21. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    For the five rating systems I've been comparing, the last piece of the puzzle is to answer how those systems do in terms of being able to rate the different regional playing pools' teams within a single national system. Do they do it fairly or is there an identifiable pattern of discrimination?

    As I've discussed elsewhere, the regional playing pools have no formal status. Rather, they are five primarily geographic pools of teams. Each team in a pool plays a majority of its games against other teams in the pool, except for a handful of teams that don't play a majority of their games in a single pool. For those few teams, a team is in the pool in which it plays the plurality of its games. To give an idea of the extent to which there are inter-region games, in 2014 the Middle regional pool's teams played 75.2% of their games against Middle region opponents, the Northeast pool 82.6%, the Southeast pool 75.5%, the Southwest pool 73.9%, and the West pool 81.8%. With the regions being this pool-centric, the question is how the limited linkage between pools affects each system's ability to work as a national ranking system.

    I'll start with the NCAA's 2012 ARPI, the system currently in effect. The following chart is based on data for the eight years from 2007 through 2014. It shows how well each regional pool's teams perform in relation to their ratings, with the regions arranged by average ARPI, the highest average ARPI region on the left and the poorest on the right. I'm not going to go into a lot of detail here, as I've done that elsewhere, but it hopefully will be sufficient to say that if the yellow trend line (closest 3,000 games) in particular, but also the pink trend line (all games), runs horizontally across the chart at the 100% line, then there is not a pattern of discrimination based on regional pool average rating. If the lines are on a slant, there is a pattern of discrimination.

    With that said, here is the chart for the NCAA's 2012 ARPI:

    [chart image]

    As this chart shows, the NCAA's 2012 ARPI, currently in effect, does a poor job of rating teams from the regional pools in a single national system. It discriminates against teams from stronger regions and in favor of teams from weaker regions. The extent of the discrimination is large.

    Here is the chart for the Iteration 5 URPI:

    [chart image]

    As this chart shows, the Iteration 5 URPI also discriminates against teams from stronger regional pools and in favor of teams from weaker regional pools. Although its discrimination still is significant, it is significantly less than for the NCAA's 2012 ARPI.

    Here is the chart for CPT's Improved ARPI:

    [chart image]

    This chart shows that CPT's Improved ARPI goes one step better than the Iteration 5 URPI, although it still discriminates against stronger regional pools and in favor of weaker regional pools. And, it is quite significantly better than the NCAA's 2012 ARPI.

    Here is the chart for Jones:

    [chart image]

    As this chart shows, Jones is a little better than CPT's Improved ARPI for closely rated games and is just about perfect when looking at all games. Since the most closely rated games are the best test, Jones, as stated, is a little better than CPT's Improved ARPI, is better than the Iteration 5 URPI, and is very significantly better than the NCAA's 2012 ARPI.

    Finally, here is the chart for Massey:

    [chart image]

    This chart shows that Massey is very close to perfect in rating the regional pools' teams fairly in relation to each other. Based on the games between the closest rated opponents, it very, very slightly discriminates in favor of stronger regional pools and against weaker ones, but the extent of discrimination is minimal, is better than for Jones and CPT's Improved ARPI, is much better than for the Iteration 5 URPI, and is much, much better than for the NCAA's 2012 ARPI.

    In an upcoming post, I'll give some of my personal thoughts about what this all means.
     
  22. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #22 cpthomas, Dec 15, 2014
    Last edited: Dec 16, 2014
    Looking at the three sets of information I provided in preceding posts:

    1. General Correlation of Ratings with Game Results: There really is very little to distinguish among the three variations of the RPI, Jones, and Massey in terms of general correlation of the ratings with the results of the games from which they are derived. Looking at all games for all teams, the correlation rates are very similar ranging, for the frequency at which the higher rated team wins, from 72.5% to 73.2%. For all games involving the top 60 teams, the correlation rates also are very similar, ranging from 77.8% to 78.1%. The only place where I see any possibly meaningful distinction is when looking at games involving closely rated teams: For the 5% most closely rated games, CPT's Improved ARPI may be meaningfully better than the other systems; and for the 10% most closely rated games, CPT's Improved ARPI and the NCAA's ARPI may be meaningfully better than the others. A look at all these numbers, however, suggests to me that whatever mathematical rating system one uses, whether a theoretically and academically correct system or a theoretical and academic outcast like one of the RPI variations, the differences among them, in terms of general correlation of ratings with game results, are going to be only marginal.

    2. Conference by Conference Correlation of Ratings with Game Results: When looking at how well systems rate teams on a conference by conference basis, however, there is a different story. One would expect that different conferences will have different rates at which their results correlate with their teams' ratings. No system is going to be perfect, so if system errors (I'm not speaking here of upsets, but of actual rating errors) occur randomly, then some conferences will end up performing better than their ratings say they should (i.e., are underrated) and other conferences will end up performing more poorly (i.e., are overrated). There will be, however, no pattern to the over- and under-ratings. On the other hand, if there is a bias built into the rating system, one of the places it may show up is in an identifiable pattern of conference over- and under-ratings.

    The conference charts show a pattern of over- and under-ratings for the RPI variations and demonstrate that the RPI has a built-in bias against stronger conferences and in favor of weaker conferences. And, of the variations, the NCAA's 2012 ARPI -- the RPI variation the NCAA currently uses -- is the most biased. Jones also shows this bias, at about the same level as CPT's Improved ARPI, with the Iteration 5 URPI also showing some bias but less than the other RPI variations and Jones. Somewhat surprising to me, however, Massey shows almost no bias. This is somewhat surprising to me because I long have wondered whether there are enough inter-conference games to allow any rating system to avoid a bias against stronger conferences and in favor of weaker conferences. Massey suggests that there are enough.

    3. Region by Region Correlation of Ratings with Game Results: Likewise the regional playing pool charts show a pattern of over- and under-ratings for the RPI variations and demonstrate that the RPI has a built-in bias against stronger regional pools and in favor of weaker pools. And again, of the variations, the NCAA's current variation of the RPI is the most biased. The Iteration 5 URPI comes next and then the Improved ARPI. (It is not surprising that the Improved ARPI is the best of the RPI variations, as part of its formula awards bonuses and penalties based on regional playing pools' relative strength or weakness.) Jones is quite good at rating the regional pools without a bias, and on this one too, Massey is the best. Jones and Massey both show that there are enough inter-region games for a properly designed rating system to avoid a bias against stronger regions and in favor of weaker regions.

    My Editorial Comments: It is inconceivable to me that the NCAA's RPI staff does not know about the RPI's conference and regional playing pool bias problems (independently of the fact that I have advised them of it). I believe it is pretty clear that they either don't care or, I believe more likely, they actually like the bias. The reason they would like the bias is that it helps avoid, even if only a little, NCAA Tournament participants coming from only a small group of conferences or from being overloaded in one region of the country. Of course, doing this intentionally, or even unintentionally, would be inconsistent with the NCAA's announced policy that Tournament participants, outside the Automatic Qualifiers, are to be those teams that have demonstrated through their records that they are the best teams. Nevertheless, I believe it is quite clear and known to the NCAA RPI staff that this is exactly what they're doing. I do believe that this does not make a big difference in who gets at large selections and in seeds. On the other hand, I believe it does make some difference (and I have provided evidence of that elsewhere). In any event, at a minimum, the NCAA simply doesn't care that the RPI is biased.

    Regarding Massey and Jones, both rely on starting teams, for a new season, at a prior rating. Essentially, their systems continue from year-to-year rather than starting from scratch each year. The NCAA's RPI staff has explicitly rejected the use of prior ratings. What Massey and Jones will tell you, however, is that the use of prior ratings, from a statistical perspective, actually makes their systems better at measuring how teams have done this year. Or, put differently, it makes their systems better than the RPI at measuring teams' performance over the course of this season. Since the sole purpose of the NCAA having a rating system is to measure how teams have performed over the course of this season, it should not matter to the NCAA how a system gets there. If it actually does the best job of measuring, the NCAA should use it.

    I believe the NCAA's real reason for opposing the use of prior ratings is a fear that people will think this is unfair and that what a system with prior rating start points is measuring is performance over past seasons rather than performance during this season. They won't understand that the use of prior ratings actually helps a system do a better job of measuring performance during this season. I find this ironic. Because people are ignorant about the role of prior ratings in rating systems, the NCAA won't consider using a system that uses prior ratings. In other words, because people are ignorant, colleges' sport-regulating body won't do what's right. The alternative, of course, would be to lead people out of ignorance through education. The irony is that colleges' representative body rejects that approach to the problem.

    There is, however, another problem with Massey and Jones, which is that they are proprietary systems. In other words, at least aspects of them are secret. For decades, the NCAA itself kept part of the RPI formula (the bonus and penalty part of the formula) secret, so in the past that might not have been a legitimate objection from the NCAA. Currently, however, although not publicly announced, the NCAA provides enough information that the public can know all aspects of the formula. I think this is a problem for Jones and Massey and for that reason, I do not believe the NCAA should use either of their systems. Nothing, however, prevents the NCAA from developing a new rating system similar to theirs that achieves comparable results -- general correlations just as good as the RPI's without a built in conference and regional playing pool bias.

    The bottom line is that the NCAA is knowingly and intentionally using a rating system that has a built in bias against stronger conferences and regional playing pools and in favor of weaker ones. And, so far, they refuse to change this either because they simply don't care or because they actually like the effects of the bias.
     
  23. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    In the course of comparing different rating systems, I've run across some pretty interesting numbers. They involve the relationships between home field advantage, conference games, and non-conference games.

    As a thumbnail explanation, here's the way my system works for determining the value of home field advantage:

    1. For the RPI (both conference and non-conference games), on average the higher rated team wins approximately 73% of games and either ties or loses approximately 27%. For the Non-Conference RPI, the proportions are approximately 68% and 32%. This is true regardless of the RPI or NCRPI variation, and Jones and Massey match the RPI variations.

    2. To look at the value of home field advantage, I look at the % of games home teams won where they had a higher rating and at the % of games home teams won or tied where they had a poorer rating. In other words, I look at the % of games where the home team did as well as or better than the rating system said it should have done.

    3. What this look shows is that some teams win more of the games the ratings say they should win than is average and also win or tie more of the games the ratings say they should lose than is average. I.e., they outperform their ratings. And conversely, some teams under-perform their ratings. The combined "games won that should have been won" and "games won or tied that should have been lost" percentages are above 100% for teams that are outperforming their ratings and are below 100% for teams that are underperforming their ratings.

    4. If home field didn't mean anything, then one would expect that for the RPI, home teams with higher ratings than their opponents would win 73% of games and lose or tie 27%, for a performance percentage of 100% -- in other words, they would meet the all-games average; and for the NCRPI, home teams would win 68% and lose or tie 32%, again for a performance percentage of 100%. Of course, home field advantage does mean something, so that is not what happens. Instead, for all rating systems the performance percentage for home teams is over 100% and for away teams is under 100%, which demonstrates that home field indeed is an advantage. (A rough sketch of this performance percentage calculation follows this list.)
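    Here is the rough Python sketch mentioned in item 4. The exact way the two percentages get combined is a simplification here, so treat this as an illustration of the idea rather than my precise method:

    ```python
    def performance_percentage(games, favored_win_rate):
        """games: (team_rating, opp_rating, points) from one side's point of view,
        points = 1 win / 0.5 tie / 0 loss. favored_win_rate is the all-games rate
        at which the higher rated team wins (e.g. 0.73 for the RPI). 100% means
        the side performed exactly as the ratings said it should."""
        won_favored = favored = 0
        saved_unfavored = unfavored = 0
        for team_r, opp_r, points in games:
            if team_r > opp_r:
                favored += 1
                won_favored += points == 1
            else:
                unfavored += 1
                saved_unfavored += points > 0   # won or tied a game it "should" lose
        ratio_favored = (won_favored / favored) / favored_win_rate
        ratio_unfavored = (saved_unfavored / unfavored) / (1.0 - favored_win_rate)
        return 100.0 * (ratio_favored + ratio_unfavored) / 2.0
    ```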
    This, however, is where it gets interesting -- at least to me. For the RPI variations, with one exception, the performance percentages for home teams range from 119.6% to 123.9%. On the other hand, for the NCRPI variations (NC = Non-Conference), the performance percentages for home teams range from 109.0% to 115.4%.

    For a lot of people this might mean ???? But for me, it says, "Wow! It looks like home field advantage has less value to the home team in non-conference games than it does in conference games." That's pretty interesting and raises the question, "If it's true that home field advantage is worth more in conference games than in non-conference games, why?"

    Any thoughts out there?

    [The one exception is CPT's Improved RPI, where the performance percentage for the Improved ARPI is about 113% and for the Improved ANCRPI likewise is about 113%. I don't know why this occurs, although perhaps it relates to the Improved RPI being oriented towards properly rating the different regional playing pools' teams in relation to each other through a system of regional playing pool adjustments, which is the one thing that distinguishes it from the other rating systems.]

     
  24. Hooked003

    Hooked003 Member

    Jan 28, 2014
    One possibility is that conference opponents are more familiar with each other. The fewer surprises in terms of players and styles, the more that small differences such as home-field advantage might matter. One way to test that thought would be to examine OOC match-ups that occur regularly such as Stanford/Santa Clara or Cal/Santa Clara.
     
  25. Cliveworshipper

    Cliveworshipper Member+

    Dec 3, 2006
    Did you also compute neutral field? Lots more of those in non-conference.

    I'll wager many non-conference match-ups are with nearby schools, so the issues of travel aren't necessarily there.

    Stanford sleeps in their own beds when they play SCU. Several of the big conferences are pretty spread out.
     
