Alternative Rating Systems

Discussion in 'Women's College' started by cpthomas, Nov 25, 2014.

  1. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    I'm done with identifying which games from 2007 through 2014 were overtime games (for use in determining whether rating systems should distinguish between regular time wins/losses and overtime wins/losses). Here's a little more detailed information than I provided in an earlier post. The info is, for each year, the % of games that went to overtime, the % that ended in ties, and the % that were overtime wins/losses. I continue to find the consistency of the numbers from year to year remarkable. It seems like the numbers provide a kind of commentary on the nature of soccer, at least so far as Division I women's soccer is concerned.

    2007: 20.6% of games went to overtime; 11.3% ended as ties; 9.3% ended as wins/losses

    2008: 20.9% overtime; 10.7% ties; 10.2% wins/losses

    2009: 20.5% overtime; 10.4% ties; 10.1% wins/losses

    2010: 20.7% overtime; 10.0% ties; 10.7% wins/losses

    2011: 20.9% overtime; 10.6% ties; 10.3% wins/losses

    2012: 20.9% overtime; 10.2% ties; 10.7% wins/losses

    2013: 22.0% overtime; 11.4% ties; 10.6% wins/losses

    2014: 21.6% overtime; 11.1% ties; 10.5% wins/losses
     
  2. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    I'm in the process of doing a gazillion computations involving different rating systems, the most important of which, to me at present, are variations of Elo systems. And a gazillion is not that much of an exaggeration: I just changed an Elo system formula in a way that involved changing just short of 3 million numbers (changing a home field advantage from being worth 66 points in the rating system to being worth 72). It's going to take a while, since part of what I'm doing will require me to go back and do a bunch of re-calculations, but it's fun.

    In the meantime, I'm wanting to have a little discussion about the scientific method as it relates to rating systems. I'm hoping kolabear will pitch in, and maybe we'll even get some comments from Gilmoy the scientist.

    So, here's what I'm wanting to discuss. Real professors will say that the RPI is a fanciful rating system, as there's no legitimate theory that underlies it. And, they'll say that a system such as an Elo system is a theoretically correct system. So, they pooh-pooh the RPI and opt for an Elo-type system or some other system that appears to have good theoretical underpinnings.

    But, isn't a theory just a theory? In other words, the theory (that an Elo-based or other such system is as reflective of reality as one can get) isn't necessarily reflective of reality. It has to be tested and proved consistent with reality. Until that happens, isn't it just a hypothesis? And, isn't something that is theoretically invalid (the RPI) just theoretically invalid? Doesn't it too have to be tested and proved inconsistent with reality?

    As applied to RPI v Elo-based systems: Suppose the RPI's ratings correlate better with game results than those of an Elo-based (or other "theoretically correct") system. According to the scientific process, doesn't that mean that the RPI is a better reflection of reality than the Elo-based or other system? And further, shouldn't the system that correlates better be considered the more scientifically established system for the sport whose results it's measuring? Isn't that what the scientific method is all about?

    And, I have a caution for commenters. Your comments need to work both ways. Suppose the RPI correlates better with game results. That's one possibility. Alternatively, suppose the Elo-based or other system correlates better. That's another possibility. Doesn't the logic need to be the same either way?

    This isn't necessarily an idle discussion. I'm trying to frame how I might describe the issue to the Women's Soccer Committee (or, ideally, to a group of math/science grad students, each designated by a member of the Committee).
     
  3. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    It may turn out that my hunch is wrong, but as I go through my work with an Elo-type system, I've fortuitously seen something that looks weird. It suggests:

    1. In games completed in regular time, home field advantage does not play a part. In fact, home field advantage may be a very slight negative in regular time games.

    2. But, statistically, there is a home field advantage. What my numbers are suggesting is that it's in overtime games (exclusively?) that home field advantage shows up.

    This will take a lot more work before I can have any confidence about what I've just said, but it's what my numbers are saying so far.
     
    kolabear repped this.
  4. Soccerhunter

    Soccerhunter Member+

    Sep 12, 2009
    Curious and interesting!
     
    kolabear repped this.
  5. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Here's a little interim report on my work with Elo-based systems in comparison to other rating systems. With the current Elo-based system I've been looking at, I've tried several versions. These include a version that treats all overtime games as ties, even though they actually are won or lost in overtime, and another version that treats an overtime win as half a win and an overtime loss as half a loss. I ran those overtime game experiments because it made sense to me that if a game goes to overtime and then is won or lost, it indicates the two teams are very close, and treating the games as ties, or at least as half wins and half losses, might yield more accurate ratings than simply treating them as wins and losses. It was a nice theory, but I got a negative result -- ratings are more accurate, at least with the Elo-based system I'm testing, if I treat them as wins and losses. So, I disproved my hypothesis, which I guess is a perfectly satisfactory result if I have the correct scientific mindset.
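
    To make the three overtime treatments concrete, here's a minimal sketch of how each could be encoded as the "actual result" value an Elo-style formula consumes. The specific numbers (0.75 for half a win, 0.25 for half a loss) are illustrative assumptions for the sketch, not necessarily the exact values I used.

        # Rough sketch: encoding a game result as the score an Elo-style update
        # consumes, under the three overtime treatments discussed above. The
        # 0.75/0.25 values for the "half" treatment are illustrative assumptions.
        def result_score(outcome, ot_treatment="full"):
            # outcome, from the rated team's perspective:
            # "win", "loss", "tie", "ot_win", or "ot_loss"
            base = {"win": 1.0, "loss": 0.0, "tie": 0.5}
            if outcome in base:
                return base[outcome]
            if ot_treatment == "full":   # overtime result counts as a full win/loss
                return 1.0 if outcome == "ot_win" else 0.0
            if ot_treatment == "tie":    # overtime result counts as a tie
                return 0.5
            if ot_treatment == "half":   # half a win / half a loss
                return 0.75 if outcome == "ot_win" else 0.25
            raise ValueError(ot_treatment)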

    For comparing systems, I use a mathematical grading system designed to be objective and to avoid the influence of personal preferences. It weights system accuracy (How frequently do results match ratings?) and system fairness in relation to conferences and regions (How close do teams' results come to their ratings when looking at teams by conference and by region?) equally. Within the system accuracy portion of the grading system, I consider both system accuracy for all games and system accuracy for games involving the Top 60 teams, with the two weighted equally. Within the conferences and regions portion of the grading system, I look at general fairness and at discrimination based on conference and region strength, with the two weighted equally. I also have two ways I look at general fairness, with the two weighted equally. Overall, I think it's a pretty good grading system that considers the things one should want to consider in comparing rating systems and that balances those things well.
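
    In rough code form, that nested 50/50 weighting looks something like the sketch below. This is only a schematic: it assumes all of the inputs already have been normalized to a common scale (higher = better), and it collapses the conference and region pieces into single fairness inputs.

        # Sketch of the nested 50/50 weighting described above. All inputs are
        # assumed to be pre-normalized scores on a common scale (higher = better).
        def system_grade(acc_all_games, acc_top60,
                         general_fairness_1, general_fairness_2,
                         strength_discrimination):
            accuracy = 0.5 * acc_all_games + 0.5 * acc_top60
            general_fairness = 0.5 * general_fairness_1 + 0.5 * general_fairness_2
            fairness = 0.5 * general_fairness + 0.5 * strength_discrimination
            return 0.5 * accuracy + 0.5 * fairness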

    The Elo system I've looked at so far in detail uses a K factor of 40. The K factor determines how much value to assign to a game in relation to the two opponents' ratings going into the game. I'll need to do more work trying other K factors, so there's lots more work to do.
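
    For anyone not familiar with how an Elo update works mechanically, here's a minimal sketch showing where the K factor enters. The 400-point scale in the expected-score formula is the conventional chess value and is an assumption for this sketch, not necessarily the scale my own formula uses.

        # Minimal sketch of a standard Elo-style update. K controls how far a
        # single game moves the ratings; "scale" is the conventional chess value.
        def expected_score(rating_a, rating_b, scale=400.0):
            return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

        def elo_update(rating_a, rating_b, actual_a, k=40.0):
            # actual_a is the result from team A's perspective (1 win, 0.5 tie, 0 loss)
            exp_a = expected_score(rating_a, rating_b)
            new_a = rating_a + k * (actual_a - exp_a)
            new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - exp_a))
            return new_a, new_b

    For example, under these conventional values, if a 1500-rated team beats a 1600-rated team, with K = 40 the winner gains about 26 points and the loser gives up the same amount; with a smaller K the swing is proportionally smaller.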

    But, with all of that, here's how the different systems I've worked on currently stack up:

    Massey is the best

    Next comes the Elo K40 system, some distance behind Massey. Typically with an Elo system, where there are home and away games, one would include an adjustment within the rating formula based on the game location. However, the Elo version that comes in second here does not include a game location adjustment. It also treats overtime wins and losses as full wins and losses.

    Third comes a group of systems, some distance behind Elo. They are, in order, Jones, Elo OT Half (OT Half means the formula treats an overtime win as half a win and an overtime loss as half a loss), Iteration 5 URPI, Elo HAN OT Half (the HAN meaning that the formula takes game location into consideration), and Elo HAN.

    Fourth comes another group of systems, some distance behind the third group. They are, in order, Elo HAN OT (OT means the formula treats an overtime win or loss as a tie), my Improved ARPI, and Elo K30.

    Next comes a final group, a relatively large distance behind the fourth group. They are, in order, the 2010 ARPI, 2009 ARPI, my Pure URPI, and the 2012 ARPI.

    You'll note that the version of the ARPI the NCAA currently is using -- the 2012 ARPI -- is the worst of all the systems.

    Why does the Elo K40 system perform better when the formula does not include a game location adjustment? I've demonstrated that systems, absent some compensating adjustment, have trouble rating the conferences and regions in a single system due to the limited number of games among teams from the different conferences and different regions. And, I've demonstrated that this tends to cause the systems to underrate teams from stronger conferences and overrate teams from weaker conferences. I've also demonstrated that teams from stronger conferences and regions tend to have favorable home field imbalances. And, I've theorized that these favorable home field imbalances at least partly cancel out, in the ratings, the rating systems' tendency to discriminate based on conference and region strength. The fact that the Elo K40 system performs better when its formula does not include a game location adjustment seems to support my theory -- the discounting of home field advantage balances out discrimination among conferences and regions and thus provides a better rating system.
     
    Hooked003 and kolabear repped this.
  6. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Here's a question I'd love to hear some thoughts about:

    Any mathematical rating system is going to be imperfect when it comes to ratings correlating with game results. My experience says that for Division I women's soccer (and excluding systems that consider only non-conference games), ratings are going to correlate with game results a little over 70% of the time. The exact correlation rate will vary from system to system, but all will be in roughly the 71 to 73% range. Since there are about 3,000 games per season, a difference in correlation rate of 0.1% means the "more accurate" system got 3 more games "right" per year out of the 3,000 games played; 1% means 30 more games right; and 2% means 60 more games. Generally, when I talk about accuracy, this is what I'm talking about.

    So, conversely, any mathematical system also is going to get 27 to 29% of the games wrong. One of the questions I've focused on is, "Which games is a system getting wrong?" Ideally, from my perspective, the games a system is getting wrong will be randomly distributed. I would call a system in which those games are distributed perfectly randomly the most "fair" system, looking at fairness as something separate from accuracy. In that context, what if those games are not distributed randomly? This is a question I've focused on, and specifically on the question of whether a rating system systematically distributes those games not randomly but rather in relation to the pools within which teams play -- either conference pools or regional pools -- including (but not limited to), as a fairness "subset," in relation to the strengths of the pools within which teams play. (There might be other ways in which the distribution of "wrong" games could occur systematically, but if there are I haven't figured out what they might be.)

    So, here's the question: How does one balance the desire for accuracy against the desire for fairness? Is one more important than the other? And, more specifically, if one is willing to trade off some of one for the other, how much?

    I'm asking this question because something I think I may be seeing is that accuracy and fairness don't go together. It's looking to me like it's possible that if I want to maximize accuracy, I have to give up some fairness; and if I want to maximize fairness, I have to give up some accuracy. If both accuracy and fairness are important, what's the right balance?

    Any opinions?
     
  7. Hooked003

    Hooked003 Member

    Jan 28, 2014
    Why would anybody ever want random errors? At best, you would have to qualify every rating with a +/- to account for the possibility of the errors. For example, would a bettor ever prefer random errors to systematic errors? I don't believe a bettor would prefer random errors.

    On one level, it would appear to me that systematic errors are much more desirable errors, provided one can do the amazingly difficult job of identifying and quantifying them (which is what you seem to be attempting to do--props to you). With identified systematic errors, one can take them into account to reduce their impact on the choice a bettor makes or a selection committee makes.
     
  8. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #58 cpthomas, Apr 29, 2015
    Last edited: Apr 29, 2015
    Yes, you have a good point if the NCAA were willing to make adjustments to take systematic errors into account. In fact, I've developed a rating system that does exactly that -- it provides for rating adjustments based on which regional playing pool a team is in and the average pre-adjustment rating of the teams in that region. However, the regional playing pools are not recognized by the NCAA and given other NCAA positions it's almost impossible to believe the NCAA ever would adopt a system with regional adjustments.

    The other aspect of this is that in part you're thinking of bettors, who are interested in predicting the outcomes of future games. On the other hand, for NCAA purposes rating systems are not to predict the outcomes of future games but rather are to measure past performance. These in fact are two different things.

    But still, you're right if one is allowed to make adjustments within the rating system to offset the effects of systematic errors. If you can do that, then you've eliminated some errors. Interestingly, however, the same question remains: you're still going to have errors in the 27 to 29% range. In other words, you may have eliminated systematic errors, but in doing so you may have introduced additional random errors, possibly even more than you've eliminated. Is that a worthy trade-off?
     
  9. Hooked003

    Hooked003 Member

    Jan 28, 2014
    If one is only looking backwards from end of regular season and conference championships, then I could accept random errors over systematic errors. My thinking is that, over time (i.e., many years), the random errors would balance out (at least somewhat) whereas the systematic errors would always be present.
     
  10. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    I think you're right that the random errors will balance out over time -- they will if they're truly random. And, the systematic errors definitely always are present. I've done yearly analyses as well as analyses of the last eight years combined and the same pattern always occurs whether looked at by year or over time.

    Nevertheless, your thoughts about systematic errors are interesting, if one can get the Women's Soccer Committee to take them into consideration. A problem is that the NCAA's RPI staff never has been willing to admit that the RPI systematically discriminates both in relation to conference strength and in relation to region strength. That makes it hard for Committee members to take discrimination into account, at least in an overt way. Plus, some of the Committee members' constituents are favored by the discrimination.

    The good news is that the RPI isn't the only input the Committee considers in making at large selections and seeding teams. The bad news is that the RPI's systematic discrimination may be causing some teams to not even get into the at large bubble or into the seed candidate groups when they at least should be getting consideration for at large selections or seeds.
     
  11. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    I don't want to start a new thread, but I've just finished a big project looking at a bunch of different rating systems for Division I women's soccer, and this thread title seems appropriate, so here come a bunch of posts on the RPI and alternative rating systems. I'll offer both basic information and my own opinions. The information and opinions are based on game results in the nine seasons from 2007 through 2015. That's a total of 27,841 games or just under 3,100 per year. Of those games, 8,431 involved at least one team ranked in the Top 60 in the NCAA Adjusted RPI rankings or right about 935 per year. The Top 60 is an important group, since all NCAA Tournament seeds and at large selections historically (over the last 9 years) have come from that group.

    In evaluating rating systems, I consider the following the most important and reliable measures of how rating systems perform:

    Overall correlation of ratings with results

    This looks at the following, after making rating adjustments based on game locations: What percentage of games did teams win where the ratings said they should win? What percentage did they lose where the ratings said they should win? What percentage did they tie?

    Although in my work I break this down into groups of games based on how closely rated the teams were, for reasons of data limitations I believe looking at all games, regardless of rating difference, is the best way to look at correlations between ratings and results.
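
    As a rough sketch, the overall correlation computation looks something like this. The game representation and the home-field adjustment value here are simplifications for illustration, and how ties are counted toward a "match" is an assumption on my part.

        # Sketch of the "results match ratings" measure. Each game is a tuple
        # (home_rating, away_rating, result), with result "home", "away", or "tie".
        # hfa is the rating credit given to the home team; the default value is
        # only a placeholder, not the actual adjustment amount.
        def correlation_breakdown(games, hfa=0.01):
            # Returns the share of games the rating favorite won, lost, and tied.
            won = lost = tied = 0
            for home_rating, away_rating, result in games:
                favorite = "home" if home_rating + hfa >= away_rating else "away"
                if result == "tie":
                    tied += 1
                elif result == favorite:
                    won += 1
                else:
                    lost += 1
            n = len(games)
            return {"favorite_won": won / n, "favorite_lost": lost / n, "tied": tied / n}
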
    For the Top 60 teams, correlation of ratings with results

    This looks at the same questions as the overall correlation measure, but only for games involving at least one Top 60 team.
    For regional playing pools, how well the teams in each regional pool, on average, perform in relation to their ratings

    The regional playing pools are groups of teams I've derived, based primarily on the regional groupings within which teams play either the majority or the plurality of their games. There are five such groups: Middle, Northeast, Southeast, Southwest, and West. The playing pools have no NCAA status or recognition.

    To measure how well a region's teams perform, I use a computerized system described here: Performance by Groups of Teams in Relation to Their Ratings. The system produces a performance percentage for each region. If the performance percentage is 100%, then the region's teams performed exactly as one would expect a "normal" or "average" region's teams to perform based on its teams' ratings. If the performance percentage is over 100%, then the region's teams are performing better than normal, which means that the rating system is underrating them on average. If the performance percentage is below 100%, then the region's teams are performing more poorly than normal, which means the rating system is overrating them on average.

    When I look at this, I look at three subsets of information:

    1. What is the difference between the performance percentage of the region that is most underrated and the performance percentage of the region that is the most overrated? This is a measure of how discriminatory the system is to the regions to which it is most disadvantageous and most advantageous.

    2. What is the cumulative amount by which the performance percentages of all the regions either exceed or are less than 100%? This is a measure of how close the system comes to getting the regions' ratings exactly right.

    3. Is there a trend in the regions' performance percentages when the regions are ordered from the one with the highest average rating to the one with the poorest average rating? This is a measure of whether, and if so how much, the system discriminates among regions in relation to the regions' teams' strength.
    For conferences, how well the teams in each conference, on average, perform in relation to their ratings

    For this, I use the same performance measurement system and look at the same three subsets of information as I do for the regions, but for conferences instead.
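
    Assuming the performance percentages already have been computed for each group (by the method linked above), the three subset measures can be sketched as follows. The trend calculation shown is only a simple proxy (strongest half of the groups versus weakest half); my actual trend measure may differ.

        # Sketch of the three subset measures, given each group's performance
        # percentage (100.0 = performs exactly as its ratings say it should)
        # and each group's average rating.
        def spread(perf):
            # 1. High-to-low performance spread.
            return max(perf.values()) - min(perf.values())

        def cumulative_variance(perf):
            # 2. Total distance of all groups from 100%.
            return sum(abs(p - 100.0) for p in perf.values())

        def strength_trend(perf, avg_rating):
            # 3. Illustrative proxy for the strength-related trend: compare the
            # average performance of the stronger half of the groups to that of
            # the weaker half. Positive = stronger groups underrated.
            ordered = sorted(perf, key=lambda g: avg_rating[g], reverse=True)
            half = len(ordered) // 2
            strong = sum(perf[g] for g in ordered[:half]) / half
            weak = sum(perf[g] for g in ordered[-half:]) / half
            return strong - weak
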
    To get a start, I'll show how a number of NCAA Division I women's soccer RPI rating systems perform. For each rating system, I've applied that system to all of the 2007 through 2015 game results and then run the system's resulting ratings through my correlation system to see how the rating system performs.

    Unadjusted RPI

    The URPI is the backbone of the NCAA's rating system. Here's how it performs:

    Overall correlation:

    Results match ratings: 72.6%

    Top 60 teams' results match ratings: 78.0%
    Regions:

    High to low performance spread: 32.2%

    Cumulative variance from "normal": 39.5%

    Discrimination based on regional strength: 20%

    (A positive percentage represents discrimination against stronger regions; a negative percentage would represent discrimination against weaker regions. 0% would mean the rating system does not discriminate in relation to regional strength.)
    Conferences:

    High to low performance spread: 30.0%

    Cumulative variance from "normal": 158.9%

    Discrimination based on conference strength: 18%
    2015 ARPI

    This is the system the NCAA currently uses. Here's how it performs:

    Overall correlation:

    Results match ratings: 72.6%

    Note: This is the same as for the URPI. If there were a difference, each 0.1% would represent about 3 games a year in which the rating system with the higher percentage got the right game result and the system with the lower percentage didn't.

    Top 60 teams' results match ratings: 77.9%

    Note: This is 0.1% poorer than the URPI. For the Top 60 teams, each 0.1% represents about 1 game a year difference in how the two systems performed.
    Regions:

    High to low performance spread: 31.7%

    Cumulative variance from "normal": 38.9%

    Discrimination based on region strength: 20%
    Conferences:

    High to low performance spread: 28.6%

    Cumulative variance from "normal": 163.2%

    Discrimination based on conference strength: 17%
    Opinion: The differences between how the URPI and the 2015 ARPI perform are inconsequential. Given how the ARPI the Women's Soccer Committee currently is using performs, there's no good reason it shouldn't jettison it and use the URPI instead.

    2009 ARPI

    This is the system the Women's Soccer Committee used from 2007 through 2009. It has higher adjustment amounts than the 2015 ARPI, and it makes adjustments for results of in-conference games as well as non-conference games, whereas the 2015 ARPI doesn't make adjustments for in-conference results.

    Overall correlation:

    Results match ratings: 72.7%

    Top 60 teams' results match ratings: 78.0%
    Regions:

    High to low performance spread: 33.2%

    Cumulative variance from "normal": 39.4%

    Discrimination based on region strength: 20%
    Conferences:

    High to low performance spread: 25.9%

    Cumulative variance from "normal": 143.0%

    Discrimination based on conference strength: 12%
    Opinion: The differences between the 2009 ARPI, the 2015 ARPI, and the URPI are inconsequential except as to conferences. The 2009 ARPI discriminates based on conference strength, but less than the 2015 ARPI and the URPI. In other words, when the Women's Soccer Committee moved to the 2015 ARPI, it did nothing to improve the ratings and made them more discriminatory in relation to conference strength.

    Adjusted Non-Conference RPI

    The ANCRPI produces ratings based only on a team's non-conference games. The NCAA's stated purpose for this is to avoid rating distortions produced by including a team's conference games when rating it. The downside of this is that for a team, its data include fewer games. Here's how it performs:

    Overall correlation:

    Results match ratings: 68.3%

    The result of using fewer games is a sacrifice in accuracy. The ANCRPI ratings miss about 130 games more per year (out of 3,100) than either the URPI or the 2015 ARPI.
    Top 60 teams' results match ratings: 73.2%

    The ANCRPI ratings miss about 47 games more per year (out of 935) than the 2015 ARPI.
    Regions:

    High to low performance spread: 18.9%

    Cumulative variance from "normal": 23.3%

    Discrimination based on region strength: 15%
    Conferences:

    High to low performance spread: 21.9%

    Cumulative variance from "normal": 111.8%

    Discrimination based on conference strength: -1%
    Opinion: The ANCRPI eliminates discrimination in relation to conference strength and does better for conferences, on the other measures, than do the other RPI systems. The ANCRPI still discriminates against regions based on region strength, but does better on this than the other RPI systems and also does better than the other RPI systems on the other measures. The trade-off for this, however, is that the ANCRPI's ratings overall and for games involving Top 60 teams are significantly less accurate than those of the other RPI systems.

    The NCAA leaves it to the individual Women's Soccer Committee members to decide how much weight they want to assign to the ARPI version currently in use, to the URPI, and to the ANCRPI. As a matter of policy, however, the NCAA's use of the ANCRPI establishes a precedent for using a rating system that is significantly less accurate than the other RPI systems. Thus if a different rating system were recommended to the NCAA, it would not be legitimate for the NCAA to reject it simply on the basis of it being less accurate than the other RPI systems, so long as it were at least as accurate as the ANCRPI.

    In future posts, I'll move on to other systems. The reason I've covered the NCAA's own RPI variations first is to illustrate how I evaluate systems and to establish a baseline for comparison with other systems.



     
    Soccerhunter repped this.
  12. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Next, I'll tackle Elo-like systems. Arpad Elo originally developed his system to rate chess players. It's worth pointing out that a chess player is the same person throughout his/her chess career. The player's skills may change over time, but the process of change likely will be gradual. When a new player begins competing, the player receives a provisional rating at the median of all players' ratings. This does not become a regular rating until the player has played 30 games. The reason for this is that the chess world considers the rating system to need 30 games in order to move the player from his/her provisional rating to his/her true rating.

    There are a number of variables one can use for an Elo-like system. Examples are game location, score differential where contests produce scores, and contest significance (e.g., for international women's soccer, whether the contest is a World Cup or Olympic contest, a lesser but major tournament contest, or a simple friendly). Another variable (sometimes used in relation to weighting contest significance) is how much weight to assign each game. The higher the weight assigned to each game, the greater the movement to a new rating following each game. The weight assigned to each next game is the K factor.

    I developed a relatively simple Elo-like system for Division I women's soccer. It takes game location into account, but does not use score differential, contest significance, or other similar variables. It doesn't use these because the NCAA has made clear it considers such variables as non-starters for its rating systems. I experimented with different K factors to see what level of K factor performed the best, using my performance measurement system. For Division I women's soccer, a K factor of 70 produces the best performance.

    There are two basic ways you can run an Elo-like system for an NCAA season. One way is to start all teams, at the beginning of the season, with a "common seed." What this means is that at the start of each season, every team starts with the same rating. Typically, that is the median rating from prior years. The other way is to give each team a unique seed based on historic ratings. For a seasonal sport like Division I women's soccer, the most logical starting ratings would be the teams' ratings as of the end of the preceding season. For chess, this is the way it would work if there were such things as chess seasons. It may be obvious, but I'll nevertheless point out that although this seems appropriate for chess, it would be a controversial approach for college soccer. The reason is that for chess the player who ended the previous season is the same as the one beginning the new season, whereas for college soccer there has been a turnover of one quarter of the team, on average, from the previous season. Whether this really makes a difference, and if so how much, is debatable. In any event, however, the NCAA has been adamant that it never will start a season with teams having unique seeded ratings. Statisticians who prefer Elo-like systems, on the other hand, will say that this is crazy and that, given enough games, the effect of the starting seeded ratings will disappear (at least mostly).
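
    In code terms, the only difference between the two approaches is the dictionary of starting ratings; the game-by-game updates that follow are the same either way. Here's a minimal sketch, with 1500 standing in as a placeholder for "the median rating":

        # Sketch: the two ways to seed a season's starting ratings. The 1500
        # common-seed value is only a placeholder for the median rating.
        def season_start_ratings(teams, prior_ratings=None, common_seed=1500.0):
            if prior_ratings is None:
                # Common seed: every team starts the season at the same rating.
                return {team: common_seed for team in teams}
            # Unique seeds: carry over end-of-prior-season ratings, falling back
            # to the common seed for teams with no prior rating.
            return {team: prior_ratings.get(team, common_seed) for team in teams}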

    As a test, and to evaluate an Elo-like system with a common starting seed as applied to Division I women's soccer seasons, I ran Elo ratings for the 2007 through 2015 seasons. I did it using a K factor of 70, since that K factor maximized the performance of my Elo-like system that used the prior year's end-of-season ratings as the beginning point for each succeeding year. Here's how the common starting seed system performed:

    Elo, Common Seed

    Overall correlation:

    Results match ratings: 70.2%

    Top 60 teams' results match ratings: 76.1%

    Note: These correlations are better than the ANCRPI, but not as good as any of the NCAA's RPI systems.
    Regions:

    High to low performance spread: 38.1%

    Cumulative variance from "normal": 69.0%

    Discrimination based on region strength: 31%
    Conferences:

    High to low performance spread: 69.4%

    Cumulative variance from "normal": 466.4%

    Discrimination based on conference strength: 60%
    If you compare these region and conference results to those for the NCAA's different RPI variations, including the ANCRPI, the performance of an Elo-like system with a common starting seed is abysmal. It does a terrible job at rating the different regional playing pools and the different conferences in a single system.

    Why? It does a terrible job because:

    1. Using a starting common seeded rating is based on a fallacy. Teams are not remotely close to being of equal strength at the beginning of a season.

    2. On average, during an NCAA season, teams play just short of 20 games. That number of games is not close to enough to overcome starting with common seeded ratings.
    As applied to conferences and regions, using a common seed rating for teams means that all conferences and all regions start the season considered equal. With each team playing, on average, just under 20 games, there are not enough games to get the conferences and regions to their correct average ratings. Thus the strongest conferences and regions, at the end of the season, are underrated and the weakest are overrated. This, in turn, produces the poor performance numbers for the Elo-like ratings when they are compared to the conferences' and regions' actual performance.

    Now, let's look at how Elo-like systems perform that don't use common seeds but instead use (or, for proprietary systems, most likely use) end-of-prior-season ratings as the starting ratings for each succeeding season. I have three such systems to show how they work: my own system, Albyn Jones' system, and Kenneth Massey's system. I'm not sure Massey's system is an Elo-like system, but I suspect it is because of the performance results it produces. I have fewer data for both the Jones and Massey systems, but I believe I have enough data for them to show how Elo-like systems can perform when using starting seeds based on the prior year's end-of-season ratings.

    Here's how they do:

    Overall correlations:

    Results match ratings:

    CPT Elo: 72.7%
    Jones: 72.9%
    Massey: 72.8%
    Top 60 teams' results match ratings:

    CPT Elo: 77.1%
    Jones: 77.6%
    Massey: 78.3%

    Note: Massey's 1.2% improvement over CPT Elo means he gets approximately 12 more games "right" per year than CPT Elo, out of roughly 935 games.
    Regions:

    High to low performance spread:

    CPT Elo: 3.4%
    Jones: 8.9%
    Massey: 3.7%
    Cumulative variance from "normal":

    CPT Elo: 10.0%
    Jones: 13.0%
    Massey: 6.2%
    Discrimination based on region strength:

    CPT Elo: 1%
    Jones: -1%
    Massey: -2%
    Conferences:

    High to low performance spread:

    CPT Elo: 21.7%
    Jones: 22.8%
    Massey: 21.4%
    Cumulative variance from "normal":

    CPT Elo: 100.8%
    Jones: 107.3%
    Massey: 88.2%
    Discrimination based on conference strength:

    CPT Elo: 5%
    Jones: -2%
    Massey: 7%
    Opinion: For practical purposes, the Elo-like systems are as accurate as the RPI systems and significantly more accurate than the ANCRPI. More important, they do not discriminate among regions and conferences in relation to region and conference strength (the % numbers for discrimination are not substantial, when compared to how other systems do). And, they perform much better in relation to the other region and conference measures. Thus unlike the RPI, the Elo-like systems are able to fairly rank teams from the different regions and conferences within a single national system.

    There is, however, a trade-off. As demonstrated by the results for an Elo system using a common seeded rating, when a team's starting rating is far from its actual rating, 20 games is not enough to get the team reasonably close to its true rating. This has significance for Elo-like systems that use the prior season's end-of-season ratings as a starting point. From time to time, there will be a team with a very high rating at the end of the prior season that will drop off a whole lot going into the next season. For this team, its 20 games will not be enough to drop its rating as far as it should. Indeed, it may not even drop the team's rating far enough to knock it out of the rating level at which it ordinarily would be a lock for an at large selection to the NCAA Tournament. UCLA, in the transition from the 2014 to the 2015 season, is an example of this exact case. Conversely, there can be a team with a poor rating at the end of the prior season that is a whole lot stronger going into the next season. For this team, its 20 games will not be enough to raise its rating as far as it should. And indeed, it may not raise the team's rating enough to put it into contention for an at large selection when it should be in contention. This is a serious trade-off, and in the past the NCAA has been adamant that this is a trade-off it will not make. That has been its justification for refusing to consider rating systems that use season-starting seeds based on past performance. For that reason, the NCAA rejects Elo-like systems as appropriate for use by the Women's Soccer Committee.
     
    Soccerhunter, Cliveworshipper and Gilmoy repped this.
  13. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    In addition to running performance tests for the NCAA's RPI variations and Elo-like systems, I've created and tested some of my own variations of the RPI. I've done this to see if it's possible to create variations of the RPI that perform better. From these, I've selected several that I'll cover in this and succeeding posts. The first one is the CPT Improved ARPI.

    For this RPI variation, I make three changes to the 2015 ARPI:

    1. I go back to the adjustment amounts and approach the NCAA used from 2007 through 2009. This means that the adjustment amounts (bonuses and penalties) are higher; they apply to all games and not just to non-conference games; and they are awarded based on results against teams in the same tiers that the NCAA used from 2007 through 2009;

    2. I use what I call the "pure" RPI. In the NCAA's formula, RPI Element 2 for Team A is the average of Team A's opponents' winning percentages against teams other than Team A. For the pure RPI, Element 2 instead is the average of Team A's opponents' full winning percentages (including the opponents' results against Team A).

    3. I apply a second set of adjustments based on regional playing pools. Teams can receive bonuses, no adjustments, or penalties based on the average rating of the teams within their regions. The purpose of this is to see if it's possible to address the RPI's problem of discriminating among regions in relation to region strength through these regional adjustments.
    Here's how the CPT Improved ARPI performs:

    Overall correlations:

    Results match ratings: 72.7%

    Top 60 teams' results match ratings: 77.9%
    Regions:

    High to low performance spread: 22.6%

    Cumulative variance from "normal": 34.2%

    Discrimination based on region strength: 8%
    Conferences:

    High to low performance spread: 22.2%

    Cumulative variance from "normal": 133.5%

    Discrimination based on conference strength: 10%
    Opinion: The CPT Improved ARPI is as accurate as the RPI formulas the NCAA has used and significantly outperforms them for regions and conferences. It is significantly more accurate than the NCAA's ANCRPI; does better than the ANCRPI for regions due to its better performance there as to discrimination based on region strength; and does somewhat more poorly for conferences than the ANCRPI.

    Realistically speaking, the NCAA is not going to consider a system like this. It does not recognize the existence of regional playing pools nor does it have a history of awarding bonus and penalty adjustments based on region strength. So, although the CPT Improved ARPI indicates that the RPI could benefit from a system of regional bonuses and penalties, the only use of it appears to be to show that such a system could be better than the versions of the RPI the NCAA has been using.
     
    Soccerhunter repped this.
  14. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    The second of my RPI variations is the CPT 5 Iteration URPI.

    I developed this system to address a specific problem of the RPI. The way the NCAA's RPI formula works, the RPI consists of two parts, each with a roughly 50% effective weight. The first part is Team A's winning percentage. The second part is Team A's strength of schedule. The problem my system addresses has to do with the way the NCAA computes strength of schedule. The problem is this:

    Using the NCAA's strength of schedule formula, you can rank teams based on how much they contribute to their opponents' strengths of schedule. And, of course, you also can rank teams based on their ARPI ratings. For a rating system, you would expect that a team's rating-system rank would be the same as, or at least nearly the same as, its rank in terms of what it contributes to an opponent's strength of schedule, right? Well, for the NCAA's RPI it doesn't work that way. In fact, teams' ARPI ranks, and their ranks in terms of what they contribute to opponents' strengths of schedule, can be quite different. This is part of what causes the ARPI's region and conference discrimination, and it creates incentives for teams seeking NCAA Tournament at large selections to attempt to schedule non-conference opponents in ways that will allow them to benefit from this RPI defect.
    To correct this problem, the CPT 5 Iteration URPI works this way:

    1. First Iteration: I compute the Unadjusted RPI just as the NCAA does.

    2. Second Iteration: For each team, I then compute the average URPI of its opponents, which I then use as a new strength of schedule portion of the RPI. I combine the team's winning percentage and the average URPI of its opponents to produce the 2 Iteration URPI. I combine them in a manner that gives the team's winning percentage an effective weight of 50% and its second iteration strength of schedule an effective weight of 50%.

    3. Third Iteration: For each team, I then compute the average 2 Iteration URPI of its opponents and go through the same process to produce the 3 Iteration URPI. Again, I weight winning percentage and strength of schedule at 50% each.

    4. Fourth Iteration: Same process.

    5. Fifth Iteration: Same process.
    With the CPT 5 Iteration URPI, teams' RPI ranks and their ranks in terms of what they contribute to their opponents' strengths of schedule are virtually the same. Thus the NCAA RPI's problem is solved and there no longer is a reason for teams to try to "game" the system based on that problem.
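
    A rough sketch of the iteration loop follows. Note that in the sketch the 50/50 combination is written as a plain average of winning percentage and the iterated strength of schedule; the weighting I actually use to get equal effective weights is more involved, so treat the arithmetic here as illustrative only.

        # Sketch of the multi-iteration idea. wp[t] is team t's winning percentage,
        # urpi[t] is its standard NCAA Unadjusted RPI (the first iteration), and
        # opponents[t] lists t's opponents (with repeats for rematches).
        def iterated_urpi(wp, urpi, opponents, iterations=5):
            rating = dict(urpi)  # iteration 1: the standard URPI
            for _ in range(iterations - 1):
                # New strength of schedule: average of opponents' current ratings.
                sos = {t: sum(rating[o] for o in opponents[t]) / len(opponents[t])
                       for t in opponents}
                # Combine winning percentage and strength of schedule. A plain
                # 0.5/0.5 average stands in for the "50% effective weight" rule.
                rating = {t: 0.5 * wp[t] + 0.5 * sos[t] for t in opponents}
            return rating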

    Here's how the CPT Iteration 5 URPI system performs:

    Overall correlations:

    Results match ratings: 73.2%

    Note: Although its margin over the other better-performing systems is small, this in fact is the best-performing system under this performance measure.
    Top 60 teams' results match ratings: 78.1%
    Regions:

    High to low performance spread: 17.4%

    Cumulative variance from "normal": 24.0%

    Discrimination based on region strength: 12%
    Conferences:

    High to low performance spread: 22.7%

    Cumulative variance from "normal": 91.7%

    Discrimination based on conference strength: 6%
    Opinion: This RPI variation is as accurate as the NCAA's RPI versions and performs significantly better as to regions and conferences. For practical purposes, it has minimal conference-related discrimination and quite low region-related discrimination. As for the NCAA's ANCRPI, this RPI variation is significantly more accurate; and is roughly equal as to how it handles regions and conferences. Given this, the 5 Iteration URPI would be a helpful and excellent replacement for the NCAA's ARPI and would dispose of the NCAA's need to use the significantly less accurate ANCRPI to help with the RPI's discrimination problem in relation to conferences. Furthermore, the 5 Iteration URPI meets all of the NCAA's criteria for a rating system and does not violate any of the limitations the NCAA says are "musts" for any rating system it would consider. Simply put, I do not believe the NCAA could come up with a single legitimate reason for not replacing its current version of the RPI with the 5 Iteration URPI.
     
    Soccerhunter repped this.
  15. Soccerhunter

    Soccerhunter Member+

    Sep 12, 2009
    All excellent work. Good reading.
     
  16. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    In my recent posts, I've provided information on rating systems I've reviewed and, for some, developed previously, but that I updated recently with 2015 results. I've also developed some new rating systems. I based these on work that I recently posted about on the NCAA DI Tournament Bracketology thread. That work included establishing scoring systems to measure how teams perform in relation to the NCAA's criteria for at large selections (and seeding) for the NCAA Tournament.

    My hope -- which I thought might be unrealistic -- was that by including some of those scoring systems in a rating system, I would be able to produce a system that outperformed the systems I've described above. I was right in one respect -- my hope was unrealistic. Having done this work, as well as having developed numerous other possible rating systems over the last 10 years, I feel secure in saying this: The rating systems that I've discussed, in terms of measuring how teams have performed over the course of a season, are as good as you're going to get. So far as measuring how teams have performed over the course of a season, mathematical rating systems have limits and the better of the systems I've covered are about at those limits.

    With that said, in the new rating systems I've developed, I included various combinations, in various formulas, of the following:

    2015 ARPI
    Top 50 Results Score (using my scoring system)
    Below 50 Results Score (essentially, a score based on poor results, using my scoring system)
    Head to Head Results v Top 60 Teams Score (using my scoring system)
    Common Opponents with Top 60 Teams Score (using my scoring system)
    Conference Standing Combined with Conference Average ARPI Rank
    (For Conference Standing, I use the average of a team's regular season conference finishing position and its conference tournament finishing position.)
    Of these, to my surprise the best was the simplest: a combination of ARPI and Conference Standing Combined with Conference Average ARPI Rank. The formula is simple:

    Rating = ARPI + 1/(Conference Standing + Conference Rank)
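
    For example, with entirely made-up numbers -- a team with a .6300 ARPI that finished 2nd in its conference regular season and 3rd in its conference tournament, playing in the conference ranked 4th by average ARPI -- the computation would be:

        # Worked example with hypothetical numbers.
        arpi = 0.6300                      # team's ARPI (hypothetical)
        conference_standing = (2 + 3) / 2  # average of regular-season and tournament finish
        conference_rank = 4                # conference's rank by average ARPI (hypothetical)
        rating = arpi + 1 / (conference_standing + conference_rank)
        print(round(rating, 4))            # 0.63 + 1/6.5 = 0.7838
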
    Here's how this CPT ARPI Modified by Conference Standing and Conference Rank system performs:

    Overall Correlations:

    Results match ratings: 72.7%

    Top 60 teams' results match ratings: 77.2%
    Regions:

    High to low performance spread: 30.9%

    Cumulative variance from "normal": 36.0%

    Discrimination based on region strength: 16.0%
    Conferences:

    High to low performance spread: 20.4%

    Cumulative variance from "normal": 127.9%

    Discrimination based on conference strength: 4%
    Opinion: This system is in the accuracy range of the better systems and is significantly more accurate than the ANCRPI. For regions, it is better than the NCAA's ARPI systems, though not quite as good as the ANCRPI in random variability and about the same as the ANCRPI in discrimination based on region strength. It does not have the ARPI's conference problem, and is roughly equal to the ANCRPI in that respect. Given its significantly better accuracy than the ANCRPI, it would be a good replacement for the ANCRPI.

    I like this system because it is very simple as a modification to the ARPI.

    COMING NEXT: A comparison of all the systems I've recently posted about, in table form.
     
  17. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Here is a series of tables comparing the systems I've recently posted about. They compare the systems based on the performance measurements I've used for the systems in each of the above system descriptions.

    Results Match Ratings

    [image: comparison table]

    This is how the systems perform, looking simply at the % of times game results match ratings (adjusted for home field advantage). The data sample is 27,841 games, or about 3,100 games per season. A difference of 0.1% thus means a difference of roughly 3 games per year, out of 3,100, that one system got correct and another didn't. As you can see, the CPT 5 Iteration URPI is the winner. But, being the winner is inconsequential until you get to the last two systems on the table.

    Top 60 Teams' Ratings Match Results

    [image: comparison table]

    This is how the systems perform, looking simply at the % of times the Top 60 teams' game results match ratings (adjusted for home field advantage). The data sample is 8,363 games, or about 935 games per season. A difference of 0.1% thus means a difference of roughly 1 game per year, out of 935, that one system got correct and another didn't. Here, Massey is the winner. But, being the winner is inconsequential until you get pretty far down on the table.

    Comment: These two tables show how hard it is to improve the accuracy of mathematical rating systems, at least for Division I women's soccer. What you're seeing in the best performers in these two tables almost certainly is close to the best one can do in terms of accuracy.

    Regions High to Low Performance Spread

    [image: comparison table]

    This is one measure of fairness in the rating of teams from different regions within a single national rating system. The Elo-based systems do this the best. Then come the CPT Iteration 5 URPI, the CPT Improved ARPI, and the NCAA's ANCRPI. Significantly farther down are the NCAA's other systems. The worst is the Elo system that uses a common starting rating at the beginning of each year.

    Regions Cumulative Variance from "Normal"

    [image: comparison table]

    This is another measure of systems' fairness as to regions. Again, the Elo-based systems perform the best. Then come the 5 Iteration URPI and the NCAA's ANCRPI. All the other systems perform significantly more poorly, with the Elo, common starting rating system far down in performance.

    Discrimination Based on Region Strength

    [image: comparison table]

    For discrimination based on region strength, the Elo-based systems perform the best (except for the always poorly performing Elo system using a common starting rating). The CPT Improved ARPI and 5 Iteration URPI come next, then the NCAA's ANCRPI and the combo of the 2015 ARPI and Conference Standing and Conference Rank. The NCAA's other RPI versions are at the bottom, except for the common starting rating Elo system.

    Conferences High to Low Performance Spread

    [image: comparison table]

    In this measure of fairness as to conferences, the differences are smaller except for the dismal performance of the Elo system that starts with common ratings. Most interesting to me here is the excellent performance of the system that combines the NCAA's 2015 ARPI with Conference Standing and Conference Rank.

    Conferences Cumulative Variance from "Normal"

    [image: comparison table]

    Here, Massey and the 5 Iteration URPI are the best, followed some distance back by the CPT Elo system and Jones. I suspect that Jones' not-so-good performance is not real but rather is due to the limited data I have for his system. Next comes the NCAA's ANCRPI. Significantly farther down come the other systems. At the bottom, the Elo system using a common starting rating is abysmal.

    Discrimination Based on Conference Strength

    [image: comparison table]

    For this performance measure, the three Elo-based systems, the 5 Iteration URPI, the NCAA's ANCRPI, and the combined 2015 ARPI/Conference Standing and Conference Rank systems really are about the same and deal quite well with the RPI's problem rating conferences within a single national system. The CPT Improved ARPI and the NCAA's 2009 ARPI are a little behind them. Then, farther down come the NCAA's current ARPI and its URPI. Once again, the Elo system using a common starting rating is terrible.

    Opinion: If one accepts that the NCAA never will move to an Elo-based system that doesn't start all teams with a common rating, then there's a very good case for the NCAA moving to the 5 Iteration URPI. The NCAA often has stated that it would be willing to consider an alternative to the RPI provided that the system meets the NCAA's basic policy positions (e.g., it won't start with initial ratings based on past performance; and it won't consider goal differential). The 5 Iteration URPI meets all of the NCAA's policy positions and outperforms the NCAA's current RPI. I honestly can't think of a good, legitimate reason why the NCAA wouldn't use it.
     
    orange crusader repped this.
  18. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    The preceding group of posts, and particularly the one just preceding this one, help make a point about rating systems. If you only are concerned about how well a system's ratings match the results of the games from which they are derived, there's very little difference among the "better" of the systems:

    All Games: When you look at how the systems' ratings match results for all games, of the "better" systems, the CPT 5 Iteration URPI does the best with a 73.2% match rate and the NCAA's current version of the ARPI does the poorest with a 72.6% match rate. Each 0.1% represents a 3 game difference per year, so the difference in the times ratings match game results between the two is 18 games per year out of 3,100.

    Top 60 Teams' Games: Here, Massey is the best at a 78.3% match rate. The CPT Elo system is the poorest at a 77.1% match rate. Each 0.1% represents a 1 game difference per year, so the difference in matches is 12 games per year out of 935.
    These differences in the rate at which ratings match results are insignificant, to me. This certainly should be true from an NCAA perspective, since the NCAA is willing to use the Adjusted Non-Conference RPI, with a match rate much poorer than the above numbers: At 68.3% for all games, it misses 147 more games per year, out of 3,100, than the CPT 5 Iteration URPI; and at 73.2% for the Top 60 teams' games, it misses 51 more games per year, out of 935, than Massey.

    What this means to me is that when I look at the "better" systems, accuracy is not a good basis for choosing one system over another. Their accuracy, for practical purposes, is the same. Rather, the best basis for choosing must come from looking to see if there are patterns to where systems' ratings are inconsistent with results. There are two ways in which a system's ratings can "miss" results. The system's distribution of misses can be random. Or, the distribution can follow a pattern that discriminates against some identifiable groups of teams and in favor of others.

    My position is that it is better if a system distributes misses randomly rather than in a way that discriminates among identifiable groups of teams.

    In one respect, the NCAA has taken a contrary position. It has stated that it will not consider a system that gives teams initial ratings based on past years' performance. Its rationale for this is that such a system would discriminate against teams that have not been outstanding in prior years and in favor of teams that have. On the other hand, as the preceding group of posts shows, the NCAA's ARPI on average discriminates against teams from stronger conferences and regions and in favor of those from weaker ones. The NCAA's RPI staff is sophisticated about rating systems, and I am confident they know this even though they won't admit it. So, they have taken a position in favor of that discrimination as a better alternative to discriminating against teams that have not been outstanding in prior years. Further, I don't believe this is simply a choice between the lesser of two evils. The NCAA's 2009 ARPI had less conference- and region-based discrimination than its current version of the ARPI. Indeed, the NCAA has made successive changes in the ARPI that have made it more and more discriminatory. This suggests to me that the NCAA actually finds some level of discrimination among conferences and regions to be desirable -- it doesn't want the NCAA Tournament's field of at large selections to consist exclusively of teams from the major conferences. Instead, it wants to give teams from the mid-majors an edge, if slight, and likewise wants to ensure that a team with a poorer history but great success this year has a maximized opportunity to get an at large selection.

    The NCAA's position would be defensible, at least in part, if it necessarily were a choice between the lesser of two evils. But, as I've stated, the NCAA not only has chosen one of the evils but has made it more evil. And, with the 5 Iteration URPI, there is an alternative that effectively eliminates the "evil" as to conferences and greatly reduces it as to regions, but that also does not discriminate against a mid-major "Cinderella" team. I suspect that the NCAA is well aware of the possibility of its using a multiple iteration RPI -- it's much too obvious an alternative system for it not to have taken a look at it. If I'm right, then this suggests that the NCAA -- perhaps at a staff level -- supports a level of discrimination that helps historically weaker teams as compared to historically stronger teams. In other words, it doesn't want a rating system that maximizes the random distribution of the system's rating "misses."

    I admit that I'm attributing a lot of expertise to the NCAA's rating staff. Perhaps I'm wrong and they don't really understand how the RPI works. But I doubt it.
     
  19. Gilmoy

    Gilmoy Member+

    Jun 14, 2005
    Pullman, Washington
    Nat'l Team:
    United States
    I recognize that from my last RPG :p I worked for The Council for 90% of the game, thinking they were the Good Guys, until that semi-climactic Big Reveal scene where I present my evidence, their faces go all demonic, they laugh at me, and I wake up in The Maze.

    Then I get out of the maze, pop up in the middle of the throne room ... and that's the climactic battle at the end. Everybody cast Anti-Magic Shell or we'll be up to our ears in summoners summoning other summoners ...

    Hint: If you wake up in The Maze tomorrow, just map it, then read your map as text ;)
     
    cpthomas repped this.
  20. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    #70 cpthomas, Mar 26, 2016
    Last edited: Mar 26, 2016
    In my preceding series of posts, after ruling out the Elo-based systems because they use starting ratings based on past performance and the NCAA has a policy against doing that, I suggested that the 5 Iteration URPI would be an improvement over the NCAA's 2015 ARPI. I also suggested that the NCAA would not have any legitimate reason for rejecting the 5 Iteration URPI.

    Since then, I've looked at a series of refinements to the 5 Iteration URPI that improve it. There is nothing novel about the refinements -- all of them (but one, which I ultimately rejected) are refinements the NCAA has applied in the past to its own URPI. In this post, I'll discuss the series of refinements. Hopefully, this will give some insight into the "right" way to go about making refinements to improve the RPI.

    First, I'll start out describing some basic aspects of the NCAA's current -- 2015 -- Adjusted RPI formula:

    URPI: The NCAA starts out with its basic Unadjusted RPI, composed of a team's winning percentage, the average of its opponents' winning percentages against other teams, and the average of its opponents' opponents' winning percentages. The second element, the average of its opponents' winning percentages against other teams, is odd because of the "against other teams" language. It creates an oddity where what Team B contributes to an opponent's rating varies from opponent to opponent, depending on how many times Team B played the opponent and what the game results were. Thus although each opponent played the same Team B, Team B's contribution to each opponent's rating is not the same. I've experimented with an RPI basic structure variation that instead uses the average of a team's opponents' winning percentages, eliminating the "against other teams" limitation. I call this the "Pure" URPI. This is one of the 5 Iteration URPI refinements I tried out.
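
    Here's a small sketch of the oddity, using a made-up mini-schedule (Python, illustration only; it computes just the opponents' winning percentage element, not the full RPI with its official weights):

Code:
# The "against other teams" oddity: under the standard element, each
# opponent's winning percentage is computed excluding its games against the
# team being rated, so the same Team B contributes different percentages to
# different opponents' ratings. The "Pure" variant keeps all games in.
results = [  # (team, opponent, result): 1.0 win, 0.5 tie, 0.0 loss
    ("A", "B", 1.0), ("B", "A", 0.0),
    ("B", "C", 1.0), ("C", "B", 0.0),
    ("B", "D", 0.5), ("D", "B", 0.5),
    ("C", "D", 1.0), ("D", "C", 0.0),
]

def win_pct(team, exclude_opponent=None):
    games = [r for (t, o, r) in results
             if t == team and o != exclude_opponent]
    return sum(games) / len(games) if games else 0.0

def owp(team, pure=False):
    """Average of a team's opponents' winning percentages."""
    opps = [o for (t, o, _) in results if t == team]
    return sum(win_pct(o, None if pure else team) for o in opps) / len(opps)

print("standard OWP for A:", owp("A"))        # B's pct excludes B vs A games
print("Pure OWP for A:    ", owp("A", True))  # B's pct includes all B games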

    Adjusted RPI: The NCAA makes adjustments to the URPI, to give teams bonuses for good wins or ties and to penalize them for poor losses or ties. The adjustments vary from sport to sport and in some cases the NCAA does not make any adjustments. The relevant sport committee decides what the adjustment rules will be -- in our case, the Division I Women's Soccer Committee. The NCAA's current adjustments for Division I women's soccer follow this format:

    1. There are ranking tiers that determine the amounts of the adjustments. For Division I women's soccer, the Committee always has used two bonus ranking tiers and two penalty ranking tiers. Within each tier, the amount of a bonus or penalty depends on whether the game was home, away, or at a neutral site.

    2. The Committee also decides whether to (a) apply bonuses and penalties for all games or (b) apply them only for non-conference games.

    Starting at least as early as 2009, and for several years after, the Committee's RPI formula decisions called for relatively high adjustment amounts for good wins/ties and poor losses/ties. The two bonus tiers covered wins/ties against teams ranked 1-40 (higher bonuses) and 41-80 (lower bonuses). The two penalty tiers covered losses/ties against teams ranked 136-205 (lower penalties) and 206 and poorer (higher penalties). The bonuses and penalties applied to all games, both conference and non-conference.

    A few years ago, however, the Committee changed the adjustment formula. It reduced the amounts of the adjustments. It left the two bonus tiers "as is," but it changed the penalty tiers so that the higher penalties applied only to losses/ties against the 40 most poorly ranked teams and the lower penalties applied to the next group of 40 most poorly ranked teams. And, it eliminated bonuses and penalties for conference games so that they now apply only to non-conference games. As I've pointed out many times, the effect of these changes has been to increase the RPI's discrimination against the stronger conferences and regions and in favor of the weaker conferences and regions.​
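
    To make the structure of the adjustments concrete, here is a rough sketch (Python). The tier boundaries follow the description above; the point amounts and the home/away/neutral variation are placeholders of my own, not the Committee's actual figures:

Code:
# Sketch of the bonus/penalty structure. Tier boundaries follow the 2009-era
# description above; the AMOUNTS values are placeholders, not real figures.
AMOUNTS = {
    ("bonus_high", "home"): 0.0020, ("bonus_high", "neutral"): 0.0024,
    ("bonus_high", "away"): 0.0028,
    ("bonus_low",  "home"): 0.0010, ("bonus_low",  "neutral"): 0.0012,
    ("bonus_low",  "away"): 0.0014,
    ("pen_low",    "home"): -0.0010, ("pen_low",    "neutral"): -0.0012,
    ("pen_low",    "away"): -0.0008,
    ("pen_high",   "home"): -0.0020, ("pen_high",   "neutral"): -0.0024,
    ("pen_high",   "away"): -0.0016,
}

def adjustment(result, opp_rank, venue, conference_game, all_games=True):
    """Bonus/penalty for one game; result is 'W', 'T', or 'L'."""
    if conference_game and not all_games:
        return 0.0  # 2015-style: only non-conference games get adjustments
    if result in ("W", "T") and opp_rank <= 40:
        return AMOUNTS[("bonus_high", venue)]
    if result in ("W", "T") and opp_rank <= 80:
        return AMOUNTS[("bonus_low", venue)]
    if result in ("L", "T") and 136 <= opp_rank <= 205:
        return AMOUNTS[("pen_low", venue)]
    if result in ("L", "T") and opp_rank >= 206:
        return AMOUNTS[("pen_high", venue)]
    return 0.0

# A team's Adjusted RPI is then its URPI plus the sum of its per-game adjustments.
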
    With the above as context, here is the series of refinements I made:

    1. Replace URPI with Pure URPI. This produced what I concluded was a slight improvement:

    Results Match Ratings: URPI 72.6%; Pure URPI 72.5% (higher is better)

    Top 60 Teams' Results Match Ratings: URPI 78.0%; Pure URPI 77.9% (higher is better)

    Regions High to Low Performance Spread: URPI 32.4%; Pure URPI 30.9% (lower is better)

    Regions Cumulative Variance from "Normal": URPI 40.5%; Pure URPI 39.2% (lower is better)

    Discrimination Based on Region Strength: URPI 20%; Pure URPI 20% (lower is better)

    Conferences High to Low Performance Spread: URPI 30.5%; Pure URPI 26.2% (lower is better)

    Conferences Cumulative Variance from "Normal": URPI 165.6%; Pure URPI 155.3% (lower is better)

    Discrimination Based on Conference Strength: URPI 17%; Pure URPI 14% (lower is better)

    2. Replace the Pure URPI with the 5 Iteration Pure URPI (a rough sketch of the general iteration idea follows the numbers below). This produced a more significant improvement:

    Results Match Ratings: Pure URPI 72.5%; 5 It Pure URPI 73.2%

    Top 60 Teams' Results Match Ratings: Pure URPI 77.9%; 5 It Pure URPI 78.1%

    Regions High to Low Performance Spread: Pure URPI 30.9%; 5 It Pure URPI 17.4%

    Regions Cumulative Variance from "Normal": Pure URPI 39.2%; 5 It Pure URPI 23.9%

    Discrimination Based on Region Strength: Pure URPI 20%; 5 It Pure URPI 12%

    Conferences High to Low Performance Spread: Pure URPI 26.2%; 5 It Pure URPI 23.5%

    Conferences Cumulative Variance from "Normal": Pure URPI 155.3%; 5 It Pure URPI 87.6%

    Discrimination Based on Conference Strength: Pure URPI 14%; 5 It Pure URPI 5%
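
    As promised above, here is a rough sketch of the general iteration idea. I'm not reproducing the exact mechanics of the 5 Iteration formula here; the point is only the shape of the process, where each pass recomputes a team's strength of schedule using its opponents' ratings from the previous pass rather than their raw winning percentages (the 50/50 weights below are placeholders, not the RPI's real weights):

Code:
# Rough sketch of an iterated strength-of-schedule rating (illustration of
# the idea only, not the exact 5 Iteration formula). Weights are placeholders.
results = [  # (team, opponent, result): 1.0 win, 0.5 tie, 0.0 loss
    ("A", "B", 1.0), ("B", "A", 0.0),
    ("B", "C", 0.5), ("C", "B", 0.5),
    ("C", "A", 0.0), ("A", "C", 1.0),
]
teams = {t for (t, _, _) in results}

def win_pct(team):
    games = [r for (t, _, r) in results if t == team]
    return sum(games) / len(games)

def opponents(team):
    return [o for (t, o, _) in results if t == team]

ratings = {t: win_pct(t) for t in teams}  # starting point: winning percentage
for _ in range(5):  # five passes
    sos = {t: sum(ratings[o] for o in opponents(t)) / len(opponents(t))
           for t in teams}
    ratings = {t: 0.5 * win_pct(t) + 0.5 * sos[t] for t in teams}

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
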
    3. Replace the 5 Iteration Pure URPI with the 5 Iteration Pure Adjusted RPI, using the 2015 (current) bonus and penalty amounts and tiers and limiting the bonuses and penalties to non-conference games. This produced, on balance, a slight improvement:

    Results Match Ratings: 5 It Pure URPI 73.2%; 5 It Pure ARPI (2015) 73.2%

    Top 60 Teams' Results Match Ratings: 5 It Pure URPI 78.1%; 5 It Pure ARPI (2015) 77.9%

    Regions High to Low Performance Spread: 5 It Pure URPI 17.4%; 5 It Pure ARPI (2015) 17.5%

    Regions Cumulative Variance from "Normal": 5 It Pure URPI 23.9%; 5 It Pure ARPI (2015) 24.2%

    Discrimination Based on Region Strength: 5 It Pure URPI 12%; 5 It Pure ARPI (2015) 11%

    Conferences High to Low Performance Spread: 5 It Pure URPI 23.5%; 5 It Pure ARPI (2015) 21.6%

    Conferences Cumulative Variance from "Normal": 5 It Pure URPI 87.6%; 5 It Pure ARPI (2015) 78.4%

    Discrimination Based on Conference Strength: 5 It Pure URPI 5%; 5 It Pure ARPI (2015) 5%
    4. Replace the 5 Iteration Pure ARPI (2015) with the 5 Iteration Pure ARPI using the 2009 bonus and penalty amounts and tiers, still limited to non-conference games. This produced a further slight improvement:

    Results Match Ratings: 5 It Pure ARPI (2015) 73.2%; 5 It Pure ARPI (2009, NC) 72.1%

    Top 60 Teams' Results Match Ratings: 5 It Pure ARPI (2015) 77.9%; 5 It Pure ARPI (2009, NC) 77.9%

    Regions High to Low Performance Spread: 5 It Pure ARPI (2015) 17.5%; 5 It Pure ARPI (2009, NC) 16.9%

    Regions Cumulative Variance from "Normal": 5 It Pure ARPI (2015) 24.2%; 5 It Pure ARPI (2009, NC) 23.3%

    Discrimination Based on Region Strength: 5 It Pure ARPI (2015) 11%; 5 It Pure ARPI (2009, NC) 11%

    Conferences High to Low Performance Spread: 5 It Pure ARPI (2015) 21.6%; 5 It Pure ARPI (2009, NC) 20.0%

    Conferences Cumulative Variance from "Normal": 5 It Pure ARPI (2015) 78.4%; 5 It Pure ARPI (2009, NC) 73.7%

    Discrimination Based on Conference Strength: 5 It Pure ARPI (2015) 5%; 5 It Pure ARPI (2009, NC) 4%
    5. Replace the 5 Iteration Pure ARPI (2009, NC) with the 5 Iteration Pure ARPI (2009, All), in other words, extend the bonuses and penalties to include both conference and non-conference games. This produced another improvement, including the elimination of all discrimination based on conference strength:

    Results Match Ratings: 5 It Pure ARPI (2009, NC) 73.1%; 5 It Pure ARPI (2009, All) 73.2%

    Top 60 Teams' Results Match Ratings: 5 It Pure ARPI (2009, NC) 77.9%; 5 It Pure ARPI (2009, All) 77.9%

    Regions High to Low Performance Spread: 5 It Pure ARPI (2009, NC) 16.9%; 5 It Pure ARPI (2009, All) 17.8%

    Regions Cumulative Variance from "Normal": 5 It Pure ARPI (2009, NC) 23.3%; 5 It Pure ARPI (2009, All) 26.9%

    Discrimination Based on Region Strength: 5 It Pure ARPI (2009, NC) 11%; 5 It Pure ARPI (2009, All) 11%

    Conferences High to Low Performance Spread: 5 It Pure ARPI (2009, NC) 20.0%; 5 It Pure ARPI (2009, All) 17.6%

    Conferences Cumulative Variance from "Normal": 5 It Pure ARPI (2009, NC) 73.7%; 5 It Pure ARPI (2009, All) 72.6%

    Discrimination Based on Conference Strength: 5 It Pure ARPI (2009, NC) 4%; 5 It Pure ARPI (2009, All) 0%
    6. Replace the 5 It Pure ARPI (2009, All) with the 5 It ARPI (2009, All). I did this on an educated hunch, reverting from the Pure Unadjusted RPI as the formula base to the regular URPI base that the NCAA uses. It produced a final slight improvement. For this "final" product, I'll show the comparison to both the Pure version of the formula and to the 2015 ARPI the NCAA currently uses:

    Results Match Ratings: NCAA 2015 ARPI 72.6%; 5 It Pure ARPI (2009, All) 73.2%; 5 It ARPI (2009, All) 73.2%

    Top 60 Teams' Results Match Ratings: NCAA 2015 ARPI 77.9%; 5 It Pure ARPI (2009, All) 77.9%; 5 It ARPI (2009, All) 77.8%

    Regions High to Low Performance Spread: NCAA 2015 ARPI 31.3%; 5 It Pure ARPI (2009, All) 17.8%; 5 It ARPI (2009, All) 16.1%

    Regions Cumulative Variance from "Normal": NCAA 2015 ARPI 38.8%; 5 It Pure ARPI (2009, All) 26.9%; 5 It ARPI (2009, All) 24.9%

    Discrimination Based on Region Strength: NCAA 2015 ARPI 20%; 5 It Pure ARPI (2009, All) 11%; 5 It ARPI (2009, All) 10%

    Conferences High to Low Performance Spread: NCAA 2015 ARPI 29.2%; 5 It Pure ARPI (2009, All) 17.6%; 5 It ARPI (2009, All) 16.4%

    Conferences Cumulative Variance from "Normal": NCAA 2015 ARPI 161.2%; 5 It Pure ARPI (2009, All) 72.6%; 5 It ARPI (2009, All) 75.6%

    Discrimination Based on Conference Strength: NCAA 2015 ARPI 16%; 5 It Pure ARPI (2009, All) 0%; 5 It ARPI (2009, All) 0%
    My conclusion is that I'm very happy with the 5 Iteration ARPI using the 2009 bonus and penalty adjustment amounts and tiers, applied to both conference and non-conference games. In terms of accuracy and fairness, it is as good a system as I can develop (apart from an Elo-based system using starting ratings based on earlier seasons' history), it's superior to the NCAA's ARPI, and there's no plausible reason I can think of for the NCAA not to use it.

    Furthermore, as the following shows, the 5 Iteration ARPI (2009, All) disposes of the need for the NCAA to use the significantly less accurate Non-Conference ARPI:

    Results Match Ratings: NCAA 2015 NCARPI 68.3%; 5 It ARPI (2009, All) 73.2%

    Top 60 Teams' Results Match Ratings: NCAA 2015 NCARPI 73.2%; 5 It ARPI (2009, All) 77.8%

    Regions High to Low Performance Spread: NCAA 2015 NCARPI 18.0%; 5 It ARPI (2009, All) 16.1%

    Regions Cumulative Variance from "Normal": NCAA 2015 NCARPI 22.3%; 5 It ARPI (2009, All) 24.9%

    Discrimination Based on Region Strength: NCAA 2015 NCARPI 14%; 5 It ARPI (2009, All) 10%

    Conferences High to Low Performance Spread: NCAA 2015 NCARPI 20.9%; 5 It ARPI (2009, All) 16.4%

    Conferences Cumulative Variance from "Normal": NCAA 2015 NCARPI 110.2%; 5 It ARPI (2009, All) 75.6%

    Discrimination Based on Conference Strength: NCAA 2015 NCARPI -2%; 5 It ARPI (2009, All) 0%​
     
  21. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    In the preceding post, I think I demonstrated why the 5 Iteration ARPI, using the NCAA's 2009 bonus and penalty amounts and tiers, with bonuses and penalties for both conference and non-conference games, is significantly superior to the NCAA's current 2015 ARPI (and NCARPI, also). There's another reason, however, that makes the case for the 5 Iteration ARPI even stronger.

    A serious problem of the NCAA's ARPI is that a team's ARPI rank and its rank in terms of what it contributes to its opponents' strengths of schedule can be very different. Thus Team A can contribute a lot more to an opponent's strength of schedule than Team A's ARPI rank merits; and conversely Team B can contribute a lot less to an opponent's strength of schedule than Team B's ARPI rank merits. This, on its own, is a problem. But, to make the problem worse, it gives potential future bubble teams an incentive to try to "game" the RPI by scheduling non-conference opponents whose strength of schedule contributions are likely to be higher than their ARPI rankings.

    To illustrate this, I prepared a table for the 2015 season. In the first column, the table shows the difference between teams' NCAA 2015 ARPI rank and teams' NCAA 2015 ARPI Strength of Schedule contribution rank -- the formula for each entry in the column is: ARPI Rank - SoS Contribution Rank. I've arranged the column from the lowest number to the highest. A 0 (zero) means the ranks match; a - (negative) number means the SoS rank is poorer than the ARPI rank; a + (positive) number means the SoS rank is better than the ARPI rank. In the second column, the table shows the differences for the 5 Iteration URPI. In the third column, the table shows the differences for the 5 Iteration ARPI (2009, All).

    The table demonstrates well why teams have incentive to try to game the NCAA's ARPI in their scheduling of non-conference opponents. There are some huge differences between teams' ARPI ranks and their SoS contribution ranks. On the other hand, for the 5 Iteration ARPI and URPI, the differences are much smaller and, based on my practical experience, not sufficient to make trying to game the system worthwhile.
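
    Just to pin down the arithmetic behind the first column (the other two columns work the same way), here's a tiny sketch with made-up ranks:

Code:
# ARPI Rank minus SoS Contribution Rank, sorted lowest to highest.
# The ranks below are made up for illustration.
arpi_rank = {"Team X": 45, "Team Y": 60, "Team Z": 120}
sos_contribution_rank = {"Team X": 80, "Team Y": 55, "Team Z": 118}

diffs = {t: arpi_rank[t] - sos_contribution_rank[t] for t in arpi_rank}
for team, diff in sorted(diffs.items(), key=lambda kv: kv[1]):
    # Negative: the team contributes less to opponents' SoS than its ARPI
    # rank suggests. Positive: it contributes more -- under the NCAA ARPI,
    # a tempting non-conference opponent for a would-be bubble team.
    print(team, diff)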

    [Table: for each team, ARPI Rank minus SoS Contribution Rank under the NCAA 2015 ARPI, the 5 Iteration URPI, and the 5 Iteration ARPI (2009, All), sorted from lowest to highest difference.]
     
  22. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    Recently, I gave the Women's Soccer Committee a proposal to change the RPI formula to address some of the current formula's major problems. The proposal is to change to what I call the 5 Iteration Adjusted RPI using the 2009 Bonus and Penalty Points regime. I'm not going to go into a detailed discussion of the proposed new formula here. For details on the proposed new formula, you can go to the "RPI: Modified RPI?" page of the RPI for Division I Women's Soccer website.

    As I show in detail at the "RPI: Modified RPI?" webpage, the 5 Iteration ARPI formula provides ratings that are at least as consistent with game results as the NCAA's ARPI versions. More important:
    • The 5 Iteration ARPI rates conferences more accurately in relation to each other (1) in terms of general fairness and (2) in relation to conference strength. In fact, the 5 Iteration ARPI eliminates the NCAA ARPI's biases in relation to conference strength.
    • The 5 Iteration ARPI rates the regional playing pools more accurately in relation to each other (1) in terms of general fairness and (2) in relation to region strength. It doesn't eliminate the biases in relation to region strength of the NCAA's ARPI, but it significantly reduces the biases.
    • The 5 Iteration ARPI, for practical purposes, eliminates the disconnect that the NCAA's ARPI versions have between teams' ARPI ranks and their ranks as contributors to opponents' strengths of schedule. Thus the 5 Iteration ARPI will eliminate the incentive and need for coaches of potential bubble teams to try to "game" the system in their scheduling of non-conference opponents. Under the 5 Iteration ARPI, an opponent's value in terms of contribution to your strength of schedule will be roughly the same as its actual rank, as distinguished from the NCAA's RPI versions, where an opponent's contribution to your strength of schedule can be very different from its actual rank.
    Again, I cover all of this in detail at the "RPI: Modified RPI?" page.

    There currently is an opening for changes to the rating formula that has not been there in the past. This is because, driven by basketball, the NCAA is in the process of taking a careful look at the RPI as well as at other rating formulas.

    So, if you're interested in the NCAA's rating system for Division I women's soccer, and in a way to greatly improve it, go to the "RPI: Modified RPI?" page mentioned above to get a sense of how the NCAA's current system works and of how a much better alternative -- and some other better systems -- work.
     
  23. cpthomas

    cpthomas BigSoccer Supporter

    Portland Thorns
    United States
    Jan 10, 2008
    Portland, Oregon
    Nat'l Team:
    United States
    I'll add a little note about the revised RPI-based rating system I've proposed to the Women's Soccer Committee:

    I just completed a test comparing the NCAA's current 2015 ARPI used for Division I Women's Soccer to my proposed 5 Iteration ARPI, in terms of which better matches the Women's Soccer Committee's actual seeding and at large selection decisions over the last 10 years. The 5 Iteration ARPI matches better, and by a decent margin as to at large selections.
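
    For what it's worth, the comparison is conceptually simple. This sketch only illustrates the kind of count involved, not my actual procedure, and the team names are made up:

Code:
# Illustration: for a season, count how many of the Committee's actual at
# large selections fall inside each rating's top group of candidate teams.
def overlap(actual_selections, rating_order, n_slots):
    top = set(rating_order[:n_slots])
    return sum(1 for team in actual_selections if team in top)

actual = ["Team A", "Team B", "Team C"]           # made-up selections
arpi_order = ["Team A", "Team D", "Team B", "Team E", "Team C"]
five_it_order = ["Team A", "Team B", "Team C", "Team D", "Team E"]
print(overlap(actual, arpi_order, 3), overlap(actual, five_it_order, 3))  # 2 3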
     
  24. orange crusader

    May 2, 2011
    Club:
    --other--
    Thanks for your work on this cp! I hope the NCAA implements your proposal.
     
