2012 RPI

Discussion in 'Women's College' started by cpthomas, Jan 9, 2012.

  1. cpthomas

    It's not the end of the season yet, so your USC example at this point may be illustrative, but in terms of scale it isn't valid. The RPI is designed only to be used at the end of the year. There no doubt will be other examples at the end of the year, but the scale may be different. Also, when you say this is "par for the course" for the RPI, it sounds like you are saying we should expect this much disparity for teams as a matter of course. Is that really what you mean?

    Also, I think I said this last year, but I'll say it again. As you've pointed out, the RPI has some internal consistency problems. They show up as some teams contributing more or less to their opponents' strengths of schedule than their actual ratings indicate is appropriate, sometimes by quite a bit. They are relatively easy to spot because the RPI's internal workings are fully exposed. But you have not been able to fully expose the internal workings of any of the systems you might advocate, so for all we know those systems have equally onerous problems.

    I think ultimately, the proof is in how well any system's ratings correlate with the game results of the season from which the ratings are derived. In that respect, I don't think it really matters whether one system or another is more "theoretically" correct. And, I don't think it really matters if a system has internal inconsistencies if its ratings correlate the best with game results. The job of the system challenger is to show that her or his proposed alternative will produce better correlations with game results. For the short period for which I had Jones ratings, I wasn't able to do that even though I wanted to. And, so far, I haven't been able to do it with Massey. Nor has anyone else that I know of.

    Now, if you really want to get peeved, think about using any one of these systems applied to the limited number of football games each year to produce BCS rankings that determine the distribution of hundreds of millions of dollars.
     
  2. kolabear

    I mentioned a way of "checking" an Elo system that I find reassuring and that gives a certain amount of validity to the whole approach. I can't say I exhaustively checked this out in previous years, but I did a fair amount of checking.

    The method is this: calculate, or re-calculate (or retroactively calculate), what a team's rating should be given a) its record, b) the ratings that the system (say Albyn Jones) assigned to all the teams (specifically, for the sake of the team whose rating you're recalculating, the ratings of all its opponents), and c) the rating scale defined for that rating system.

    Crude example: Team A won 10 games and lost 5 against teams with an average rating of 1500 according to Albyn Jones. According to the Albyn Jones scale, a team with a .667 record against its opponents should be rated 100 points higher than them. So Team A should have a rating of 1600.

    If you do this for all teams and for each and every team the resulting calculation comes out close to its rating, that tells you the system's pretty close, doesn't it? Of course it's not going to be exact - real life is messier than that. But if it's close for every single team, that tells you something, doesn't it? For just one team, Team A, maybe not. Maybe the system just happened to get it right. Or maybe the system was set up to make sure a few teams came out about right. But suppose it also worked out for every one of Team A's opponents - say we took Team B, calculated Team B's rating based on its results and the ratings of its opponents, and the recalculation was close. And then we took Team B's opponents and did the same thing, and so on and so on. Well, that tells you there's a proper relation between all these different teams, doesn't it? And that's exactly what the ratings are supposed to do.

    If the teams that are rated 1600 are beating the teams rated 1500 about 67% of the time, that's a sign the system's working. If they're losing 67% of the time to the teams rated 1700, that's a sign it's working. And if the teams rated 1400 are losing 67% of the time to the 1500 teams but winning 67% of the time against the teams rated 1300 - and so on and so on...
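    In code, the recalculation is simple. Here's a minimal Python sketch; the base-2 logistic form (100 points = .667 expected win percentage, 200 = .800, and so on) is my inference from the Albyn Jones scale points, and the numbers are made up:

        import math

        def implied_rating(avg_opponent_rating, win_pct):
            # Invert expected_win_pct = 1 / (1 + 2 ** (-diff / 100)) to get
            # the rating a team "should" have, given its win percentage and
            # its opponents' average rating, under the assumed scale.
            odds = win_pct / (1.0 - win_pct)
            return avg_opponent_rating + 100.0 * math.log2(odds)

        # The crude example above: 10 wins and 5 losses (.667) against
        # opponents averaging 1500 implies a rating of about 1600.
        print(round(implied_rating(1500, 10 / 15)))  # -> 1600

    Run that for every team, compare the implied rating to the published one, and see whether the two stay close across the whole pool.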

    So, as I've mentioned over the last few years, I've done this checking and illustrated it, first for Albyn Jones, then one year for a modified Massey Rating. And it worked out pretty well.

    This is a powerful level of validation that an Elo system can achieve and the RPI cannot, and it should give us some level of confidence in how that kind of system can work.

    Damn I miss Albyn Jones!
     
  3. cpthomas

    FYI, last year at the end of the season 10 teams out of 322 had a rating/strength-of-schedule contribution disparity at the level of, or greater than, what you set out for Southern California at the current stage this year:

    Arkansas Pine Bluff
    Rider
    SE Louisiana
    Houston Baptist
    North Dakota State

    Vanderbilt
    Northwestern U
    Southern California
    Arkansas U
    Oklahoma U

    For the top group, the team's contribution to strength of schedule was better than its RPI ranking says it should have been. For the bottom group, the team's SoS contribution was poorer than its RPI ranking says it should have been. Most likely, the top group was beating poor teams and the bottom group was losing to good teams.

    (169 out of 322 teams had an ARPI rank and a SoS contribution rank within 25 positions of each other.)
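    For anyone who wants to reproduce this kind of list, it's just a rank-difference calculation. A minimal Python sketch with made-up ranks (the real inputs would be the ARPI ranks and the SoS-contribution ranks for all 322 teams):

        def disparity_list(arpi_rank, sos_rank, threshold=100):
            # Flag teams whose SoS-contribution rank differs from their
            # ARPI rank by at least `threshold` positions.
            flagged = []
            for team, rank in arpi_rank.items():
                gap = sos_rank[team] - rank
                if abs(gap) >= threshold:
                    flagged.append((team, rank, sos_rank[team], gap))
            return sorted(flagged, key=lambda row: row[3])

        arpi_rank = {"Team A": 40, "Team B": 260, "Team C": 150}
        sos_rank = {"Team A": 150, "Team B": 120, "Team C": 160}
        print(disparity_list(arpi_rank, sos_rank))
        # -> [('Team B', 260, 120, -140), ('Team A', 40, 150, 110)]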

    The question, for RPI evaluation purposes, is whether these kinds of issues have an effect on the at large selections and seeds for the NCAA tournament. If they don't, then for practical purposes it doesn't matter since the only purpose of the RPI -- as distinguished from Jones, Massey, or a comparable system -- is to help with the selections and seeds.

    Your trusting the math people doesn't sound persuasive to me. After all, the RPI's basic structure was developed by mathematicians at Stanford.

    I've guessed that the NCAA gave the Stanford math people some specific instructions: the data used in the system must be limited to who the opponents were and the game results - win, loss, or tie (with the bonus/penalty system, which is dependent on game location, coming later); and the formula must be one that college coaches, athletics administrators, serious fans, and the media can understand (remember, the RPI was developed for basketball). If my guess is right, then the second instruction ruled out the kinds of systems you advocate. I think that instruction (if given) was a good one, especially since it has appeared to me so far that the RPI's results, for practical purposes, are about as good as the results of more sophisticated approaches (to the chagrin of those who advocate for them). My recollection is that you would rather trust the sophisticated approaches even if college coaches, administrators, fans, and the media can't understand them.
     
  4. kolabear

    Thanks for doing that research. Great work as always.

    That difference or disparity for USC is 100 places! That's a third of the entire Division 1! That takes you from top third to bottom third or vice versa!

    As for my trusting the mathematicians: a) it's not a blind trust, given the kind of checking I mentioned that we can do; b) I'm sure there are a lot of fans who understand the mathematics better than I do (I'm hardly the gold standard in that regard); c) fans in most of the major sports already accept the use of methods that are more complex than RPI, and they're not clamoring for a return to RPI. (The problems with football, the BCS, and the proposed playoff system have less to do with the computer ranking methods themselves than with problems peculiar to football itself - significantly fewer games, i.e., less data, but demand from fans and media for a national champion.) (And chess players and players of online games have accepted the use of Elo ratings as a matter of fact for years.)

    The RPI, as you've noted, does plausibly well despite its kluge-like nature. It has the added benefit of being double-checked by avid fans like you and the guys at nc-soccer. And fans benefit from your analysis as the season progresses and the final RPI takes shape.

    It meets the need for an objective measurement adequately - while far from ideal, it's better than relying on polls. But it's not surprising either that it's going to contain a handful of highly dubious results each year. It's fair to point them out, and to note that it seems unlikely such serious aberrations would show up in an Elo rating system.
     
  5. cpthomas

    Agreed.

    Assuming there are some aberrations in an Elo rating system -- i.e., using your checking method I assume teams don't always come in exactly where they should -- it would be really interesting to see a team-by-team analysis of how many positions each team is "off" to compare to a list for the RPI like the one I extracted the above teams from.

    Related to the BCS problem, the problem of not enough games is exacerbated by geographic considerations - not enough inter-regional "correspondence." Really, the use of so-called statistical ratings as part of the BCS formula probably is more for entertainment value than anything else.
     
  6. cpthomas

    The NCAA now has published its weekly RPI report for games through Sunday, October 14, here: http://www.ncaa.com/rankings/soccer-women/d1/ncaa_womens_soccer_rpi. My rankings and nc-soccer's match the NCAA's rankings exactly.

    The NCAA has not yet published its weekly Team Sheets, which contain teams' actual ratings. With the exact match in rankings, however, I'm confident the ratings also match exactly.

    It appears that the NCAA's data gathering system is working really well this year. I'm sure that's in large part due to a strong effort by the NCAA staff to get the schools (or conferences) to enter complete and accurate data into the NCAA's system. Credit to the NCAA staff!
     
  7. kolabear

    The "re-calculated" ratings aren't going to be off by much because, as I think someone like Pipsqueak or Craig P would explain it, "by definition" they have to be close. In other words, the algorithm is specifically designed to produce numbers that are internally consistent in this fashion. If Team A's "re-calculated" rating - the rating that I'm describing as a way of double-checking the system - if it was way off, then the algorithm would have rejected the entire batch of ratings and looked for a better batch of ratings so Team A's "recalculated" rating would've been closer, as well as for every other team in the rating pool.

    Pipsqueak and Craig P understand the math here much better than I do but I hope I'm giving a general sense of how this kind of rating works and I hope it's not too gross a distortion of what those two have said in the past.
     
  8. Tom81

    I'm late to this thread.
    I just clicked the link.
    Can you give me the "dummy version" of why Stanford, with a loss and a tie, is ahead of FSU?

    It's not a big deal, as soccer lets it all play out on the field.
    Just curious.
     
  9. cpthomas

    A team's RPI rating is a combination of its winning percentage and its strength of schedule. According to the RPI, Stanford's strength of schedule to date is substantially higher than Florida State's, by an amount that outweighs Florida State's better winning percentage. This accounts for the difference. If both teams win out, Stanford probably will remain ahead of Florida State in the RPI though by a lesser amount.

    One of Florida State's problems is that it played two non-conference games against what currently appear to be very weak opponents: College of Charleston and Jackson State.
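    For reference, the basic RPI combination (before the bonus/penalty adjustments) is 25% a team's own winning percentage, 50% its opponents' winning percentage, and 25% its opponents' opponents' winning percentage. A minimal Python sketch, with made-up numbers, shows how a stronger schedule can outweigh a better record:

        def rpi(wp, owp, oowp):
            # Element 1: own winning percentage (25%)
            # Element 2: opponents' winning percentage (50%)
            # Element 3: opponents' opponents' winning percentage (25%)
            return 0.25 * wp + 0.50 * owp + 0.25 * oowp

        print(rpi(0.85, 0.65, 0.55))  # strong schedule -> 0.675
        print(rpi(0.90, 0.55, 0.50))  # better record, weaker schedule -> 0.625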
     
  10. kolabear

    Whether it's RPI or Massey, Sagarin or an Elo system, it's not unusual for one team to be ranked above another team it lost to. It shouldn't be surprising either, because in the real world of sports you often have scenarios where team A beats team B and team B beats team C, but instead of team A beating team C, team C beats team A. Happens all the time.

    Having said that, and looking at the Massey Ratings and their records, Florida State is ranked ahead of Stanford in the Massey Ratings and quite frankly there's no way in an Elo system that Stanford would be ranked ahead of Florida State currently.
     
  11. cpthomas

    There is, however, another way to look at it: Florida State played two absolute gimmes. Suppose they had played two teams with sufficient strength for FSU to match Stanford's strength of schedule. Under that scenario, it's possible FSU would not still be undefeated.

    Given the way the Women's Soccer Committee seeds the NCAA Tournament, for tournament purposes it's probably academic. They seed four #1s, four #2s, etc. At this point both Stanford and FSU almost certainly would be getting #1 seeds. It's often thought that the team at the upper left of the bracket is the Committee's top seed out of the four #1s, although there's no way to prove that is what the Committee does. If that is what the Committee does and if FSU wins out, I wouldn't be surprised to see FSU in that spot, because the RPI is only one factor the Committee considers in seeding.
     
  12. Soccerhunter

    Thanks, Cliveworshipper and CPThomas, for suggesting this little article. The description of the RPI pretty well fits what I understood, but other than making distinctions between "merit based" and "predictive" models, and dynamic versus static, I gained no insight as to how Elo systems are supposed to work, because the download of this article did not yield any of the formulas in the boxes.

    But thanks for bringing this to my attention.
     
  13. Cliveworshipper

    You might try this site for a simplified insight into the workings of just the Elo system:

    http://leagueoflegends.wikia.com/wiki/Elo_rating_system

    Note the part of the discussion concerning few data points (in chess, 30 games or fewer) and how the K value is manipulated to get a chess player into the general range of his abilities. High initial K values move a player (or team) rapidly through the rankings, but late results count much more than early results. Low K values give a better view of a player's full body of work and are used later to differentiate closely ranked players. This is ONLY done after 30 or more games - more than an NCAA season's worth of games.

    So in chess, 30 data points is just considered a starting point for ranking. In some Elo-like ranking systems a player's ranking isn't even published until that point. It makes you wonder when folks make claims of superior accuracy and consistency with only 20 data points.

    This is the crux of the Elo issue in soccer. If you are going to count ONLY the results for one season, as the NCAA demands, you either start from scratch, in which case there aren't enough data points to adjust the K value and isolated recent results are out of proportion to the whole body of work, or you use previous years' results and adjust the K value so that by the end of a season the previous years' rankings are swamped by the new data.

    This second method is a no-no for the NCAA.

    The FIFA women's ranking, for example, uses a K value adjusted so that old results die out in about one World Cup cycle (90 games for the USA?), so individual recent results aren't valued appreciably more than old results for the year or so before the ranking is used for seeding. The ranking was first implemented in 2003 with a high K value; it was then reduced in 2005, in time to take effect in the next World Cup selection cycle.
    The problem is that this is too low a K value for a new team to make its mark in one cycle, or for a previously bad team to show its current "true ranking" in one cycle.

    The NCAA can't abide a system where everyone doesn't start even each year. It's a political decision as old as the history of the organization, and the Elo system just doesn't work well with one season's results, so it doesn't do any better than the RPI, for example, when applied that way.

    And as I said, in the world of the NCAA, it must be applied that way. Twenty results is all you get.

    I think the same argument applies to Albyn Jones' system. He never published how he constructed it, or whether, by the end of a year, the previous years' results that went into the initial seeding inhibit teams from showing their single-season rank. He also doesn't say if or how new or rapidly changing teams get their K value adjusted, so we can't see how this would affect a new team trying to compete in the rankings with an older, more established team.

    Without the published math, we just don't know.
     
  14. Cliveworshipper

    One correction: FIFA increased the K factor in 2005 because results weren't being reflected in the ranking in time for one cycle. It is important to have a K factor that reflects change fast enough, while still not making the newest data out of proportion to older data for the cycle.
     
  15. cpthomas

    Soccerhunter, I don't know if you have the same problem, but I have no idea what a K value is. I found out recently that a college classmate of mine is a professor of statistics, so I've been thinking about asking him if there's a text he can recommend that would be useful for understanding the use of statistical models for ranking sports teams. I think he was a football player, so maybe he knows of one.
     
  16. Cliveworshipper

    The K value isn't anything mysterious. It's just a number that scales the most recent result in order to determine how fast your current ranking changes. In so doing, it also determines how many of your most recently played games are statistically important. Each time you run the formula after a result, older results become a smaller percentage of the ranking. The larger K is, the more quickly old results effectively go away.

    The basic Elo ranking change formula looks like this, from the site I posted previously:

    R_new = R_old + K * (S - E)

    where S is the actual result (1 = win, 0.5 = draw, 0 = loss) and E is the expected result, E = 1 / (1 + 10^((R_opponent - R_old) / 400)).

    So the terms on the left of the + sign are the new and old rankings, and the term on the right of the + sign is the result of the win or loss, which is added to or subtracted from your old ranking to make it change.

    The size of K just determines how much the current result affects the change. It is set by whoever constructs the ranking, depending on how fast they WANT it to change.

    Since it is a recursive system, old results eventually become insignificant (think of it in terms of a half of a half of a half of a half: it quickly becomes a small number). They become insignificant sooner if K is large.
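    Here's a minimal Python sketch of that recursion (the 400-point logistic curve and K = 32 are common chess-style constants, used purely for illustration):

        def expected(r_own, r_opp):
            # Expected score under the standard Elo curve.
            return 1.0 / (1.0 + 10.0 ** ((r_opp - r_own) / 400.0))

        def update(r_old, r_opp, score, k=32):
            # score: 1 for a win, 0.5 for a draw, 0 for a loss.
            return r_old + k * (score - expected(r_old, r_opp))

        rating = 1500.0
        for _ in range(5):  # five straight wins over 1600-rated opponents
            rating = update(rating, 1600.0, 1.0)
            print(round(rating, 1))

    Each pass folds the old rating into the new one, so the earliest games' influence keeps shrinking; a larger K makes them fade faster.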
     
  17. Soccerhunter

    OK, this is all helpful. Thanks Clive.

    I am trying to simplify things for a basic understanding. This means trying to understand the big picture and context. (My personality type likes to first get a sense of the lay of the forest before trying to look at the trees in detail.)

    So what I am getting is this: In the world of ranking systems, there is, in fact, quite a bit of commonality.
    (1) All systems primarily rely on W-L data to make up the rank ordered list.
    (1.a) Systems based exclusively on W-L data are presumed to be more reliable and less open to capricious manipulation, and therefore should be preferred if the correct formulas could be agreed upon.
    (1.b) Even rankings based on polling presumes that the participants' opinions (votes) are significantly based on W-L data.
    (2) There is the recognition that other factors are involved beyond W-L data.
    (2.a) Which non-results based factors to use and how to weight them is the critical question and runs headlong into the desire to have a reliable system (see 1.a.)
    (2.b) The major difference between Elo and the RPI is dynamic versus static -- the dynamic weightings (K) being the principal non-W-L factor.

    The rest of the discussion is all quibbling over details - some of which may be significant in the minds of partisans.

    For NCAA soccer, my nascent understanding at this point would list some non W-L factors that might be included in the mix as:
    • Recent results vs early in the season (K)
    • how to weight W-L data against ranks of opponents (e.g., the NCAA's 25%/50%/25%)
    • mathematically getting the recursive aspects balanced right
    • minimum number of games
    • home field advantage
    • weighting of previous seasons and long term results history (K)
    • interregional balancing issues
    • interdivisional balancing (D-I vs D-III etc.)
    • whether to use individual game goal differentials
    • rest between games ("tournament teams")
    • roster issues (injury or redshirt status)
    • coaching (if a coach changes teams does the new team inherit any advantage?)
    • should polling opinions be factored in? This would be a nod to presumably "expert" or inside information regarding such factors as: the emotional state of a team, intangibles such as insight into specific player interactions (both athletic and personal), team versus team match-ups (technical vs athletic, etc.), and fitness and other specific preparation issues.
    Am I starting to get it?
     
  18. kolabear

    CliveW is right, the k-value isn't all that mysterious. It may be helpful to go back to chess ratings to understand some of the basics of an Elo rating system.

    Let's say my rating is 1600 and I play 10 games against opponents whose average rating is 1700 and I score 5 out of 10. Well, for those 10 games I played at the strength of a 1700 player, not a 1600 player. Is my rating then now 1700? If I played one game against a 1700 player and drew, would that make me a 1700 player? Obviously that doesn't make sense - one game doth not a master, or a 1700 player, make.

    So how many games does it take playing at a certain level to make that your rating? That's something the statistician (and the official organizing body) has to decide. That's where the k-value comes in. If the chess federation decides it's 30 (which is what I think cliveW says the chess fed is using these days), then, crudely, here's what my new rating would be:

    I played at a 1700 level over 10 games which is 100 points above my current rating. 10 games is 1/3 of the k-value of 30. So my rating increases by 1/3 x 100 points or 33 points. New rating = 1600 + 33 = 1633.
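    That crude arithmetic in Python (the linear games/k weighting is my simplification for illustration, not any federation's exact formula):

        def crude_new_rating(old_rating, performance_rating, games, k=30):
            # Weight the performance by the share of the k-value the games
            # represent: 10 games against k = 30 counts one third.
            weight = games / k
            return old_rating + weight * (performance_rating - old_rating)

        # Played at a 1700 level over 10 games, current rating 1600:
        print(round(crude_new_rating(1600, 1700, 10)))  # -> 1633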

    For our purposes - college women's soccer -- and specifically if an Elo system were to be used to determine the bracket, a k-value doesn't make sense at all because all games should count equally no matter whether they were played in the first week of the season or the last week.

    For a predictive system, like Massey, you might well want more recent games to be weighted more heavily than earlier ones (and they do) but I accept the criticism that this wouldn't be appropriate for a rating system being used officially to determine the bracket.
     
  19. kolabear

    I have to say the difference (between RPI and an Elo rating) is much more serious than that. They're completely different paradigms and I'll try to come back to that later.

    It may be useful again to think of chess ratings at first. Your rating depends on wins and losses, of course, but it also depends on the ratings of the people you're playing against. If your record is .500, it matters whether you're playing players rated 1300 or 1400 or 2000 (master territory). Now of course the RPI is trying, in a different way, to say who the opponent is matters too - by calculating the opponents' won-loss percentage (Element 2 of the RPI) and the won-loss percentage of their opponents, the opponents of the opponents (Element 3). But, as you can probably intuitively see, this tends not to be enough if your opponent is good but has only a .500 record because he/she/they are playing good opponents and not weak ones.

    Now of course the key difference between chess and college soccer is that in chess you have players with established ratings to gauge your results against, while in soccer you don't, unless you use past years' data - okay maybe for a fan's tool in the early part of a season, but understandably a no-no as far as the NCAA is concerned. How you overcome that problem is the next big trick, which I'll try to come back to a little later unless someone who really knows what they're talking about (like Pipsqueak or Craig P) comes around here first.
     
  20. cpthomas

    Soccerhunter, one thing I'd add is that in addition to considering win-loss records, all the systems also consider the strength of the opponents against whom the wins and losses were achieved. The hard question that the systems must address is how to meld those two considerations into a single rating number. The academically developed statistics systems, including an Elo system, do it in a way that is considered theoretically sound. The RPI does it in a way that most academics would consider not theoretically sound; it's a sort of clunky Rube Goldberg approach.

    BUT, the RPI actually produces rankings that are very close to what the other systems produce, in terms of the top 60 or so teams. This is all that matters from an RPI perspective since the RPI's sole purpose is to be used as one of several inputs for NCAA Tournament at large selection and seeding purposes, and the only teams that matter in the selection and seeding process are the top 60 or so.

    ALL of the systems have problems if there are distinct playing pools with little or no correspondence between pools (i.e., few or no inter-pool games). Thus, in the Elo system used for chess, if you have two playing pools of 100 chess players each, with no inter-pool games, and with the array of win-loss records being identical for the two pools, each pool's ratings will exactly mirror those of the other pool. The top players will have identical ratings, the #2 players will have identical ratings, and so on all the way down to the bottom of the pools. What this means is that if one pool is grandmasters and the other is players just beginning to play, from a rating perspective it will look like the players in the two pools are equal. Albyn Jones himself confirmed to me that insufficient correspondence between pools is a problem every system will have. The more correspondence between pools, the lesser the problem, but it takes a good deal of correspondence to make the problem negligible. I suspect that the theoretically sound systems do better at maximizing the benefit of correspondence between pools, but I don't know that for a fact and I definitely don't know (and doubt) that it would make much difference for NCAA Tournament at large selection and seeding purposes.
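    A toy simulation makes the pool problem concrete (a generic Elo updater with chess-style constants; the pools, schedules, and results are invented):

        def expected(r_a, r_b):
            return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

        def beats(ratings, a, b, k=32):
            # Team a beats team b; update both ratings.
            e = expected(ratings[a], ratings[b])
            ratings[a] += k * (1.0 - e)
            ratings[b] -= k * (1.0 - e)

        # Two pools with identical schedules and identical results,
        # and no inter-pool games at all:
        for pool in ("grandmasters", "beginners"):
            ratings = [1500.0, 1500.0, 1500.0]  # everyone starts even
            beats(ratings, 0, 1)  # team 0 beats team 1
            beats(ratings, 0, 2)  # team 0 beats team 2
            beats(ratings, 1, 2)  # team 1 beats team 2
            print(pool, [round(r) for r in ratings])

    Both pools print identical ratings even though one pool is far stronger in reality; only inter-pool games could reveal the difference.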

    As an interesting aside, you mentioned the possible need to consider game locations. Intuitively, that would seem like something to do. For Division I women's soccer, however, I don't think so. Going back to the playing pool problem, a result of the problem is that the RPI and the other systems tend to underrate strong pools (the grandmasters) and to overrate weak pools (the beginners). This is true for Division I women's soccer both in relation to conference pools and regional pools. Interestingly, home field imbalances also follow pool strength such that strong conferences and strong regions tend to have favorable home field imbalances. Thus although the rating systems' structures inherently tend to underrate strong pools and to overrate weak pools, home field imbalances push in the opposite direction of overrating strong pools and underrating weak pools. On average, however, the problem caused by the home field imbalances is smaller than the problem caused by the systems' inherent structures, so that the home field imbalance problem only partially offsets the inherent structure problem. That being the case, if one were to incorporate game locations into the basic structure, one would actually make the bigger problem even worse. This is a potential problem with using a system such as either Jones' or Massey's since both of them consider game locations and are supposed to produce ratings that have filtered out the benefits or dis-benefits teams have received from home field imbalances.

    Regarding the RPI's playing pool problem, one way the NCAA has tried to address it, from a conference perspective, is through use of the Non-Conference RPI, in which teams' results against their fellow conference members are disregarded. This cannot be done, however, for the regional playing pool problem for the simple reason that there are not enough inter-regional games -- in fact a large number of teams play no inter-regional games.
     
  21. cpthomas

    Something is going on at the NCAA related to its recently established policy of releasing the weekly Team Sheets reports that contain teams' actual RPI ratings as well as a treasure trove of other information about each team. Until this week, at the NCAA's RPI page (at the NCAA.org website, not the NCAA.com website), there was a link to the NCAA's RPI Archive. That link now is gone. Also until this week, the NCAA had been posting the weekly Team Sheets reports at the NCAA RPI Archive webpage. The NCAA has posted the reports on Tuesdays for games through Sundays, September 23, September 30, and October 7. This week, however, they posted no report for games through October 14.

    I'm not sure what this means. My concern is that it may mean the NCAA has had a policy reversal and will stop issuing the weekly Team Sheets reports. On the other hand, it is possible that non-publication of the October 14 report was inadvertent. The deletion from the NCAA's RPI page, of the link to the RPI Archive, however, makes me wonder whether a policy reversal is afoot.

    I've sent an email to the NCAA's person in charge of statistics for Division I women's soccer, who also manages the RPI for DI women's soccer, asking if the non-publication of this week's Team Sheets report was inadvertent. So far, I have not received a response.

    It will be very unfortunate if this signals that the NCAA has changed its mind about publishing the weekly Team Sheets reports.
     
  22. kolabear

    Here is a make-believe Elo rating for the top half of the Division 1 schools, based on games only through Oct 14, derived (bastardized) from the Massey Ratings and artificially re-scaled into the old Albyn Jones scale. Don't take this too seriously but it can be interesting to think about these things in Elo terms and can help some of us become more familiar with Elo-type ratings.

    I artificially translated the Massey Ratings into the Albyn Jones scale - so this can only be a very loose approximation of what one can properly infer from Massey's ratings (but Massey doesn't help us by telling us what scale he's using). In the Albyn Jones scale: 100 point rating difference = 66.7% expected win percentage for the higher-rated team. 200 points = .800; 300 points = .889; 400 points = .941

    Homefield advantage usually ranged between 50 and 60 rating points (at 60 points, between two evenly rated teams, the team with homefield would have about a .600 expected win percentage).
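    Those scale points fall on a base-2 logistic curve with a 100-point step - each 100 rating points doubles the win odds. A minimal Python sketch of that curve (the functional form is inferred from the quoted numbers, not anything Jones published):

        def expected_win_pct(rating_diff):
            # 100 -> .667, 200 -> .800, 300 -> .889, 400 -> .941
            return 1.0 / (1.0 + 2.0 ** (-rating_diff / 100.0))

        for diff in (100, 200, 300, 400):
            print(diff, round(expected_win_pct(diff), 3))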

    Team  Rating*
    1 Florida St 2115
    2 Stanford 2005
    3 Penn St 1995
    4 UCLA 1990
    5 BYU 1935
    6 No Carolina 1930
    7 Florida 1925
    8 Duke 1925
    9 San Diego St 1915
    10 Texas A&M 1895
    11 Missouri 1855
    12 Pepperdine 1850
    13 Virginia 1845
    14 California 1835
    15 Georgetown 1830
    16 Maryland 1820
    17 Michigan 1810
    18 Santa Clara 1810
    19 Wake Forest 1805
    20 Marquette 1805
    21 Virginia Tech 1790
    22 Baylor 1785
    23 Boston College 1775
    24 Ohio St 1770
    25 Oregon St 1760
    26 Tennessee 1755
    27 Wash St 1750
    28 Notre Dame 1735
    29 Washington 1730
    30 Minnesota 1730
    31 Texas Tech 1725
    32 San Diego 1715
    33 Long Beach St 1715
    34 West Virginia 1715
    35 UCF 1705
    36 Mississippi 1705
    37 Portland 1705
    38 Utah 1700
    39 Iowa 1700
    40 Wisconsin 1690
    41 Denver 1690
    42 Kentucky 1690
    43 Miami 1675
    44 Northridge 1670
    45 Ariz St 1655
    46 Illinois 1650
    47 Louisville 1645
    48 Auburn 1645
    49 Cent Mich 1640
    50 La Salle 1625
    51 USC 1625
    52 So. Florida 1625
    53 Cal Poly SLO 1625
    54 Colorado College 1625
    55 S F Austin 1620
    56 UC Irvine 1620
    57 Loy. Marymount 1620
    58 Arizona 1615
    59 CS Fullerton 1615
    60 Texas 1610
    61 Rutgers 1610
    62 Dayton 1605
    63 Oklahoma St 1605
    64 Dartmouth 1605
    65 New Mexico 1605
    66 Princeton 1590
    67 Miami Ohio 1590
    68 William & Mary 1590
    69 SMU 1580
    70 Kansas 1580
    71 Oregon 1575
    72 Arkansas 1575
    73 Alabama 1565
    74 Colorado 1560
    75 Iowa St 1560
    76 Boston U 1555
    77 Hofstra 1555
    78 LSU 1555
    79 Rice 1550
    80 Utah St 1550
    81 Vanderbilt 1550
    82 Indiana 1550
    83 Connecticut 1540
    84 Navy 1540
    85 Georgia 1540
    86 So. Carolina 1535
    87 Michigan St 1535
    88 Memphis 1535
    89 Purdue 1530
    90 Oklahoma 1525
    91 Charlotte 1525
    92 UNC Wilmington 1520
    93 No Texas 1520
    94 Syracuse 1520
    95 Nebraska 1515
    96 UNLV 1510
    97 Mid Tenn 1510
    98 East Carolina 1510
    99 Drexel 1505
    100 Fresno St 1495
    101 VCU 1490
    102 Colgate 1485
    103 Harvard 1475
    104 Greensboro 1475
    105 St Mary's 1470
    106 Tulsa 1470
    107 TCU 1470
    108 Houston 1470
    109 Kent 1465
    110 Samford 1465
    111 UC Davis 1460
    112 Villanova 1460
    113 Butler 1460
    114 DePaul 1460
    115 Miss St 1460
    116 Northwestern 1455
    117 Gonzaga 1450
    118 San Francisco 1450
    119 Clemson 1445
    120 Valparaiso 1445
    121 Pittsburgh 1440
    122 Seattle 1440
    123 Nevada 1430
    124 Florida Intl 1430
    125 Penn 1425
    126 Delaware 1420
    127 Richmond 1420
    128 Santa Barbara 1420
    129 Louisiana Tech 1415
    130 Oral Roberts 1415
    131 Fl Gulf 1410
    132 UTEP 1410
    133 NC State 1405
    134 Boise St 1405
    135 Furman 1405
    136 Providence 1405
    137 Milwaukee 1400
    138 Northeastern 1400
    139 St Francis 1400
    140 Pacific 1400
    141 Wright St 1395
    142 Marist 1395
    143 Wyoming 1395
    144 East Mich 1390
    145 Idaho St 1385
    146 Hartford 1385
    147 W Kentucky 1385
    148 Ill St 1385
    149 Brown 1380
    150 Mercer 1375
    151 No Colorado 1375
    152 Kennesaw 1375
    153 Montana 1375
    154 Drake 1370
    155 Mass 1370
    156 Portland St 1370
    157 St Joseph's 1365
    158 Oakland 1360
    159 St John's 1355
    160 Detroit 1350
    161 James Madison 1350
     
  23. kolabear

    The make-believe Elo ratings, even though they're make-believe, can help illustrate some general points and give us a little different picture of the field.

    An Elo system is not just a ranking but a rating. While a rating such as "1830" has no special meaning in and of itself, in relation to another team's rating (such as "1730") and given a definition of the scale that's being used, it has meaning - the team rated 1830 would be expected to have a win percentage of .667 against teams rated 1730.

    While teams may be ranked fairly closely, say 5 or 10 teams apart, depending on what part of the scale they're on, the measured difference in the strength of the two teams (the ratings) may be significantly greater or lesser.

    Going from #2 ranked Stanford (rating 2005) to #12 Pepperdine (rating 1850) is a 155 point rating difference - that corresponds to roughly a 75% expected win percentage for Stanford if they played on a neutral site.

    The difference, however, between #42 Kentucky (rating 1690) and #52 South Florida (rating 1625) is only a 65 point difference, corresponding to a little over .600 expected win pct for Kentucky.

    Some other general observations:
    The theoretical "bubble" is somewhere around 1650 to 1675. The median of the Division 1 schools is 1350. So a theoretical "bubble" team has a 300 point rating advantage over the median team, or about a .889 expected win percentage.

    The #8 school Duke has a rating of 1925. So an expected "elite eight" team has around a 250 rating advantage over a bubble team, or roughly an 85% expected win percentage. That's roughly about the norm in years past when Albyn Jones published his ratings.
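    The percentages quoted above come straight off the same inferred base-2 curve (again, the functional form is my inference from the published scale points):

        def expected_win_pct(rating_diff):
            return 1.0 / (1.0 + 2.0 ** (-rating_diff / 100.0))

        print(round(expected_win_pct(2005 - 1850), 2))  # Stanford vs Pepperdine -> 0.75
        print(round(expected_win_pct(1690 - 1625), 2))  # Kentucky vs So. Florida -> 0.61
        print(round(expected_win_pct(250), 2))          # elite eight vs bubble -> 0.85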
     
  24. cpthomas

    The Team Sheets for games through October 14 now are posted in the NCAA's RPI Archive, which is great. It was inadvertent that they were not there sooner.
     
  25. bigsoccerdad

    Great RPI discussion. Thanks. But I have a question, and I'm not sure if this was brought up earlier in this thread: will this year's RPIs be adjusted for the number of players out at the U20 World Cup for many of the top 20 teams? Any other team was smart to play these squads in the preseason with the hope of beating a weakened roster to accumulate RPI points. (Teams weakened: UNC, Duke, UCLA, Wake Forest, and Penn State, for example.) You almost want to throw out the first three weeks this year.
     
