On the stupidity of computerized ref ranking systems...

mrgifted · Nov 4, 2003

Here in IL, in our NFHS system, a computerized referee ranking system is used to determine who is selected for playoff matches. The rankings are done primarily by coaches, though high-ranking referees also have a password that allows them to rate referees they have seen work. The scale is anchored at 1-5, with 1 = excellent and 5 = failing.

The stupidity with any scaled response framework such as this is the inherent bias that such scales capture as users try to mentally separate consecutive numbers (e.g., is the ref a 3 or a 4?) or use the system without diligence (e.g., we played poorly, so the ref must have been a 3 that day).

As an academic interested in cognitive scaling, and a referee who falls victim to it from time to time, I conducted an informal survey of coaches using the system to assess behavioral patterns. Many used a reasonable rubric for assessing referees. However, a single response to my questioning was disturbing, and worth sharing here.

In Illinois, there is a single loaded factor that accounts for a disturbingly large share of the likelihood that a ref's score will be lowered radically: DOES THE REFEREE WHISTLE THE INADVERTENT HANDLING?

That is right, ladies and gents. Whether or not the ref makes the incorrect handball call was a major factor in deciding how the ref would be rated on the game. Now, this fact alone was not too disturbing. However, the element that is truly upsetting is that coaches were much more likely to rank a ref as a 4 or 5 if they CALLED HANDLING CORRECTLY. Only when the ref made the incorrect decision (i.e., whistled inadvertent handling) were they generally likely to achieve a 1 or 2 rating.

Why this long rant? Officials in my small-sample qualitative study (I will share the research methods with interested people) are effectively removed from playoff contention unless they are inclined to misapply one of the most contentious myths of the game. An overall rating numerically greater (i.e., worse) than 2.5 or so seems to be the inflection point between postseason action and sitting at home. A rating on the wrong side of that line was significantly more likely (alpha = .10) when the official made the correct calls in this area.
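To make the cutoff arithmetic concrete, here is a minimal sketch. The 2.5 threshold comes from the estimate above; the function name and both sets of ratings are hypothetical, invented purely for illustration, and are not data from the survey.

```python
from statistics import mean

# Assumed from the post: a mean "greater than 2.5 or so" ends playoff hopes.
CUTOFF = 2.5

def playoff_eligible(ratings, cutoff=CUTOFF):
    """Scale is 1 = excellent, 5 = failing, so a LOWER mean is better."""
    return mean(ratings) <= cutoff

# Hypothetical ratings for the same referee over six matches.
# In the second list, three coaches punish a correct no-call on handling.
whistles_the_myth = [2, 2, 1, 2, 3, 2]
calls_it_correctly = [2, 4, 1, 5, 3, 4]

for label, ratings in [("whistles the myth", whistles_the_myth),
                       ("calls it correctly", calls_it_correctly)]:
    print(label, round(mean(ratings), 2), playoff_eligible(ratings))
```

With these made-up numbers, three harsh ratings out of six are enough to move the mean from one side of the cutoff to the other.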

This is only a preliminary finding of this research, which is ongoing, and the next stage will be a more comprehensive assessment of the effectiveness of the system using a larger sample. But the current findings suggest that coach evaluations are less valid (scientifically) than may initially meet the eye. More results to follow...

MrG

rcleopard · Nov 4, 2003

So long as it is used uniformly for all referees (and by uniformly I mean there are no other metrics that go into the calculation), then I don't see a problem with it. Coaches can score refs, assessors can score refs, and league officials can score refs. All refs are judged through this system, and therefore any latent bias would be factored amongst all refs.

I hate to break it to you... but this system would be just as "evil" and "biased" were it on paper, computer, abacus, etc.

Jarrod

whipple · Nov 4, 2003

Originally posted by rcleopard
So long as it is used uniformly for all referees (and by uniformly I mean there are no other metrics that go into the calculation), then I don't see a problem with it. Coaches can score refs, assessors can score refs, and league officials can score refs. All refs are judged through this system, and therefore any latent bias would be factored amongst all refs.

I hate to break it to you... but this system would be just as "evil" and "biased" were it on paper, computer, abacus, etc.

Jarrod

Actually, if the survey is as Mr. G. describes it, then it certainly would be flawed and prone to bias, and of little more value than a popularity contest. There is no control. You have a non-homogeneous mix of subjective observations of diverse occurrences.

You are correct in that this has nothing to do with computers. It would be invalid no matter how you processed the data. Rather, you have both observational errors and design errors. Such errors do not "factor out," regardless of how the survey is fielded or how large the sample is. In fact, the more you increase your base, the more unreliable the findings become.
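The point that systematic error does not "factor out" with more responses can be sketched with a small simulation. Everything here is an assumption made for illustration: the function name, a referee whose "true" rating is 2.0, random rater noise, and a hypothetical half of the coaches who add a 2-point penalty whenever handling is called correctly. None of it comes from Mr. G.'s data.

```python
import random
from statistics import mean

def simulate_mean_rating(n_ratings, penalty, biased_share, seed=0):
    """Mean rating for one referee whose assumed true quality is 2.0.

    Each rating carries random noise (which averages out) and, for a
    fraction `biased_share` of raters, a fixed `penalty` (which does not).
    """
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_ratings):
        score = rng.gauss(2.0, 0.5)          # random observational noise
        if rng.random() < biased_share:      # this rater applies the penalty
            score += penalty
        ratings.append(min(5.0, max(1.0, score)))  # clamp to the 1-5 scale
    return mean(ratings)

for n in (10, 1_000, 100_000):
    fair = simulate_mean_rating(n, penalty=0.0, biased_share=0.5)
    skewed = simulate_mean_rating(n, penalty=2.0, biased_share=0.5)
    print(n, round(fair, 2), round(skewed, 2))
```

Under these assumptions, a bigger sample only tightens the estimate around the skewed value: the noise shrinks, but the gap between the fair and skewed means stays at about one point.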

Sounds like a classic GIGO.

Sherman

propes · Nov 5, 2003

"What If Refs Evaluated Coaches?"

http://www.refblog.com/pivot/entry.php?uid=standard-75#body

Sounds like a similar situation.