Comments and forum for a non-linear regression on the Crew (SAS)
ChrisE
30 Apr 2004, 02:03 PM
PEOPLE! PLEASE? We're all here for the love of numbers, and horribly esoteric analysis of them! Don't fight! We're all in this together!!! Lol you guys are having a pretty heated disagreement and believe me when I say, I have absolutely no idea what you're talking about. This thread has turned extremely weird.... Kudos to the both of you though for mixing it up on a topic like this.
In order to try to break the tension here, as finals have clearly gotten to numerista and taylor (yet I remain blissfully unconcerned with my paper due in two hours), let me try a comically bad explanation of what they're discussing. Probably my huge errors will make these two quit caring about the pretty insignificant problem they're having.
The idea behind a regression, basically, is to use some variables to get an equation with which you can predict some other variable, in this case attendance. So, in a non-soccer context, you might want to be able to guess a person's weight from his height. So you do several statistical computations (mostly involving expected values, I think, but it's not important), and you get an equation. I did the height/weight thing for MLS players; the equation I get is y = -177.2 + 4.89x. So, if you've got a guy who's 60 inches tall, you get his expected weight as -177.2 + 4.89(60) = 116.2 lbs. This relationship is really strong (r=.796), so 63% (r^2) of the variation in weight is accounted for by height. If you've got a guy of average (for soccer) height, 71 inches, you'd expect him to weigh (4.89*71)-177.2 = 169.99 lbs. MLS players who are 5-11 have actually averaged 169.1 lbs.
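The arithmetic in the height/weight example can be checked with a few lines of Python. This is only a sketch: the intercept, slope and r are the figures quoted in the post, while the function and constant names are made up for illustration.

```python
# Fitted line quoted in the post: weight (lbs) = -177.2 + 4.89 * height (inches).
INTERCEPT = -177.2  # predicted weight at a height of 0 inches
SLOPE = 4.89        # extra pounds per extra inch of height
R = 0.796           # correlation quoted in the post

def predicted_weight(height_in: float) -> float:
    """Return the weight the fitted line predicts for a given height."""
    return INTERCEPT + SLOPE * height_in

for height in (60, 71):
    print(f"{height} in -> {predicted_weight(height):.2f} lbs")

# Share of the variation in weight accounted for by height.
print(f"r^2 = {R ** 2:.2f}")
```

Note that the intercept has to be -177.2 in both calculations to reproduce the 116.2 lbs and 169.99 lbs figures.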
What taylor is doing is trying to predict Columbus Crew attendance using way more variables. In the last case, -177.2 lbs was the intercept, it's sort of a base, and is going to be the same regardless of height. In the Crew's case, the intercept is -38411 people, but that's not really important. And I may be misunderstanding.
What is important are all the other variables; so, for price, taylor's model says that $1 in price change reduces attendance by 2296 fans. You can do this whichever direction and whichever amount you want, so we can interpret it also as saying that dropping prices a dollar would increase attendance by 2296 fans, and increasing prices by $.50 would lose the Crew 1148 fans/game. Likewise, he gets that televising a game on Spanish tv causes a mean drop of 1696 fans/game. However, numerista's problem with this is that the variance (so some games may have a loss of 3000, and some a loss of 300) is too high, so it's possible (though pretty unlikely) that this can be accounted for just by chance fluctuations in attendance; the industry standard for significance is a little bit below what taylor got for Spanish television, and I think it really bothers numerista that he decided to lower his standard for significance after he saw his numbers.
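The "whichever direction and whichever amount" reading of the price coefficient is just multiplication. A minimal sketch, using the -2296 figure quoted above (the names are hypothetical, not taken from taylor's SAS output):

```python
# Price coefficient quoted in the thread: fans gained or lost per $1 of price change.
PRICE_COEF = -2296

def attendance_change(price_change_dollars: float) -> float:
    """Predicted change in fans per game for a given change in ticket price."""
    return PRICE_COEF * price_change_dollars

print(attendance_change(1.00))   # raise prices by $1
print(attendance_change(-1.00))  # cut prices by $1
print(attendance_change(0.50))   # raise prices by 50 cents
```

This linear scaling is only as trustworthy as the model's assumption that the price effect is the same at every price level.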
In taylor's defense, if you expect that Spanish television is going to decrease attendance (as opposed to saying it will change it, but you don't know in which direction), spanish television becomes significant even by numerista's strict standards.
Just to explain taylor's results a little bit more:
Crew Stadium caused an average increase of 6459 fans.
Televising a game on Spanish tv causes (maybe) a drop of 1696 fans.
Televising a game in English doesn't affect attendance.
Your opponent has no effect on attendance.
The population of Columbus does not affect attendance.
Increasing prices by $1 reduces attendance by 2296, and vice versa.
Games played on weekends average 3146 more fans than games on weekdays.
This model accounts for 27% of attendance variance, so you're not going to get hugely precise results from it. Other factors - weather, Andrulis still being around, stuff nobody's thought of, simple random variation - have a huge effect on what the attendance for any individual game is. However, the more games you use this for, the closer the model's predicted mean should come to the actual attendance mean.
I hope I was clear and didn't make any huge mistakes. Additionally, I hope the statisticians don't mind a rank amateur moving in on their territory, and I definitely hope they'll correct everything I got wrong.
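Because the model is additive, the effects listed above can be stacked to compare two kinds of game. The sketch below assumes the stadium, Spanish TV and weekend variables are 0/1 dummies, which the thread does not confirm, and it only computes differences between games rather than attendance itself, since the full set of coefficients is not given here.

```python
# Effects quoted in the thread, in fans per game.
# Assumption: the first three are 0/1 dummy variables.
EFFECTS = {
    "crew_stadium": 6459,
    "spanish_tv": -1696,
    "weekend": 3146,
    "price_per_dollar": -2296,
}

def change_vs_baseline(crew_stadium=0, spanish_tv=0, weekend=0, price_change=0.0):
    """Predicted attendance difference from an otherwise identical baseline game."""
    return (
        EFFECTS["crew_stadium"] * crew_stadium
        + EFFECTS["spanish_tv"] * spanish_tv
        + EFFECTS["weekend"] * weekend
        + EFFECTS["price_per_dollar"] * price_change
    )

# A weekend game at Crew Stadium shown on Spanish TV, versus a weekday game
# at the old venue with no Spanish broadcast.
print(change_vs_baseline(crew_stadium=1, spanish_tv=1, weekend=1))

# A weekend game with tickets $1 more expensive than the baseline.
print(change_vs_baseline(weekend=1, price_change=1.0))
```

With only 27% of the variance explained, any single game can land far from these differences.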
taylor
30 Apr 2004, 02:11 PM
Three quick things: it's assuming a linear relationship; I never decided on a "standard", nor did I think a range between .1 and .05 was problematic; and I am no expert.
Thanks a lot for clearing some stuff up.
numerista
30 Apr 2004, 02:14 PM
Taylor, this isn't such a difficult point to understand, but you're throwing around so much attitude that you've missed it entirely. I've said it, and ur_land has said it, too -- regardless of what threshold you choose, you need to choose it before you run your analysis.
Note that the steps in this link have an order to them ...
http://rimarcik.com/navigator-en/hypotezy.htm
taylor
30 Apr 2004, 02:48 PM
Taylor, this isn't such a difficult point to understand, but you're throwing around so much attitude that you've missed it entirely. I've said it, and ur_land has said it, too -- regardless of what threshold you choose, you need to choose it before you run your analysis.
Note that the steps in this link have an order to them ...
http://rimarcik.com/navigator-en/hypotezy.htm
OMG.
Umm no disrespect, but I was expecting more than some Slovokian (I think) website when I said citation, e.g. D. Gujarati. Second, even using your own Slovokian website, the site itself says one can "conventionally" use a .05. I don't know how you define conventionally or what the word is in Slovokian, but the word implicitly means you CAN use other "expert standards", so again why such grief?
Also, for clarity, I was hoping you would respond to the yes no questions, just so we can be perfectly clear. If you imply that I am misleading people or am operating incorrectly, you need to back it up. As I said before, I am no expert, it is quite possible that I am missing something, but I still haven't heard anything substantive yet.
I am no longer clear if your problem is methodology or initial findings. If you feel I should have set a level before I started, that's entirely fine. I respectfully disagree. I am in a camp that believes in interpretation. I don't say this out of self-interest, merely because these are estimations and require a degree of tolerance when reading results, particularly underfit ones.
Perhaps it is due to my insufficiencies in this stuff, but I frankly don't understand your case. The findings are acceptable under your own website.
ur_land
30 Apr 2004, 04:26 PM
Likewise, he gets that televising a game on Spanish tv causes a mean drop of 1696 fans/game. However, numerista's problem with this is that the variance (so some games may have a loss of 3000, and some a loss of 300) is too high, so it's possible (though pretty unlikely) that this can be accounted for just by chance fluctuations in attendance; the industry standard for significance is a little bit below what taylor got for Spanish television, and I think it really bothers numerista that he decided to lower his standard for significance after he saw his numbers.
ChrisE,
This was a really good description of a regression. You described numerista's objection pretty well, but I'll go into it a little more.
I think Taylor and Numerista are arguing past each other about different points. I think Numerista knows and acknowledges that .05 is an arbitrary cutoff point that is the typical threshold for statistical significance in academia. Taylor, I don't think that Numerista cares if you use .05, .01, .07, or .54321. He just wants you to pick a significance level beforehand, because this is how the logic of statistical testing is supposed to work.
I think Taylor, as he says in his previous post, is operating more from the perspective of how people actually do stats. You run your test and see what you get. Most of the time people implicitly are using the .05 criterion when they do this. For this reason, I really don't think that you should say they are significant at the .07 level; just use .05 and call the spanish TV result marginally significant. Substantively, it takes nothing away from the impact of your results and it prevents red warning lights from going off in the heads of people like numerista.
For those that don't know much about statistics, this is a contentious issue because you can never prove something with 100% certainty when you use inferential statistical tests. You can be 95%, 99%, 99.9999999.....999% certain that your result isn't what's called a type-one error (or false positive), but there is still some doubt. Setting the significance level fixes your chance of a false positive. When the level is set at .05, then in cases where there is no meaningful relationship between the variables, there is a 5% (or 1 in 20) chance of getting a significant result purely through chance. However, for reasons that I won't go into here, this logic is only true if you set your level before you do your test. Which is why numerista is annoyed.
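A small simulation can illustrate the 1-in-20 figure. This is a sketch under stated assumptions, not anything from taylor's analysis: it draws samples where the true effect is exactly zero, runs a simple two-sided z-test with a known standard deviation, and counts how often the result comes out "significant" at a level chosen in advance. All names and sample sizes are made up for illustration.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(sample, sigma=1.0):
    """Two-sided p-value of a z-test for 'the true mean is zero' (sigma known)."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha, trials=5000, n=30, seed=0):
    """Share of no-effect experiments that are declared significant at level alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean really is 0
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / trials

# alpha is fixed before any data is looked at.
print(false_positive_rate(alpha=0.05))
```

The printed rate should come out close to 0.05. That guarantee is tied to fixing alpha first; if the threshold is moved after seeing each p-value, the long-run false positive rate is no longer the number you report.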
In the last case, -177.2 lbs was the intercept, it's sort of a base, and is going to be the same regardless of height. In the Crew's case, the intercept is -38411 people, but that's not really important. And I may be misunderstanding.
The intercept is what the value of the predicted variable would be when the predictors equal zero (it's just like the y-intercept when you graph a line: y=mx+b). So the predicted weight of an MLS player that was 0 inches tall would be -177.2 pounds. Likewise, in Taylor's data, the predicted attendance is -38411 when price is $0, when the population of Columbus is 0, and when the values of newstadium, spanish TV, english TV, opponent, and the weekend variable are all zero (are these dummy coded? what are the codes?). So, yeah, not very meaningful.
In taylor's defense, if you expect that Spanish television is going to decrease attendance (as opposed to saying it will change it, but you don't know in which direction), spanish television becomes significant even by numerista's strict standards.
Actually, if I understand Taylor correctly, these all already are one-tailed tests, which I think is a bigger sticking point than the p-value dispute. When you test a hypothesis, the test can either be two tailed (or non-directional, i.e., teams in white Nike jerseys have a different average ability than teams in green Nike jerseys) or the test can be one tailed (or directional, i.e., teams with white Nike jerseys have a higher average ability than teams in green Nike jerseys). If the test is one-tailed, your criteria are looser: for an effect in the predicted direction, the p-value is half what it would be for a two-tailed test, so a smaller test statistic reaches the same level of significance.
The caveat to doing one-tailed tests is that you need to have some really really good a priori (before the fact) hypothesis to justify doing a one-tailed test. I'm assuming Taylor has those, even though he didn't express them, and if he was actually going to publish this, reviewers would expect him to justify the use of a one-tailed test with those hypotheses. YMMV, but in some academic fields (like social psychology and neuroscience, which are my fields) one-tailed tests are generally frowned upon. Not to say that a one-tailed test was wrong here--it's just that I think Taylor should have said up front that these tests were one tailed and then given his justifications for doing so.
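To see how much looser a one-tailed test is, here is a rough sketch using the standard normal distribution as a stand-in for the t distribution a regression would actually use (a reasonable approximation only with plenty of observations). The z value of 1.8 is a hypothetical test statistic, not a number from Taylor's output.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_values(z):
    """One- and two-tailed p-values for a statistic in the hypothesized direction."""
    tail = 1 - STD_NORMAL.cdf(abs(z))
    return {"one_tailed": tail, "two_tailed": 2 * tail}

# Critical values at the .05 level.
crit_one_tailed = STD_NORMAL.inv_cdf(1 - 0.05)      # about 1.645
crit_two_tailed = STD_NORMAL.inv_cdf(1 - 0.05 / 2)  # about 1.960

print(f"one-tailed cutoff: {crit_one_tailed:.3f}")
print(f"two-tailed cutoff: {crit_two_tailed:.3f}")

# A hypothetical statistic that falls between the two cutoffs.
result = p_values(1.8)
print(f"one-tailed p = {result['one_tailed']:.3f}")
print(f"two-tailed p = {result['two_tailed']:.3f}")
```

A statistic between the two cutoffs is significant at .05 one-tailed but not two-tailed, which is why the choice of tails has to be justified up front.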
Hope this helps clear things up a bit.....
mpruitt
30 Apr 2004, 04:40 PM
Thanks for the clarification guys. Good luck with your stuff taylor. Seems like you've taken on a pretty big task here and done a commendable job. We all should be open to aggressive peer review with whatever we do on here, I'm just getting a kick out of this because it's so relatively obscure. Finally we see the type of debate on this forum that's normally reserved for promotion/relegation, Chivas USA, and whether Chris Armas sucks on other parts of bigsoccer!
numerista
30 Apr 2004, 04:44 PM
For this reason, I really don't think that you should say they are significant at the .07 level; just use .05 and call the spanish TV result marginally significant.
Very articulate post, ur_land ... I've singled this out because it's pretty darn close to my initial comment, which didn't come with the elaborate explanation.
My other primary reason for being concerned about Taylor's results is that there are known important factors (e.g. seasonality, special events) that are easily measured and are known to impact attendance but have not been taken into account. As such, some of our "effects" may be due to bogus correlations. (Ironically, this is similar to your objection to ChrisE's offside study, although that one didn't come accompanied by such grandiose claims.)
ur_land
30 Apr 2004, 05:01 PM
My other primary reason for being concerned about Taylor's results is that there are known important factors (e.g. seasonality, special events) that are easily measured and are known to impact attendance but have not been taken into account. As such, some of our "effects" may be due to bogus correlations. (Ironically, this is similar to your objection to ChrisE's offside study, although that one didn't come accompanied by such grandiose claims.)
Thanks for the props--it's always nice to be called articulate.
The spurious correlations issue is an important one, and it would be nice for taylor to go back and try to code for special events and other issues (by seasonality, do you mean spring/summer/fall or do you mean changes from season 1 to season 10?). Then again, his model explains 27% of the variance--perhaps the rest of the variance is eaten up by $1 brat night and USMNT doubleheaders.
However, despite all of the gripes that we have, I do want to say thank you to Taylor for doing this and posting it publicly. None of my criticisms are made meanspiritedly, and I think it's great that we're starting to get a group of people interested in applying statistical analyses to soccer. Now if I can just figure out a way to get paid for doing this (anyone in MLS need an ABD grad student that will have his sheepskin by the end of the summer?).
taylor
30 Apr 2004, 05:22 PM
thanks Ur_land and co.
First, I was only using a one tailed test on spantv because of all the informal info on the negative impact of tv on fan attendance. The problem with spantv, imo, is the se. But if you want to use a two tailed test at .05, then sure it is marginal. I again am completely open to interpretations. Also, assuming the spantv to be significant, (at whatever level you want) the se is so large, relative to the mean, that it becomes "marginal".
Second, if I have not clearly stated that the model was underfit and that some variables were obviously missing, I apologize. Although I do take issue with people misrepresenting information and statements, I will blissfully accept my own imperfections to say I don't know a lot. I only have a year's worth.
Third, if you don't like the initial methodology that is perfectly fine. I am open to your interpretation, but respect that I am in a different camp. The EPA, OMB, Moody, and academia use different levels depending on the study. They define the industry so I feel quite comfortable using a different level than .05 (and INDEED, they use several levels in their analysis). I believe risk should be assessed looking at the entire model, not necessarily before one starts the model.
Finally, you need to realize that I have a time and information constraint. I tried contacting the Crew several times, they never called back. I've told people that if they have the desire to get more data, I will gladly cite them and incorporate it into the model. I'm not getting paid for this and I have no intention to publish anything in stats (I'm a political scientist at heart). I do however hope to pass this semester.
So thanks again Chris and Ur and Maxim.
numerista
30 Apr 2004, 05:42 PM
The spurious correlations issue is an important one, and it would be nice for taylor to go back and try to code for special events and other issues (by seasonality, do you mean spring/summer/fall or do you mean changes from season 1 to season 10?).
I meant spring/summer/fall (from Kenn's data, it looks like conceivably a quadratic time-of-year variable) ... I expect that the true underlying cause for this is that people are away on vacation in the summer. But you're also right to imply that the age of the league is something else that Kenn has shown to be very important ... in particular, the year-one novelty effect.
numerista
30 Apr 2004, 05:48 PM
Although I do take issue with people misrepresenting information and statements.
Sorry to the rest of the forum for rising to this, but I've been reasonably patient through Taylor's "expert" sarcasm, and his/her paranoid suggestions of my running a Slovokian [sic] website.
Now Taylor, please indicate where anyone on this thread apart from you has misrepresented information or statements. Otherwise, be a grown-up and apologize.
ChrisE
30 Apr 2004, 05:59 PM
Sorry to the rest of the forum for rising to this, but I've been reasonably patient through Taylor's "expert" sarcasm, and his/her paranoid suggestions of my running a Slovokian [sic] website.
Now Taylor, please indicate where anyone on this thread apart from you has misrepresented information or statements. Otherwise, be a grown-up and apologize.
(Ironically, this is similar to your objection to ChrisE's offside study, although that one didn't come accompanied by such grandiose claims.)
First however, let me offer the caveat that I am operating with a 7% level of significance.
The first thing I would like to say is that the model is obviously underfit, so the results should be interpreted with a high degree of flexibility when thinking of the results.
I didn't think taylor's claims were grandiose - I thought he was being pretty realistic about how useful his study was. But that's not hugely important right now, either. I don't particularly care about laying blame here, but I think both of you have been more than a little too harsh towards the other party. There's little doubt in my mind that numerista is a fair-minded guy, and as far as I can tell, so is taylor - so, I hope, please, that we can put an end to this and move on.
Perhaps we should all just step away from the thread for a week or so.
taylor
30 Apr 2004, 06:23 PM
Sorry to the rest of the forum for rising to this, but I've been reasonably patient through Taylor's "expert" sarcasm, and his/her paranoid suggestions of my running a Slovokian [sic] website.
Now Taylor, please indicate where anyone on this thread apart from you has misrepresented information or statements. Otherwise, be a grown-up and apologize.
This is trolling out of control.
As the self proclaimed expert, you have remained conspicuously reticent on all my points of question (i.e. the yes no questions).
I also expressed that I am "he".
Examples of misrepresentation are:
1)You wrote "Then he/she changed the significance cut-off from 0.05 to 0.07, just so that he/she could claim that the result was below the cutoff." By using the words "just" and "claim" it implies I was misleading.
2) You wrote "It is incorrect -- a significance level is defined as the probability one would reject in the case where there is no signal. By claiming to be working with a significance level of .07, Taylor is claiming that he/she would not have rejected if a p-value had come out to be .0704."
Sorry dude, but I repeatedly said I was interpreting. BTW, signal??? what are you talking about?
3) you are claiming to be an "expert", by claiming "the industry only works at a .05" (which as thoroughly expressed does not exclusively do so).
4) The "expert sarcasm" is funny because I keep hoping that you provide something substantive for me even to discuss.
5) Where oh where in my "the model is underfit", "take with a grain of salt" and "these are only estimations" do you get grandiose?
6) You are unhappy with my methodology. To beat the dead horse, I am perfectly fine with that contention.
7) you have offered no substance to the discussion, other than people have to infer from you that you weren't happy with my methodology.
Furthermore, you have relied on other people to bail you out. As an industry "expert" I guess I expected more than a Slovokian homepage (btw I presume it is not yours) for an industry wide citation (incidentally, what industry are you talking about anyway?). Also, your own bloody webpage clearly indicated that .05 level was not exclusive.
Ok dude, I suggest we continue this through private pming. Because I am done with your facade.
Any other discussion with you will be held via pming.
numerista
30 Apr 2004, 06:35 PM
I'm not going to get drawn into this, but I will point out that Taylor's post contains at least one fabricated quote from me ... imo, that's beyond the standards of civilized debate.
ur_land
01 May 2004, 01:41 PM
http://www.rageboy.com/images/spam-happy-person1.jpg
Now that we've had our first large-scale feud, does that make us a legitimate bigsoccer.com sub-community? Makes me want to do an ethnography of our development.
No, in all seriousness, there's no need for anyone to get super snippy about this. I think both parties simply misunderstood what the other was trying to say, and, like all internet discourse, things got too heated too easily. Both Taylor and Numerista are good contributors to this board, and I'd hate for either one to withdraw from us over this.
mpruitt
01 May 2004, 01:44 PM
Hear, hear.
taylor
01 May 2004, 01:50 PM
Funny pic Ur :).
I kick my feet in the sand and apologize to Numerista.
Chris is right, finals make me pissy. I guess I still haven't learned proper bigsoccer decorum yet.
I'll keep y'all up to date.
thanks for everyone's contribs...
ps, on another topic, has anyone considered the Adu spike in attendance as simply a transfer, rather than increase in demand?
mpruitt
01 May 2004, 01:56 PM
I guess I still haven't learned proper bigsoccer decorum yet.
LOL taylor this is one of the funniest things I've ever seen posted on this board. Bigsoccer decorum most of the time isn't much of decorum at all. There's tons of flame wars and trolling about various standard topics, US/Mexico, Promotion/Relegation, Chivas USA and all sorts of player stuff. This forum around here though is usually more like the Big Soccer Chess Club. It's big of you to apologize or whatever but it's just too bad numerista and you got into it over something that's so really intellectualized and substantive beyond the normal Bigsoccer fray.
ChrisE
01 May 2004, 02:23 PM
Funny pic Ur :).
I kick my feet in the sand and apologize to Numerista.
Chris is right, finals make me pissy. I guess I still haven't learned proper bigsoccer decorum yet.
I'll keep y'all up to date.
thanks for everyone's contribs...
ps, on another topic, has anyone considered the Adu spike in attendance as simply a transfer, rather than increase in demand?
I agree with maxim, man, that's respectable.
As for the Adu attendance thing, as it happens, your mortal enemy suggested it in this here thread:
http://www.bigsoccer.com/forum/showthread.php?t=105878
:shock:
numerista
02 May 2004, 09:26 AM
Thanks for the classy post, Taylor ... no hard feelings here, and I'm sorry that we both got a bit out of hand.