Comments and forum for a non-linear regression on the Crew (SAS)
mellon002
02 May 2004, 12:02 PM
I'm glad to see that you guys have worked this out. Let's let this be a lesson that this WILL NOT repeat itself in this forum. I post over in the DC board a lot and I've seen what lingering issues between posters can do to a board. This needed to be worked out. I'm glad you settled it but this will not be permitted again.
If you've got issues I can set up a thread in the Non-Soccer Forum and you can take it there next time.
I guess we have made it as a forum now that we've had our first feud.
End of discussion, let's talk soccer and stats.
taylor
04 May 2004, 10:10 PM
Yes, as I said in my PM to Numerista, I look silly saying I want collaboration and then becoming confrontational with people.
Well, I have waited until the last minute and will be doing the project tonight. I thought I would offer some results.
I am just putting the results up. Due to lack of time right now, I won't go into interpretation.
Here is the model
proc syslin 3sls;
endogenous fans price;
instruments end price pop opp engtv spantv newsta;
model price=newsta pop;
model fans=end price opp engtv spantv newsta;
run;
the results follow
The SAS System 20:22 Tuesday, May 4, 2004 27
The SYSLIN Procedure
Two-Stage Least Squares Estimation
Model price
Dependent Variable price
Analysis of Variance
                           Sum of         Mean
Source            DF      Squares       Square    F Value    Pr > F
Model              2     395.1030     197.5515     395.40    <.0001
Error             90     44.96627     0.499625
Corrected Total   92     440.0692
Root MSE             0.70684    R-Square    0.89782
Dependent Mean      13.75591    Adj R-Sq    0.89555
Coeff Var            5.13846
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|
Intercept      1      -36.0988      5.293780      -6.82      <.0001
newsta         1      1.516913      0.301529       5.03      <.0001
pop            1      0.047164      0.005204       9.06      <.0001
Model fans
Dependent Variable fans
Analysis of Variance
                           Sum of         Mean
Source            DF      Squares       Square    F Value    Pr > F
Model              6     3.1174E8     51956154       3.92    0.0016
Error             86     1.1386E9     13239808
Corrected Total   92     1.4504E9
Root MSE          3638.65467    R-Square    0.21494
Dependent Mean    16048.7097    Adj R-Sq    0.16017
Coeff Var           22.67257
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|
Intercept      1      30201.76      4944.333       6.11      <.0001
end            1      2138.942      938.2714       2.28      0.0251
price          1      -1391.76      399.7090      -3.48      0.0008
opp            1      974.2846      823.6166       1.18      0.2401
engtv          1      -471.722      1018.182      -0.46      0.6443
spantv         1      -1525.71      1104.740      -1.38      0.1708
newsta         1      6828.513      1723.334       3.96      0.0002
Three-Stage Least Squares Estimation
Cross Model Covariance
               price           fans
price          0.500       -443.171
fans        -443.171       13239808
Cross Model Correlation
               price           fans
price        1.00000       -0.17231
fans        -0.17231        1.00000
Cross Model Inverse Correlation
               price           fans
price        1.03060        0.17758
fans         0.17758        1.03060
Cross Model Inverse Covariance
               price           fans
price        2.06274       0.000069
fans         0.00007       0.000000
System Weighted MSE 0.9915
Degrees of freedom 176
System Weighted R-Square 0.8289
Model price
Dependent Variable price
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|
Intercept      1      -36.9066      5.252789      -7.03      <.0001
newsta         1      1.476706      0.299748       4.93      <.0001
pop            1      0.047958      0.005164       9.29      <.0001
Model fans
Dependent Variable fans
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|
Intercept      1      25009.79      4904.963       5.10      <.0001
end            1      2107.006      924.2440       2.28      0.0251
price          1      -946.952      396.5991      -2.39      0.0191
opp            1      885.7403      811.3537       1.09      0.2780
engtv          1      -512.898      1002.962      -0.51      0.6104
spantv         1      -1636.12      1088.282      -1.50      0.1364
newsta         1      5075.810      1712.470       2.96      0.0039
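For anyone following along who hasn't met two-stage least squares, here is a minimal Python sketch of the idea behind the first two stages. It is only an illustration on synthetic data (nothing here comes from the Crew data set), the function names are made up for this example, and it is not a reimplementation of PROC SYSLIN. The roles are assumed to be roughly: z plays the part of an instrument such as pop, x an endogenous regressor such as price, and y the outcome such as fans.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """Slope of a one-regressor least-squares fit of y on x."""
    return cov(x, y) / cov(x, x)


def two_stage_slope(z, x, y):
    """2SLS with a single instrument z: fit x on z, then fit y on the fitted x."""
    b1 = ols_slope(z, x)
    a1 = mean(x) - b1 * mean(z)
    x_hat = [a1 + b1 * zi for zi in z]   # stage 1: the part of x explained by z
    return ols_slope(x_hat, y)           # stage 2: regress y on that part only


# Synthetic data (assumption for illustration): an unobserved shock u moves
# both x and y, which is what makes plain OLS misleading here.
rng = random.Random(0)
n = 5000
true_slope = -2.0
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_slope * xi + 3 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

ols = ols_slope(x, y)
iv = two_stage_slope(z, x, y)
print(f"true slope {true_slope:.2f}, OLS {ols:.2f}, 2SLS {iv:.2f}")
```

In this made-up setup the OLS slope is pulled away from the true value by the shared shock, while the two-stage estimate lands much closer. The third stage (3SLS) additionally uses the cross-model covariance of the residuals, which this sketch does not attempt.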
taylor
04 May 2004, 10:45 PM
Here is the PROC REG output, without adjusting for autocorrelation.
(The tolerance, a multicollinearity measure, is in the right-hand column.) As a reminder, end indicates whether the game was on a weekend.
The SAS System 20:22 Tuesday, May 4, 2004 41
The REG Procedure
Model: MODEL1
Dependent Variable: fans
Analysis of Variance
                           Sum of         Mean
Source            DF      Squares       Square    F Value    Pr > F
Model              7    384059015     54865574       4.37    0.0003
Error             85   1066301384     12544722
Corrected Total   92   1450360399
Root MSE          3541.85293    R-Square     0.2648
Dependent Mean         16049    Adj R-Sq     0.2043
Coeff Var           22.06939
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|    Tolerance
Intercept      1        -47802         32842      -1.46      0.1492            .
end            1    2073.26260     913.71941       2.27      0.0258      0.95712
price          1   -2301.18219     542.98725      -4.24      <.0001      0.09669
pop            1      87.48578      36.43615       2.40      0.0185      0.12120
opp            1     792.18416     805.28461       0.98      0.3280      0.93603
engtv          1    -556.40559     991.72149      -0.56      0.5762      0.91813
spantv         1   -1752.78337    1079.50091      -1.62      0.1081      0.90520
newsta         1    5917.73604    1719.83890       3.44      0.0009      0.18261
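For reading the Tolerance column: tolerance for a regressor is 1 minus the R-square you get from regressing it on the other regressors, and the variance inflation factor (VIF) is its reciprocal. A small Python sketch of that conversion, using the tolerance values from the table above, might look like this. The 0.10 cutoff is just a common rule of thumb that I am assuming for illustration, not something SAS applies.

```python
# Tolerance values copied from the PROC REG table above.
tolerance = {
    "end": 0.95712,
    "price": 0.09669,
    "pop": 0.12120,
    "opp": 0.93603,
    "engtv": 0.91813,
    "spantv": 0.90520,
    "newsta": 0.18261,
}

# Assumed rule of thumb: tolerance below 0.10 (VIF above 10) deserves a look.
TOLERANCE_FLAG = 0.10


def vif(tol):
    """Variance inflation factor is the reciprocal of the tolerance."""
    return 1.0 / tol


vifs = {name: vif(tol) for name, tol in tolerance.items()}
flagged = [name for name, tol in tolerance.items() if tol < TOLERANCE_FLAG]

for name, value in vifs.items():
    marker = "  <-- low tolerance" if name in flagged else ""
    print(f"{name:8s} tolerance {tolerance[name]:.5f}  VIF {value:6.2f}{marker}")
```

By that rule of thumb only price is flagged, with pop and newsta not far behind, which fits the fact that price was modelled above as a function of newsta and pop.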
taylor
05 May 2004, 12:10 AM
I'm too tired to look back, but did I mention this is for REGULAR SEASON demand ONLY?
taylor
05 May 2004, 02:53 AM
OK, I have been up for a long time (see my FFA thread), but here are some of the results. I had to lag the hell out of the price equation (lag 15) to make newsta significant and to dampen the autocorrelation (I can't remember whether I'm allowed to do that, though). I wouldn't put any stock in the price equation. I really wanted to get some more supply-side data, but the Crew were too difficult.
Ok, I really need to go to bed.
Here it is
Model fans
Dependent Variable fans
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|
Intercept      1      25009.79      4904.963       5.10      <.0001
end            1      2107.006      924.2440       2.28      0.0251
price          1      -946.952      396.5991      -2.39      0.0191
opp            1      885.7403      811.3537       1.09      0.2780
engtv          1      -512.898      1002.962      -0.51      0.6104
spantv         1      -1636.12      1088.282      -1.50      0.1364
newsta         1      5075.810      1712.470       2.96      0.0039
Dependent variable = price
Analysis of Variance
                           Sum of         Mean
Source            DF      Squares       Square    F Value    Pr > F
Model              2     395.1030     197.5515     395.40    <.0001
Error             90     44.96627     0.499625
Corrected Total   92     440.0692
Root MSE             0.70684    R-Square    0.89782
Dependent Mean      13.75591    Adj R-Sq    0.89555
Coeff Var            5.13846
Parameter Estimates
                     Parameter      Standard
Variable      DF      Estimate         Error    t Value    Pr > |t|
Intercept      1      -36.0988      5.293780      -6.82      <.0001
newsta         1      1.516913      0.301529       5.03      <.0001
pop            1      0.047164      0.005204       9.06      <.0001
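As a sanity check on how the summary numbers in the price table hang together, here is a short Python sketch that rebuilds the F value, R-square, adjusted R-square and root MSE from the sums of squares and degrees of freedom printed above. The formulas are the standard textbook ones; I am assuming, not asserting, that SAS arrives at its printed values the same way.

```python
import math

# Copied from the Analysis of Variance table for the price equation.
ss_model, df_model = 395.1030, 2
ss_error, df_error = 44.96627, 90
ss_total, df_total = 440.0692, 92

ms_model = ss_model / df_model            # mean square for the model
ms_error = ss_error / df_error            # mean square error
f_value = ms_model / ms_error             # F = MS(model) / MS(error)
r_square = ss_model / ss_total            # share of variation explained
adj_r_square = 1 - (1 - r_square) * df_total / df_error
root_mse = math.sqrt(ms_error)

print(f"F value   {f_value:.2f}")
print(f"R-Square  {r_square:.5f}")
print(f"Adj R-Sq  {adj_r_square:.5f}")
print(f"Root MSE  {root_mse:.5f}")
```

The recomputed figures agree with the printed table to the digits shown, so the table is at least internally consistent, whatever one thinks of the equation itself.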
Guinho
06 May 2004, 11:50 AM
Well, not to get too fussy about it, but numerista is just being boneheaded here. Not trolling, to be sure, but misguided, at least from where I sit.
But first, a disclaimer: taylor and I are buds, and if you are ever in Bloomington, one or the other or both of us will take you out for a beer and soccer.
Now, back to statistics.... Of course, you folks do realize that this is EXACTLY a reenactment of a pretty longstanding and classical debate in statistics, right?
First of all, the term "significant" is defined conventionally. 0.05 and 0.1 are frequently used, mostly because we have 10 fingers, and we like base ten, so these are chosen as cutoffs. HOWEVER, any p-value is a statement, given the assumptions of the model (especially how the errors are distributed), of the probability that differences at least as large as those observed could have been generated by chance variation (error) alone. Insofar as this is true, saying there's a 5% chance that chance could have caused the pattern versus a 7% chance is not hugely different, and it is (in my rather radical opinion) intellectually dishonest to claim that it is. Personally, I'd like to ban the term significant completely, and let people think about p values as the continuous variables they are.
Secondly, just about any model fit to real data will have violations of the assumptions, including those about the distribution of the errors. Now I imagine you are assuming a Gaussian distribution of errors, but if the "actual" unknown error distribution is somewhat different, then your calculated p-value is also off. So, here we are left in the situation of claiming that a calculated p-value of 0.069 is somehow a radically different thing than a calculated p-value of 0.049 EVEN THOUGH WE KNOW that calculated p-values are in effect themselves estimates based on our approximation of the 'true' error distribution with a Gaussian distribution. Pardon me while I express a little sadness that we'd get het up about such fine distinctions built on sand.
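To make that concrete, here is a rough Python sketch (standard library only) that turns the spantv estimate and standard error from the PROC REG output earlier in the thread into a two-sided p-value twice: once against Student's t with the 85 error degrees of freedom, and once against a plain normal curve. The numerical integration and the function names are my own choices for illustration; this is not how SAS computes it internally.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_t(t_stat, df, steps=2000):
    """Two-sided p-value under Student's t, via Simpson's rule on [0, |t|]."""
    upper = abs(t_stat)
    h = upper / steps                      # steps must be even for Simpson
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3           # P(0 < T < |t|)
    return 1.0 - 2.0 * central_half


def two_sided_p_normal(t_stat):
    """Two-sided p-value if the statistic is treated as standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# spantv row of the PROC REG output: estimate, standard error, error DF.
estimate, std_error, df = -1752.78337, 1079.50091, 85
t_stat = estimate / std_error

p_t = two_sided_p_t(t_stat, df)
p_normal = two_sided_p_normal(t_stat)
print(f"t = {t_stat:.2f}, p (Student's t) = {p_t:.4f}, p (normal) = {p_normal:.4f}")
```

The t-based figure should match the 0.1081 that SAS printed for spantv, and the normal-based one comes out a little smaller. Same data, same estimate, and the p-value already shifts in the third decimal just from swapping one reference distribution for a close cousin.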
Third (and Taylor nailed this one), since p-values are statements about the probability that something is the result of chance alone, we have to assess what kinds of probability will get our attention. 0.05 is pretty commonly used in some 'industries' (although in my field, ecological genetics, there is lively discussion about whether 0.05 or 0.1 should be the standard, given the kinds of data we work with and the nature of finding patterns in the field; it's a long story). However, depending on the context, you may want to act only if it's really clearly not just chance (so only a really low p-value will cut it with you), whereas in others you may take the bet on greater values. In this case, the Crew might decide that a 7% probability that it's due to chance isn't high enough for them to discount the possibility that televising in Spanish reduces the gate. Perhaps they'd want to act on that, or at least explore it further. This is the sort of consideration you lose when you insist on a hard interpretation of the term "significant."
Finally, someone in here made a statement about a "bogus" correlation. Data models are models of the data at hand. Nothing more. They are descriptions of the pattern in that particular data set. In that sense, no correlation is spurious. However, as we all know, correlation isn't causation, and as several of you suggested, this correlation between Spanish-language televising and attendance may in fact be driven by some other variable external to the model. That's always a possibility that requires consideration. So, it's not the correlation that's bogus, but the interpretation of it. O.k., I will freely admit I'm being annoyingly pedantic on this last point, but as you can tell, I am one of those annoying philosophical types who always keeps in mind what the numbers mean and what they do and don't tell you based on the conceptual constructs behind them.
O.k., that's my two cents, and I'm hoping no one gets too fussy, since smarter folks than us have tackled some of these issues and the ones underlying them and haven't come up with solid conclusions, so we shouldn't be too worried about disagreeing on some of these points.
See yah!
D.
taylor
06 May 2004, 06:29 PM
Thanks for taking the time to fully and properly explain some things Guinho.
Well, I am finished with the stats project and was not too happy about the final results. The results for the three-stage equation are trash, so I think for the time being I will ignore the supply side of the equation. I think we will have to settle for the ordinary least squares equation for attendance (the regression that started that big p-value debate).
Over the summer I would really like to run a better-fitting demand equation, so I will go back and include some suggested variables such as weather, whether McBride started, and temperature, and redefine the opponent variable.
I would be indebted if you all have any other suggestions or thoughts to contribute. Also, if any of you want to help collect data or collaborate on a project for a different team, it would be cool. Or if any of you have inside access to MLS people, I would appreciate any info they could provide.
Finally, I hope this thread can turn into a forum on regressions for soccer.
Thanks
Taylor
ur_land
07 May 2004, 12:47 PM
I'm not sure if "whether McBride started" would make a difference. Starting lineups are not usually announced before the game, right?
OTOH, maybe there would be an effect if McBride was injured, and people knew he wasn't going to play......might be worth checking out in any case. Have fun with these analyses, and thanks for your contributions.
numerista
07 May 2004, 06:39 PM
[QUOTE]Personally, I'd like to ban the term significant completely, and let people think about p values as the continuous variables they are.[/QUOTE]
Just saw this post ... here is my original recommendation again:
Rather than (inaccurately) claiming to be "working with a 7% significance level," you should really just state that the Spanish-language TV coefficient came out to be borderline significant.
Using qualifiers like "borderline" is a way of allowing people gray area to think about p-values. So Guinho appears to agree with my recommendation; however, he seems to be missing the point that if you are going to use the term "significant," it is technically incorrect to set your threshold after inspecting your results. In statistical circles, this is not a matter of debate.
Guinho
17 May 2004, 09:37 PM
[QUOTE]So Guinho appears to agree with my recommendation; however, he seems to be missing the point that if you are going to use the term "significant," it is technically incorrect to set your threshold after inspecting your results. In statistical circles, this is not a matter of debate.[/QUOTE]
Actually, technically, significant is whatever you decide it should be. Usually taken to be 0.05, but in some areas 0.1 is used. So, I'd have to say that from a technical standpoint, Taylor's right in saying "I've chosen a 0.07 significance level." On the other hand, in the interests of clear communication, it's probably best to stick with either 0.05 or 0.1, since that's what most of your readers will be thinking. So, from a pragmatic standpoint I'll go with numerista, but I'd go technically with taylor.
Do I have a future in politics or what? I wonder if John Kerry will consider me as a running mate.... :)
Next up: my explanation of why we should sometimes end sentences with prepositions (no kidding!!)
I have to say I miss this kind of think in my current job....
G.
taylor
18 May 2004, 11:44 AM
[QUOTE]
I have to say I miss this kind of think in my current job....
G.[/QUOTE]
Well, we can talk all about it here. The only problem is that I am in Berlin drinking my beloved beer for the remainder of the week. After that, though, I will be doing macroeconomic policy work in Jena, but I am sure I can find a way to relate SAS econ dev to German soccer. Does anyone here think the EM 2004 would be an excellent economic development study??? Bueller Bueller Bueller?
In any event, I will cheers to the BS SAS brethren tonight.
Cheers,
Taylor
numerista
19 May 2004, 02:00 PM
[QUOTE]Actually, technically, significant is whatever you decide it should be.[/QUOTE]
I'll give it one more try.
1. It is valid to choose whatever significance cutoff you find useful. I have never suggested otherwise.
2. The significance level does have a precise mathematical formulation:
Pr(reject null hypothesis | null hypothesis is true)
A significance cut-off must answer the question, "What's the probability that I would have rejected the null hypothesis by mistake?" So if you claim a significance level of 0.07, you're actually claiming two things:
--any p-value below 0.07 makes it worth rejecting the null
--any p-value above 0.07 isn't strong enough
Taylor's choice seemed to be based on the first consideration, while ignoring the second one.
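A quick way to see why the order matters is to simulate it. The Python sketch below assumes the null hypothesis is true, so each p-value is a uniform draw between 0 and 1 (which holds for a continuous test statistic). It compares a cutoff fixed in advance at 0.05 with a hypothetical analyst who looks at the p-value first and then announces a "significance level" just above it whenever it falls under 0.10. The 0.10 ceiling and the function name are assumptions made purely for illustration.

```python
import random


def false_rejection_rates(n_trials=100_000, fixed_alpha=0.05, ceiling=0.10, seed=0):
    """Estimate Pr(reject null | null true) under two decision rules."""
    rng = random.Random(seed)
    fixed_rejections = 0
    post_hoc_rejections = 0
    for _ in range(n_trials):
        p = rng.random()            # p-value under a true null: Uniform(0, 1)
        if p < fixed_alpha:         # rule 1: cutoff chosen before seeing the data
            fixed_rejections += 1
        if p < ceiling:             # rule 2: cutoff set just above p after seeing it
            post_hoc_rejections += 1
    return fixed_rejections / n_trials, post_hoc_rejections / n_trials


fixed_rate, post_hoc_rate = false_rejection_rates()
print(f"cutoff fixed in advance:  {fixed_rate:.3f}")
print(f"cutoff chosen afterwards: {post_hoc_rate:.3f}")
```

The fixed cutoff rejects a true null about 5% of the time, as advertised. The after-the-fact rule rejects about 10% of the time no matter what level gets quoted in any single write-up, so the quoted level no longer equals Pr(reject null hypothesis | null hypothesis is true).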
taylor
20 May 2004, 03:54 AM
I seriously can not believe I spent a half a day during my finals period debating the p-value on bs.
Now that I am on vacation, I realize I must have been pretty anal to get so worked up about such a small and minute point in a regression that really has many more interesting questions than the p value.
Now back to the real stuff. Does anyone have some ideas, in terms of variables, for modeling?
taylor
26 May 2004, 02:13 PM
You all are going to piss your fukciung pants.... cause I just did.
I just finished talking with a prominent German soccer economics professor here at the institute in Germany, and he wants to write a paper with me on soccer. Specifically, the economic behaviour of Germany compared to, say... say... the US! HA. So I will talk to him about everything and see what specifically we should do, but I wanted to share the tentative good news with my BS blood brothers!
The academic MLS soccer maidenhood will soon be broken, gentlemen.
G. you would really enjoy it here.
Cheers guys.
Taylor.
mpruitt
26 May 2004, 02:15 PM
Hah that's great to hear the good news for you personally and I think you just had the best line ever in this forum.
mpruitt
17 Jun 2004, 02:34 PM
I've been poking around this hockey metric site a lot obviously and found this.
http://www.puckerings.com/research/attend.html
It's about a similar study to yours which was done for hockey.