
First or Last 15 Minutes Most Important?



the101er
09 Sep 2003, 03:32 PM
There is, according to the chi square test, no significance to the fact that more goals are scored at the end of games than at the beginning. Just looking at the data, this makes me wonder if I am running the test correctly, or running the wrong test.

Null hypothesis: Goals are scored evenly throughout every soccer match.

Then for 100 goals scored in 15 minute increments, you would expect 16.667 goals in each increment.

Normalizing the data for the EPL in 2002-2003 (quite easy, really, since exactly 1000 goals were scored), and assuming I am off by a goal or two, gives the following.
Normalized actual goals scored:
11.40, 12.80, 16.50, 18.00, 16.10, 25.20

Looking at that data you would probably say, "More goals are scored at the end of matches." But the Chi Square test doesn't reject the null hypothesis stated above, so we must assume that goals are scored evenly throughout the match.

Suggestions? What am I doing wrong?

Also, in looking for stats on the Internet you quickly find out who has them all: betting houses. And they're not going to give them up for free.

beineke
10 Sep 2003, 11:51 AM
Originally posted by the101er
Suggestions? What am I doing wrong?

You normalized the data before running the test.

The actual p-value is 0.00000000000008. There is no question of significance here.

But as I said above, this isn't a particularly interesting comparison.

A chi-square on the full 20-row table would answer the question of whether some teams start faster than others (under certain assumptions).
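The normalization point can be checked with a minimal sketch that runs the same goodness-of-fit test both ways. It assumes the raw counts are simply the normalized figures times ten (since exactly 1000 goals were reported), and `chi2_sf` is a helper written only for this illustration with the standard library; a statistics package would normally supply the p-value.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic (illustrative helper)."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite power series in x/2.
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc term plus a finite sum over half-integer gamma terms.
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total

def uniform_chi_square(counts):
    """Chi-square statistic against equal expected counts in every interval."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

normalized = [11.40, 12.80, 16.50, 18.00, 16.10, 25.20]  # goals per 100, as posted
raw = [round(x * 10) for x in normalized]                # assumed counts out of 1000

df = len(raw) - 1
stat_norm = uniform_chi_square(normalized)
stat_raw = uniform_chi_square(raw)
p_norm = chi2_sf(stat_norm, df)
p_raw = chi2_sf(stat_raw, df)

print(f"normalized: chi-square = {stat_norm:.2f}, p = {p_norm:.3f}")
print(f"raw counts: chi-square = {stat_raw:.2f}, p = {p_raw:.1e}")
```

The statistic scales with the number of observations, so dividing the counts by ten divides the chi-square by ten as well. The test has to be run on the counts themselves, not on percentages.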

superdave
10 Sep 2003, 12:19 PM
Originally posted by beineke
474 541 631 625 669 872

This suggests to me, strongly, that defender fatigue is a big factor in goals allowed. The only place where the trend doesn't hold is the only period in which the defenders are fresher (after halftime) than they were in the previous period.

the101er
10 Sep 2003, 03:10 PM
Good points and thanks for the information on P. I knew I was doing something wrong, and I think it was just simply interpreting "significance" backwards. In fact, the numbers are so far off the charts that they must come from a nonuniform population. So yes more goals are scored at the end of games. Not earth shattering, perhaps, but at least verified.

That's a great observation about the unexpected drop in goals immediately after halftime being due to recovery during halftime.

Are any of the statisticians here familiar with the Kolmogorov-Smirnov D Test? This seems like it might give better results than Chi Square, as this isn't a normal distribution. But to use it, I need 10 intervals (minimum) instead of just 6.

When I get a chance, I will run a chi square test on the whole table and post results.

Thanks again.

The Double
10 Sep 2003, 05:24 PM
Amazing thread 101.

I'm still trying to decide which 15 minutes are more important. The first 15 minutes can definitely set the tempo for either of the teams. But the last 15 minutes is what everyone is concerned about. It makes for great television ;)

Kevin in Louisiana
11 Sep 2003, 10:56 AM
superdave--interesting hypothesis about defender fatigue, although I don't know if we can make that much of a six-goal (approx. 1%) difference between minutes 31-45 and 45-60, even though there is a pretty sizable number of goals in the data set.

beineke--it would be nice to find data concerned with what you mention about the question of whether goals are meaningful in the last 15 minutes.

That leads me to a somewhat bigger question: what is the value of a goal scored in each 15-minute period (i.e. how much does a goal contribute to a team's chances of taking the maximum number of points from the game)? This was discussed in a link from the Sabermetric thread. But to use the method there you would need to know the score when each of those goals was scored, something that would be a pain to gather up over the MLS's lifetime. So could there be a way of deriving a statistically useful formula for the worth of a goal per 15-minute period without needing to know the score of the game at the time? I guess you could take some numbers and try to figure out an average of what the scoreline is at each point in the game and use that to derive a figure for the worth of each goal per 15-minute period. It would still be a lot better to have the actual figures, though.

superdave
11 Sep 2003, 11:38 AM
Have any of you guys played Championship Manager? There are a few basic game things that I thought those guys modeled incorrectly. For one thing, they just assumed hot streaks and cold streaks, which don't exist in baseball. So I kinda doubt they exist in soccer. And they assumed "clutch" play existed, and in fact was pretty decisive. And, again, it doesn't exist in baseball. So to me, it's vastly more likely that such a thing doesn't exist than that it exists in one sport and not another.

And this is something else I thought they incorrectly modeled. Putting in fresh attackers didn't have the effect in the game I thought it should have, from my experience as a soccer fan. And this sort of confirms my impression.
Originally posted by Kevin in Louisiana
superdave--interesting hypothesis about defender fatigue, although I don't know if we can make that much of a six-goal (approx. 1%) difference between minutes 31-45 and 45-60, even though there is a pretty sizable number of goals in the data set.
With all due respect, you're not looking at it right.

+67 +90 -6 +44 +203

If you look at it that way, the -6 looks especially strange. There's a relatively steady pattern for 3 of the "changes," and everyone knows why there are so many more goals in the last 15 minutes...teams are desperate to score and take crazy chances, which lead to equalizers on the one end, or the game being salted away on the other.

So we're left with that anomalous -6.

My hypothesis is that by the nature of the sport, fatigue affects defense more than attack. So in each block when the fatigue factor goes up, so do the number of goals scored, and somewhat steadily. In the one block when it goes down, the number of goals does, too.
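For anyone who wants to reproduce the period-to-period changes, here is a small sketch. The six counts are the ones in the quoted post; the variable names are just for illustration.

```python
# Goals per 15-minute period, as listed in the quoted post.
goals = [474, 541, 631, 625, 669, 872]

# Change from each period to the next one.
changes = [later - earlier for earlier, later in zip(goals, goals[1:])]

print(" ".join(f"{c:+d}" for c in changes))  # +67 +90 -6 +44 +203
```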

Kevin in Louisiana
11 Sep 2003, 01:40 PM
Yeah, it makes more sense that way. I hadn't really thought to look at it that way.

As for Championship Manager, your fatigue hypothesis definitely runs counter to CM since attackers tire out so much more than defenders (especially in CM 01/02, not so much in CM4). In CM 01/02 there was way too much fatigue for attackers and too little for defenders.

the101er
11 Sep 2003, 08:03 PM
Team                0-15min 15-30min 30-45min 45-60min 60-75min 75-90min    Total
Manchester United         6       14       14       11       11       18       74
Arsenal                  16       16        9       15       11       18       85
Newcastle                 6        6       14       12        7       18       63
Chelsea                   7        5       17       15        5       19       68
Liverpool                 7        5        5       13       17       14       61
Blackburn                 9        8        7       12        7        9       52
Everton                   4        9        8       11        6       10       48
Southampton               4        2       11        9        6       11       43
Manchester City           3       11        5        5        8       15       47
Tottenham                 9        2        9        6       12       13       51
Middlesboro               1        9       10       10        8       10       48
Charlton                  6        5       10       10        4       10       45
Birmingham                2        2        7        4        9       17       41
Fulham                    5        4       12        4        7        9       41
Leeds                    10        7        8       11        6       16       58
Aston Villa               6        4        8        6        7       11       42
Bolton                    8        2        2        9        6       14       41
West Ham                  1        8        4        8       12        9       42
West Brom A.              2        7        3        5        6        6       29
Sunderland                2        2        3        4        6        4       21
Total                   114      128      166      180      161      251     1000

Here is the raw data for the EPL 2002/2003. It probably has a few mistakes, which I would be glad to have someone check. As mentioned above I had some trouble keeping track of own goals.

I e-mailed a professor of statistics, and he gave me a contingency table chi square test. When I ran it on this data, there was no significance to the theory that some teams score more goals in different parts of the game. All of the difference fell within statistically acceptable limits of random chance.

I could change the data, or the way I interpret it, but I think that gets into just using statistics to get the answer you want. So, at this point, I have to say that even though some teams did score more goals at certain parts of the game than others, this was within the expected statistical limits of chance.

Also, and this is probably going to be much more controversial: the new, more accurate (I believe) test shows no significance for more goals being scored later in the game.

Here, though, I might argue that we have a reasonably large set of data points, and it might be better to look at this from the standpoint of constructing a best fit curve rather than assuming a uniform distribution of goals.

I'm still mulling a lot of this information. I primarily want to get the data out now, so others can look at it and possibly come up with better tests or hypotheses than I have been able to.
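For anyone who wants to rerun the contingency test on the table above, here is a minimal sketch of the textbook Pearson chi-square for a teams-by-periods table. It is an assumption that this matches the procedure the professor supplied; the variable names are illustrative, and the p-value would come from a chi-square table or a statistics package.

```python
# Goals scored per 15-minute period, EPL 2002/2003, copied from the table above.
table = {
    "Manchester United": [6, 14, 14, 11, 11, 18],
    "Arsenal": [16, 16, 9, 15, 11, 18],
    "Newcastle": [6, 6, 14, 12, 7, 18],
    "Chelsea": [7, 5, 17, 15, 5, 19],
    "Liverpool": [7, 5, 5, 13, 17, 14],
    "Blackburn": [9, 8, 7, 12, 7, 9],
    "Everton": [4, 9, 8, 11, 6, 10],
    "Southampton": [4, 2, 11, 9, 6, 11],
    "Manchester City": [3, 11, 5, 5, 8, 15],
    "Tottenham": [9, 2, 9, 6, 12, 13],
    "Middlesboro": [1, 9, 10, 10, 8, 10],
    "Charlton": [6, 5, 10, 10, 4, 10],
    "Birmingham": [2, 2, 7, 4, 9, 17],
    "Fulham": [5, 4, 12, 4, 7, 9],
    "Leeds": [10, 7, 8, 11, 6, 16],
    "Aston Villa": [6, 4, 8, 6, 7, 11],
    "Bolton": [8, 2, 2, 9, 6, 14],
    "West Ham": [1, 8, 4, 8, 12, 9],
    "West Brom A.": [2, 7, 3, 5, 6, 6],
    "Sunderland": [2, 2, 3, 4, 6, 4],
}

rows = list(table.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Pearson statistic: expected count = row total * column total / grand total.
statistic = 0.0
small_cells = 0
for row, row_total in zip(rows, row_totals):
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        statistic += (observed - expected) ** 2 / expected
        if expected < 5:
            small_cells += 1  # the chi-square approximation is rough for these cells

dof = (len(rows) - 1) * (len(col_totals) - 1)
print(f"chi-square = {statistic:.2f} on {dof} degrees of freedom")
print(f"cells with expected count below 5: {small_cells}")
```

Note that this asks whether teams differ from one another in when they score. It does not ask whether the league as a whole scores more late in games.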

Karl K
11 Sep 2003, 11:54 PM
Originally posted by the101er
I e-mailed a professor of statistics, and he gave me a contingency table chi square test. When I ran it on this data, there was no significance to the theory that some teams score more goals in different parts of the game. All of the difference fell within statistically acceptable limits of random chance.

You sound a little bummed about this result, but I for one am thankful that you went through this analysis, and did it so methodically and appropriately. We need more work like this to really get to the heart of what goes on in the game.

It is not surprising to me that the distribution of goals across incremental time periods is random. When you have a lot of one thing (goals) done by humans, occurring over a long period of time (an EPL season), randomness seems the most reasonable outcome.

Sometimes it's just as valuable to stick a fork in what turns out to be a wrongheaded idea as it is to confirm the unusual and "interesting" ideas.

You should write this up for a statistics journal, or a sports journal that publishes statistical studies.

beineke
12 Sep 2003, 10:40 AM
Originally posted by the101er

I e-mailed a professor of statistics, and he gave me a contingency table chi square test. When I ran it on this data, there was no significance to the theory that some teams score more goals in different parts of the game.

This is correct; however, I would advise you to stick to your original design and use three columns, not six. That would enable you to focus your statistical power on the beginning and end of the game, which you said was most interesting to you. (However, that won't change the result for this particular table.)


Also, and this is probably going to be much more controversial: the new, more accurate (I believe) test shows no significance for more goals being scored later in the game.

This is not correct. I'm not sure how you're drawing this conclusion, as it's not something you would test by looking at a 20 x 6 table.

In addition, you still haven't looked at the most interesting aspect of the data. It wasn't a big deal that Newcastle scored 6 goals early and 18 goals late. It was the fact that they were outscored early but that they overwhelmed their opponents (17-2) late in the game. By looking at goals scored but not goals allowed, you've only checked out one half of the equation.

beineke
12 Sep 2003, 11:05 AM
When people are skeptical about statistical significance of a pattern (fairly or not), it's always helpful when you can get more data.

I went back and tabulated Newcastle's goals for 2001-02. I chose them because they had the most striking pattern in 02-03, and because the 01-02 team was quite similar in many respects. It had the same manager, most of the same players, and finished in a similar position in the Premiership. Here's what I found:

Mins
1-15 Scored 5, Allowed 8
16-30 Scored 8, Allowed 13
31-45 Scored 12, Allowed 10
46-60 Scored 13, Allowed 9
61-75 Scored 17, Allowed 7
76-90 Scored 19, Allowed 5

This is remarkably similar to 02-03.
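The pattern is easiest to see as a goal margin per period. A quick sketch using the 2001-02 numbers above (variable names are illustrative):

```python
periods = ["1-15", "16-30", "31-45", "46-60", "61-75", "76-90"]
scored = [5, 8, 12, 13, 17, 19]   # Newcastle goals scored, 2001-02
allowed = [8, 13, 10, 9, 7, 5]    # Newcastle goals allowed, 2001-02

# Positive margin means Newcastle outscored opponents in that period.
margin = [s - a for s, a in zip(scored, allowed)]

for period, m in zip(periods, margin):
    print(f"{period:>5}  {m:+d}")
```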

the101er
12 Sep 2003, 12:43 PM
That's great. That seems like a strong indication that Newcastle is being coached to perform a certain way in temporal, rather than physical space.

I have the numbers for goals allowed in 02-03, so I will do some checking. And yes, I was wrong about the overall trend as you can't do a sample test on a whole population.

It should be worth looking at teams like Arsenal, Liverpool and Chelsea to see if their results continue to differ from the norm.

Here is another question: when 3 teams with continental coaches fall below the 0.10 probability, do they combine to make a case for a different coaching approach to the game?

the101er
12 Sep 2003, 01:01 PM
Here are the Newcastle results for goals allowed:

Chi squared: 12.96
DOF: 5
Probability: 0.027

This indicates statistically that Newcastle is doing something different from the rest of the league.

Here are all of the chi square and probability numbers I came up with for the league for 02-03 for goals scored.

Team              Chi square  Probability
Manchester United 3.63        0.604
Arsenal           9.4         0.094
Newcastle         3.27        0.659
Chelsea           8.68        0.122
Liverpool         9.3         0.098
Blackburn         4.41        0.492
Everton           3.12        0.682
Southampton       4.88        0.431
Manchester City   8.43        0.143
Tottenham         7.78        0.169
Middlesboro       6.12        0.295
Charlton          3.2         0.669
Birmingham        10.3        0.068
Fulham            6.11        0.295
Leeds             3.37        0.643
Aston Villa       1.17        0.948
Bolton            9.45        0.093
West Ham          9.84        0.08
West Brom A.      4.72        0.451
Sunderland        2.66        0.752
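The per-team figures appear consistent with testing each team's six counts against the league-wide share of goals in each period, with 5 degrees of freedom. That is an inference from the numbers, not something stated in the thread. A sketch under that assumption, with a hypothetical helper name, for three of the teams:

```python
league = [114, 128, 166, 180, 161, 251]  # league goals per period, EPL 2002/2003

teams = {
    "Manchester United": [6, 14, 14, 11, 11, 18],
    "Newcastle": [6, 6, 14, 12, 7, 18],
    "Aston Villa": [6, 4, 8, 6, 7, 11],
}
posted = {"Manchester United": 3.63, "Newcastle": 3.27, "Aston Villa": 1.17}

def chi_square_vs_profile(counts, profile):
    """Chi-square of one team's counts against the league's per-period shares."""
    total = sum(counts)
    profile_total = sum(profile)
    stat = 0.0
    for observed, share in zip(counts, profile):
        expected = total * share / profile_total
        stat += (observed - expected) ** 2 / expected
    return stat

computed = {name: chi_square_vs_profile(c, league) for name, c in teams.items()}
for name, stat in computed.items():
    print(f"{name:<18} computed {stat:.2f}   posted {posted[name]:.2f}")
```

Small differences from the posted values are to be expected if the underlying counts were off by a goal or two.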

Nutmeg
14 Sep 2003, 09:53 PM
TEAM      GP    G   1-15  16-30  31-45  46-60  61-75  76-90
Fire      24   42      6      3      9      6      9      8
Revs      24   40      5      3      8      8      6     10
Wizards   24   37      6      6      5      4      8      7
Quakes    24   35      2      7      7      8      2      9
Metros    23   33      1      4      5      5      7      9
Crew      24   32      2      3     10      7      4      6
United    23   31      7      2     10      1      0     11
Rapids    24   31      3      7      6      5      3      6
Galaxy    24   28      1      6      2      6      8      4
Burn      24   26      5      5      3      3      7      3
Totals        335     38     46     65     53     54     73



More numbers to chew on.

superdave
14 Sep 2003, 10:11 PM
That's the same pattern as in the EPL. Not as definitive, but I'll bet that's because of a much lower number of data points.

JG
04 Apr 2004, 12:42 AM
The enbltd MLS preview has each team's GF-GA for last season and all-time broken up by 15-minute segment.

http://www.enbltd.com/2004_mls_preview.pdf

ur_land
05 Apr 2004, 11:35 PM
Originally posted by the101er
Here are the Newcastle results for goals allowed:

Chi squared: 12.96
DOF: 5
Probability: 0.027

This indicates statistically that Newcastle is doing something different from the rest of the league.

Here are all of the chi square and probability numbers I came up with for the league for 02-03 for goals scored.

Team              Chi square  Probability
Manchester United 3.63        0.604
Arsenal           9.4         0.094
Newcastle         3.27        0.659
Chelsea           8.68        0.122
Liverpool         9.3         0.098
Blackburn         4.41        0.492
Everton           3.12        0.682
Southampton       4.88        0.431
Manchester City   8.43        0.143
Tottenham         7.78        0.169
Middlesboro       6.12        0.295
Charlton          3.2         0.669
Birmingham        10.3        0.068
Fulham            6.11        0.295
Leeds             3.37        0.643
Aston Villa       1.17        0.948
Bolton            9.45        0.093
West Ham          9.84        0.08
West Brom A.      4.72        0.451
Sunderland        2.66        0.752


I think this is a good place to talk about controlling for multiple comparisons. When you use the p<.05 criterion, you are accepting that you will get a false positive 1 out of every 20 times [that's what the .05 means--your chance of finding a significant result when the null hypothesis (in this case that there is no difference in goals allowed over the course of the match) is actually true]. When you do lots of tests (say one for each of the EPL teams) you should get, just by chance, about one significant result on average. Therefore, it is good practice to correct for multiple comparisons. The one I see used most often is the Bonferroni Correction. To use this you simply divide your alpha by the number of tests. The result becomes the new significance level. So, in the data above, the new alpha should be .05/20=.0025
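Here is a short sketch of the correction applied to the posted numbers. The p-values are copied from the goals-scored table above, and the Newcastle goals-allowed value (0.027) is treated, purely for illustration, as if it were one of 20 such tests.

```python
# Posted p-values for goals scored, one test per EPL team (2002/2003).
p_values = {
    "Manchester United": 0.604, "Arsenal": 0.094, "Newcastle": 0.659,
    "Chelsea": 0.122, "Liverpool": 0.098, "Blackburn": 0.492,
    "Everton": 0.682, "Southampton": 0.431, "Manchester City": 0.143,
    "Tottenham": 0.169, "Middlesboro": 0.295, "Charlton": 0.669,
    "Birmingham": 0.068, "Fulham": 0.295, "Leeds": 0.643,
    "Aston Villa": 0.948, "Bolton": 0.093, "West Ham": 0.08,
    "West Brom A.": 0.451, "Sunderland": 0.752,
}

alpha = 0.05
alpha_corrected = alpha / len(p_values)  # Bonferroni: divide alpha by the number of tests

survivors = [team for team, p in p_values.items() if p < alpha_corrected]
print(f"corrected alpha = {alpha_corrected:.4f}")
print(f"teams significant after correction: {survivors}")

# Newcastle's goals-allowed test: below .05 on its own, but not after correction.
newcastle_allowed_p = 0.027
print(newcastle_allowed_p < alpha, newcastle_allowed_p < alpha_corrected)
```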

As you can see, this is pretty severe. An alternate strategy is to analyze all of the teams together (like you did in the contingency table). You could also try this in a multiple regression framework. You would have to transform your data (probably using a natural log transformation or inverse transformation) because counts, by their nature, don't have a normal distribution, and that messes up some inferential stats.

This is where things get cool. Using a multiple regression approach allows you to use neat things called contrast codes. This isn't the place to fully explain the logic behind contrast codes (see Cohen, Cohen, Aiken & West, 2002 or Judd & McClelland, 1989), but what they allow you to do is meaningfully compare different groups of data. For example, if you have data for three time periods (T1, T2, T3), contrast codes can tell you if there are linear (does it get bigger or smaller from beginning to end?) or quadratic (does it go up and then down, or vice versa?) trends in your data. I.e., this can tell you how your data changes over time.

We have six time periods in this data, which means we can examine five trends (think of trig and power curves--i.e. X^2, X^3, X^4--when thinking about these):

1) Linear: Do goals scored increase from the beginning of the game to the end? (or vice versa)

2) Quadratic: Do goals scored start out low, then get high near half time, then get low again near the end of the game?

3) Cubic: Do goals scored start out low, go higher, then go lower, then go higher? (I think this is the best one to test superdave's idea that halftime makes a difference)

The last two are probably too complex to worry about, but for the analysis to make sense and work properly, you need a full set of orthogonal contrast codes.

4) Quintic: Do goals scored start out low, go higher, then go lower, then go higher, then end lower?

5) Quartic: Do goals scored start out low, then get higher, then go lower, then go higher, then get lower, then end higher?

You could do this for GS or GA data separately or analyze both together in the context of a within-subjects analysis.

The only problem with this strategy is that five contrast codes eat up a lot of degrees of freedom (meaning you need a pretty large N to get good results). With only 20 teams, there might not be enough statistical power to find a significant result.
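To make the idea concrete, here is a sketch using the standard tabulated orthogonal polynomial codes for six equally spaced levels, with the conventional naming (quartic is 4th order, quintic is 5th). Applying them to the EPL league totals posted earlier in the thread is only an illustration, not the full regression.

```python
# Standard orthogonal polynomial contrast codes for six equally spaced periods.
contrasts = {
    "linear":    [-5, -3, -1, 1, 3, 5],
    "quadratic": [5, -1, -4, -4, -1, 5],
    "cubic":     [-5, 7, 4, -4, -7, 5],
    "quartic":   [1, -3, 2, 2, -3, 1],
    "quintic":   [-1, 5, -10, 10, -5, 1],
}

def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

names = list(contrasts)

# Each code sums to zero, and every pair of codes has a zero dot product.
sums = {name: sum(contrasts[name]) for name in names}
cross = {
    (a, b): dot(contrasts[a], contrasts[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

# Contrast scores for the league-wide EPL 2002/2003 totals per period.
epl_totals = [114, 128, 166, 180, 161, 251]
scores = {name: dot(contrasts[name], epl_totals) for name in names}

for name in names:
    print(f"{name:<10} score on league totals: {scores[name]:+d}")
```

A large positive linear score means goals rise from the first period to the last. The higher-order scores pick up bends in the curve that a straight line misses.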

I'm down in my research group's sub-subbasement lab right now, but as soon as I'm done cleaning up from collecting data, I'll try to do a quick analysis of this.

And if anyone had trouble following the terms or the explanation, please let me know and I'll try to explain better.

ur_land
06 Apr 2004, 01:06 AM
Just noticed a mistake in the post right above this. Please switch the terms quartic and quintic (quartic is really the 4th order and quintic the 5th order).

Ok--here are the results for the gf data:

linear F=48.31 p<.0001

quadratic F=0.04 p<.71

cubic F=1.39 p<.26

quartic F=2.92 p<.11

quintic F=4.62 p<.05

And for reference, here are the means:

0-15=5.70 (sd=3.66)
16-30=6.40 (sd=4.01)
30-45=8.30 (sd=4.03)
46-60=9.00 (sd=3.60)
61-75=8.05 (sd=3.11)
76-90=12.55 (sd=4.30)

This is actually pretty cool--the linear trend means that, on average for the EPL last year, goals scored increased from the beginning of the game to the end of the game. This could be due to defensive fatigue or to desperately trying to score an equalizer, I'm not sure which.

The only other significant trend is the quintic. This trend is a little weird, but what I think is driving it is the fact that goals scored are higher after halftime than before (being fired up from the manager?), then go down from 61-75min., then go up a huge amount at the end. I'd want to replicate it first, though, before drawing any strong conclusions from this trend.

ur_land
06 Apr 2004, 01:33 AM
Here is the same analysis for the 2003 MLS data:

linear F=10.43 p<.01
quadratic F=3.88 p<.09
cubic F=1.66 p<.24
quartic F=.004 p<.96
quintic F=1.32 p<.28

means
1-15=3.80 (sd=2.25)
16-30=4.60 (sd=1.83)
30-45=6.50 (sd=2.80)
46-60=5.30 (sd=2.21)
61-75=5.40 (sd=3.00)
76-90=7.30 (sd=2.58)


Even with a sample size of only ten, the linear trend is still significant (I forgot to mention that since these contrast codes are orthogonal and are all being analyzed in the same within-subjects regression, multiple comparisons do not need to be corrected for). The quintic is not significant here, so I don't think we should place too much stock in its significance in the EPL data. It is interesting that in two different leagues, with different skill levels, and possibly different defensive strategies (a lot more goals were scored in the EPL than in MLS per game--but maybe this is more due to parity), there is a replication of the linear trend. As the game goes on, goals are more likely to be scored.
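One common way to test a single within-subjects contrast is to compute a contrast score for each team and test whether the mean score differs from zero, with F equal to t squared. The sketch below does that for the linear trend on the 2003 MLS table posted earlier. It uses raw counts with no transformation, which is an assumption, so it should not be expected to reproduce the F values above exactly. The critical value is the tabulated 5% point for F(1, 9).

```python
import math
import statistics

# Goals per 15-minute period, MLS 2003, from the table posted earlier in the thread.
mls_2003 = {
    "Fire": [6, 3, 9, 6, 9, 8],
    "Revs": [5, 3, 8, 8, 6, 10],
    "Wizards": [6, 6, 5, 4, 8, 7],
    "Quakes": [2, 7, 7, 8, 2, 9],
    "Metros": [1, 4, 5, 5, 7, 9],
    "Crew": [2, 3, 10, 7, 4, 6],
    "United": [7, 2, 10, 1, 0, 11],
    "Rapids": [3, 7, 6, 5, 3, 6],
    "Galaxy": [1, 6, 2, 6, 8, 4],
    "Burn": [5, 5, 3, 3, 7, 3],
}

linear = [-5, -3, -1, 1, 3, 5]  # linear contrast code for six periods

# One linear contrast score per team.
scores = [sum(c * g for c, g in zip(linear, goals)) for goals in mls_2003.values()]

n = len(scores)
mean_score = statistics.mean(scores)
std_error = statistics.stdev(scores) / math.sqrt(n)

t_stat = mean_score / std_error  # one-sample t with n - 1 degrees of freedom
f_stat = t_stat ** 2             # equivalent F(1, n - 1)

F_CRITICAL_1_9 = 5.12            # tabulated 5% critical value for F(1, 9)
print(f"mean linear score = {mean_score:.1f}, F(1, {n - 1}) = {f_stat:.2f}")
print("linear trend significant at .05:", f_stat > F_CRITICAL_1_9)
```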

And since I'm on a roll, here are the analyses for MLS 2002:

linear F=7.24 p<.03
quadratic F=.16 p<.76
cubic F=.14 p<.72
quartic F=2.16 p<.17
quintic F=.005 p<.95

Means
1-15=5.30 (sd=2.06)
16-30=5.50 (sd=3.24)
30-45=7.40 (sd=2.27)
46-60=6.50 (sd=2.01)
61-75=7.30 (sd=1.49)
76-90=8.70 (sd=3.03)

Again, the linear trend is supported.

So, I think the answer is that the last 15 minutes are more important, at least in terms of goals scored.

Now the questions that remain are why? and are there meaningful variables that explain deviations from this pattern (i.e., interactions--is this more true for winning teams than losing teams, for example).