Had to use the title as it sums up what I'll be trying to do as soon as I get the data. Manchester City, in partnership with Opta, have decided to release their Lite data for the 11/12 Premiership season for all of the teams. We'll hold judgment until we see the actual data (I've already registered), but it looks promising. The amount of data is described here: http://www.mcfc.co.uk/The-Club/MCFC-Analytics/Data-available 185 events will be available, so we should be able to get something out of that for 20 teams over 38 games. You can register here: http://www.mcfc.co.uk/Home/The Club/MCFC Analytics As someone tweeted elsewhere, Linear Weights (for outcomes or dependent variables) should be there for the taking. One season of data proves nothing, but it should provide a start and at least a hypothesis or two on what things should look like. X&Y data may be available upon request for the hardcore (so you can find out how inaccurate your favourite striker was, from any distance at any time in a game). Nice article about the process here: http://www.forbes.com/sites/zachsla...ases-full-season-of-opta-data-for-public-use/ For football/soccer it should be the start of a game changer. Enjoy.
More information about what you may be getting if you have registered is on Howard Hamilton's blog (he interviewed Manchester City's performance director Gavin Fleig a couple of months ago, having met him at the Sloan Sports Conference a couple of years ago). Some of the events are listed here: http://www.soccermetrics.net/match-data-collection/the-opta-data-schema-an-introduction They are also offering to make the data set more friendly once you get it e-mailed to you (personally I want to take a look in the csv file first and see what's inside and what I think I can do with it): http://www.soccermetrics.net/match-...a-management-for-manchester-city-opta-release
Just got hold of the data in the last hour. From a quick look it looks solid. You get an Excel spreadsheet plus the Opta definitions (not all of them are clear, so some may take some figuring out). In terms of data and variables you will definitely need to do some work to group certain things together. Columns P-FN in the xls contain the variables of use (before column P it is mostly ID tags for the player, team and particular game), and most of the columns after FN relate to the keeper (I haven't looked at this thoroughly yet, so I am assuming each is an action from the player that resulted in that action from the opposition keeper). There are about a dozen extra player variables on the end of the sheet. Over 10000 player games are logged, so you may want to look at grouping stuff together (e.g. summing all of the games of individual players, which is easily do-able in Excel with a bit of effort).
List of events for anyone interested (the data set covers 10000+ player games and 538 individual players are logged):
Date; Player ID; Player Surname; Player Forename; Team; Team Id; Opposition; Opposition id; Venue; Position Id; Appearances; Time Played; Starts; Substitute On; Substitute Off; Goals; First Goal; Winning Goal; Shots On Target inc goals; Shots Off Target inc woodwork; Blocked Shots; Penalties Taken; Penalty Goals; Penalties Saved; Penalties Off Target; Penalties Not Scored; Direct Free-kick Goals; Direct Free-kick On Target; Direct Free-kick Off Target; Blocked Direct Free-kick; Goals from Inside Box; Shots On from Inside Box; Shots Off from Inside Box; Blocked Shots from Inside Box; Goals from Outside Box; Shots On Target Outside Box; Shots Off Target Outside Box; Blocked Shots Outside Box; Headed Goals; Headed Shots On Target; Headed Shots Off Target; Headed Blocked Shots; Left Foot Goals; Left Foot Shots On Target; Left Foot Shots Off Target; Left Foot Blocked Shots; Right Foot Goals; Right Foot Shots On Target; Right Foot Shots Off Target; Right Foot Blocked Shots; Other Goals; Other Shots On Target; Other Shots Off Target; Other Blocked Shots; Shots Cleared off Line; Shots Cleared off Line Inside Area; Shots Cleared off Line Outside Area; Goals Open Play; Goals from Corners; Goals from Throws; Goals from Direct Free Kick; Goals from Set Play; Goals from penalties; Attempts Open Play on target; Attempts from Corners on target; Attempts from Throws on target; Attempts from Direct Free Kick on target; Attempts from Set Play on target; Attempts from Penalties on target; Attempts Open Play off target; Attempts from Corners off target; Attempts from Throws off target; Attempts from Direct Free Kick off target; Attempts from Set Play off target; Attempts from Penalties off target; Goals as a substitute; Total Successful Passes All; Total Unsuccessful Passes All; Assists; Key Passes; Total Successful Passes Excl Crosses Corners; Total Unsuccessful Passes Excl Crosses Corners; Successful Passes Own Half; Unsuccessful Passes Own Half; Successful Passes Opposition Half; Unsuccessful Passes Opposition Half; Successful Passes Defensive third; Unsuccessful Passes Defensive third; Successful Passes Middle third; Unsuccessful Passes Middle third; Successful Passes Final third; Unsuccessful Passes Final third; Successful Short Passes; Unsuccessful Short Passes; Successful Long Passes; Unsuccessful Long Passes; Successful Flick-Ons; Unsuccessful Flick-Ons; Successful Crosses Corners; Unsuccessful Crosses Corners; Corners Taken incl short corners; Corners Conceded; Successful Corners into Box; Unsuccessful Corners into Box; Short Corners; Throw Ins to Own Player; Throw Ins to Opposition Player; Successful Dribbles; Unsuccessful Dribbles; Successful Crosses Corners Left; Unsuccessful Crosses Corners Left; Successful Crosses Left; Unsuccessful Crosses Left; Successful Corners Left; Unsuccessful Corners Left; Successful Crosses Corners Right; Unsuccessful Crosses Corners Right; Successful Crosses Right; Unsuccessful Crosses Right; Successful Corners Right; Unsuccessful Corners Right; Successful Long Balls; Unsuccessful Long Balls; Successful Lay-Offs; Unsuccessful Lay-Offs; Through Ball; Successful Crosses Corners in the air; Unsuccessful Crosses Corners in the air; Successful crosses in the air; Unsuccessful crosses in the air; Successful open play crosses; Unsuccessful open play crosses; Touches; Goal Assist Corner; Goal Assist Free Kick; Goal Assist Throw In; Goal Assist Goal Kick; Goal Assist Set Piece; Key Corner; Key Free Kick; Key Throw In; Key Goal Kick; Key Set Pieces; Duels won; Duels lost; Aerial Duels won; Aerial Duels lost; Ground Duels won; Ground Duels lost; Tackles Won; Tackles Lost; Last Man Tackle; Total Clearances; Headed Clearances; Other Clearances; Clearances Off the Line; Blocks; Interceptions; Recoveries; Total Fouls Conceded; Fouls Conceded exc handballs pens; Total Fouls Won; Fouls Won in Danger Area inc pens; Fouls Won not in danger area; Foul Won Penalty; Handballs Conceded; Penalties Conceded; Offsides; Yellow Cards; Red Cards; Goals Conceded; Goals Conceded Inside Box; Goals Conceded Outside Box; Saves Made; Saves Made from Inside Box; Saves Made from Outside Box; Saves from Penalty; Catches; Punches; Drops; Crosses not Claimed; GK Distribution; GK Successful Distribution; GK Unsuccessful Distribution; Clean Sheets; Team Clean sheet; Error leading to Goal; Error leading to Attempt; Challenge Lost; Shots On Conceded; Shots On Conceded Inside Box; Shots On Conceded Outside Box; Team Formation; Position in Formation; Turnovers; Dispossessed; Big Chances; Big Chances Faced; Pass Forward; Pass Backward; Pass Left; Pass Right; Unsuccessful Ball Touch; Successful Ball Touch; Take-Ons Overrun; CompId; SeasId; Touches open play final third; Touches open play opp box; Touches open play opp six yards
Nice to know I'm not the only one. I've just converted the whole data set into season totals by summing all of the variables using a SUMIF formula (with the criterion being the player ID; some variables obviously won't be suitable for this or relevant when totalled up). I'll start trying to group the variables together in the week (e.g. probably colour code them for ease of use, for instance all of the penalty-related variables in yellow) and then start running some regressions and see if we can get something out of it. I'm just figuring there are enough sharp people on here (at a variety of things) that I'd be surprised if we can't get something out of this.
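For anyone who would rather script the SUMIF step than do it in Excel, here is a minimal Python sketch of the same idea. The rows, the `SUMMABLE` tuple and the `season_totals` helper are made up for illustration; only the column names come from the sheet.

```python
from collections import defaultdict

# Hypothetical per-game rows; the real sheet has one row per player per game.
rows = [
    {"Player ID": 101, "Goals": 1, "Shots On Target inc goals": 3, "Team Formation": 4},
    {"Player ID": 101, "Goals": 0, "Shots On Target inc goals": 2, "Team Formation": 2},
    {"Player ID": 202, "Goals": 2, "Shots On Target inc goals": 4, "Team Formation": 4},
]

# Only count-type columns make sense as season totals.
# ID-type columns such as "Team Formation" are deliberately left out.
SUMMABLE = ("Goals", "Shots On Target inc goals")


def season_totals(rows, key="Player ID", columns=SUMMABLE):
    """Sum the chosen columns per player, like SUMIF keyed on the player ID."""
    totals = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        for col in columns:
            totals[row[key]][col] += row[col]
    return dict(totals)


totals = season_totals(rows)
print(totals[101])
```

Keeping the list of summable columns explicit is one way to avoid accidentally adding up identifiers like formation or position codes.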
Okay, I'm going to take a tentative stab (I'm completely open to constructive criticism) at linear weights for goals scored. I've added some caveats along the way to try and get worthwhile results. I grouped the data so it gave season-long data for all 538 players who played in the 2011/12 season (instead of by individual game). I then cut out all players who played fewer than 1000 minutes (a little under 30% of a 3420-minute season), and also those who scored fewer than 5 goals (to try and remove potential outliers). This reduced the data set to only 65 players. I then went through the variables and removed any that were not relevant (e.g. the number of goalkeeper punches wouldn't govern how many goals a player would score, as an outfield player would have a value of 0 for this). I then started running regressions and cut out every variable with a p value greater than 0.05 (i.e. statistically insignificant). This has left me with a variety of different options with R2 values between 0.84 and 0.89 (depending on slight changes to variables). In the odd case I've added a variable back in where it previously had a significant p-value. Here are two of the possible options I have come up with:
Goals = 1.6 (Intercept) + 0.37*Shots on Target Inc. Goals - 1.14*Foul Won Penalty + 0.24*Touches Open Play in Opp. 6 Yard Box - 0.17*Attempts in Open Play on Target
This gives an R-Squared of 0.847 and a Standard Error of 2.04, with all variables having a p-value of less than 0.05.
Adding "Big Chances" back in, which was marginally insignificant:
Goals = 2.166 (Intercept) + 0.31*Shots on Target Inc. Goals - 1.15*Foul Won Penalty + 0.19*Big Chances - 0.18*Attempts in Open Play on Target + 0.11*Touches Open Play in Opp. 6 Yard Box
This gives an R-Squared of 0.868 and a Standard Error of 1.92, with all variables having a p-value of less than 0.05 except Touches Open Play in Opp. 6 Yard Box (which was marginally insignificant at 0.107 with a low coefficient value).
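A minimal Python sketch of how the two equations above turn a season stat line into a goals estimate. The coefficients are the ones quoted in the post; the `predict_goals` helper and the example stat line are made up for illustration and do not describe a real player.

```python
# Coefficients copied from the two fits quoted above (season totals per player).
MODEL_A = {
    "intercept": 1.6,
    "Shots On Target inc goals": 0.37,
    "Foul Won Penalty": -1.14,
    "Touches open play opp six yards": 0.24,
    "Attempts Open Play on target": -0.17,
}
MODEL_B = {
    "intercept": 2.166,
    "Shots On Target inc goals": 0.31,
    "Foul Won Penalty": -1.15,
    "Big Chances": 0.19,
    "Attempts Open Play on target": -0.18,
    "Touches open play opp six yards": 0.11,
}


def predict_goals(model, stats):
    """Intercept plus the sum of coefficient * season total for each variable."""
    weighted = sum(
        coef * stats[name] for name, coef in model.items() if name != "intercept"
    )
    return model["intercept"] + weighted


# Made-up season line, for illustration only.
example = {
    "Shots On Target inc goals": 40,
    "Foul Won Penalty": 1,
    "Touches open play opp six yards": 20,
    "Attempts Open Play on target": 30,
    "Big Chances": 15,
}

goals_a = predict_goals(MODEL_A, example)
goals_b = predict_goals(MODEL_B, example)
print(round(goals_a, 2), round(goals_b, 2))
```

Bear in mind the fits were made on players with at least 1000 minutes and 5 goals, so plugging in stat lines outside that range is extrapolation.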
Looking at the values I think you can draw a variety of conclusions. The coefficient for shots on target including goals is around 0.3 to 0.4 in relation to goals (therefore shots on target are worth significantly more to a team than shots off target). Foul Won Penalty having a large negative coefficient is probably reasonable on the basis that the player fouled is not always the penalty taker, and giving away a penalty is usually a last-ditch attempt to avoid conceding a goal (hence the negative coefficient). Touches Open Play in Opponents' 6 Yard Box having a positive coefficient and being significant again seems reasonable, on the basis that the closer you are to the goal the easier you would assume it is to score? Big Chances being significant and having a positive coefficient would seem obvious? Attempts in Open Play on Target having a negative coefficient in both cases is one I am struggling with. My theory is that this probably accounts for shots from the edge of the box and outside of the box (only approximately 2% of goals came from outside of the box)? Does anyone have any thoughts on this?
I wish they had a column for own goals. I wanted to do some team things so I made a pivot table to make totals for each team and used goals versus goals conceded so I could get a W, D, or L but the numbers didn't come out right. I can only assume that I did not account for own goals.
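A rough Python sketch of the pivot described above, just to show where own goals would throw the results off. The rows and the `match_results` helper are invented for illustration, and the assumption is that a team's score is rebuilt by summing its own players' "Goals" column.

```python
from collections import defaultdict

# Hypothetical player-game rows for a single fixture between teams "A" and "B".
# Suppose the real score was 2-1 to A, with A's second goal an own goal by B.
rows = [
    {"Date": "2011-08-13", "Team": "A", "Opposition": "B", "Goals": 1},
    {"Date": "2011-08-13", "Team": "A", "Opposition": "B", "Goals": 0},
    {"Date": "2011-08-13", "Team": "B", "Opposition": "A", "Goals": 1},
]


def match_results(rows):
    """Rebuild W/D/L per team per match from summed player goals."""
    scored = defaultdict(int)
    for row in rows:
        scored[(row["Date"], row["Team"], row["Opposition"])] += row["Goals"]

    results = {}
    for (date, team, opposition), goals_for in scored.items():
        goals_against = scored.get((date, opposition, team), 0)
        if goals_for > goals_against:
            results[(date, team)] = "W"
        elif goals_for < goals_against:
            results[(date, team)] = "L"
        else:
            results[(date, team)] = "D"
    return results


results = match_results(rows)
# The own goal is credited to no player, so the rebuilt score is 1-1.
print(results[("2011-08-13", "A")])
```

In this made-up fixture the rebuilt result is a draw rather than a win for A, which is the kind of mismatch described in the post.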
Yeah, still trying to get my head around it. I have now started to look at the full set of 530-odd players instead (based on season data for each player rather than individual games) and have done regressions on various variables individually against goals scored (rather than trying to fit it to suit a certain type of player). All p values for the following variables are reported as 0.000 (i.e. below 0.001, so very significant). The ** ones are things I like:
**Shots on Target including Goals - coefficient: 0.312, R2=0.871, Adj R2=0.869 (this seems a reasonable predictor right there at 0.87 and a 31%ish strike rate?)
**Shots on Target from Inside the Box - coefficient: 0.441, R2=0.862, Adj R2=0.860 (this overlaps with the shots on target including goals statistic and at 0.86 is reasonable)
Shots on Target from Outside the Box - coefficient: 0.672, R2=0.510, Adj R2=0.508 (I don't like this, as the lower R2 puts less value behind it, and a higher coefficient than for shots from inside the box doesn't make sense to me, as it implies there is more value in shots taken from outside than inside the box?)
Headed Shots on Target - coefficient: 1.231, R2=0.495, Adj R2=0.493 (discounting this for somewhat the same reasons as above: the low R2 value, and the coefficient doesn't make sense, e.g. 1.2 goals from every header on target?)
**Attempts from Open Play on Target - coefficient: 0.414, R2=0.811, Adj R2=0.809
**Big Chances - coefficient: 0.646, R2=0.844, Adj R2=0.842 (I like this because, even if we don't have a definition of what a big chance is, to me it implies something better than a shot on target from inside the box or a shot on target including goals, and that justifies the higher coefficient with an equivalent R2 value).
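For reference, a minimal Python sketch of a one-variable regression of the kind listed above, returning the coefficient, intercept and R2. It assumes an ordinary least squares fit with an intercept (the posts don't say whether one was included), and the `shots` and `goals` numbers are made up rather than taken from the data set.

```python
def simple_ols(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up season totals for five hypothetical players.
shots = [5, 12, 20, 33, 47]   # shots on target including goals
goals = [1, 4, 6, 11, 15]     # goals scored

slope, intercept, r_squared = simple_ols(shots, goals)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

Running this per variable gives one coefficient and one R2 per predictor, which is the layout used in the list above.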
Incidentally, using the 0.312 coefficient for shots on target including goals above, and then simulating 5000 games (using data tables in Excel and a Poisson random variable with the average number of goals per game as the mean) and taking an average (for 38 games): based on 82 shots on target including goals, Robin van Persie was predicted to score 25-26 goals last season. So you could argue that by scoring 30 goals last season he outperformed expectations, and that it was foolish of Arsenal to sell him in that respect (couldn't happen to a nicer team!). Hope this helps someone, and I'm happy to bounce something around.
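A Python sketch of one way to read the simulation above: the per-game Poisson mean is taken as 82 * 0.312 / 38, and 5000 seasons of 38 games are drawn and averaged. That reading, the function names and the seed are my assumptions; the Excel data-table setup may have differed in detail.

```python
import math
import random


def poisson_draw(mean, rng):
    """Knuth's method: multiply uniform draws until the product falls below e^-mean."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_season_goals(shots_on_target, coefficient=0.312, games=38,
                          seasons=5000, seed=0):
    """Return the average simulated season total and the list of all totals."""
    rng = random.Random(seed)
    per_game_mean = shots_on_target * coefficient / games
    totals = [
        sum(poisson_draw(per_game_mean, rng) for _ in range(games))
        for _ in range(seasons)
    ]
    return sum(totals) / seasons, totals


mean_goals, totals = simulate_season_goals(82)
# Share of simulated seasons that reach the 30 goals actually scored.
share_30_plus = sum(t >= 30 for t in totals) / len(totals)
print(round(mean_goals, 1), round(share_30_plus, 3))
```

The average should land close to 82 * 0.312 = 25.6, in line with the 25-26 quoted above; the spread of `totals` gives a feel for how unusual a 30-goal season would be under this simple model.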
Following on from the above, in relation to RVP: if you take his performances over the first three games of 12/13 (he started every game except for one for Arsenal in 11/12, so it is virtually a fair comparison even though he only played 20 minutes or so of the first game this year), then based on the above coefficient that would project out to a 23-24 goal season (this may work better on a minutes-played basis, but I don't have that data for 12/13). Of course this assumes a lot of things (e.g. that Rooney doesn't come back and take chances off him, and after three games they haven't played a good variety of teams).
A few technical questions/remarks, although I understand that having some results is better than no results. Still, it's good to understand whether we might have biases and in which direction we should expect them. For other issues you might not yet have enough data to work with. 1) Games often open up after the first goal, so the "independent" variables are not really independent. Won't there be a problem of reverse causality? 2) Do you have individual fixed effects? For example, you want to abstract from differences in quality between the players taking a shot, to understand the true contribution of a shot on target.
Totally agreed, I think, on point 1. But wouldn't running a regression on just shots on target including goals against goals over several seasons cut this out (I could grab the data off ESPN, as they use the Opta data for the shots on target stat), e.g. if the figure holds at around 0.31? Or would it just confirm that the reverse causality is consistent, or that both are consistent? I was just trying to put a general number against shots on target in terms of what it added regardless of the other effects. Really valid point though, as it is true, but we are limited by the data set available (it depends how far you want to go with it). Totally agreed on point 2, but we can only work on the data that we have got (e.g. shots on target). You can split it into inside the box and outside the box, but you are still making assumptions about the quality of the shot from either position (e.g. the point I made above: I think the low R2 on shots on target from outside the box makes it reasonable to discount them as a contributor?).
Thanks for this thread. Makes me wish that I were still in school, I would definitely have done some sort of statistics project using this data. I will see if I can do something in my free time.
I think you are misinterpreting the coefficients. The coefficients can be high for two different reasons. One is that the action itself is likely to produce a goal. A second is that the action is a proxy for skill. The coefficients will most likely represent a bit of both, some being driven more by the goals they create and others more by how much they reflect a player's skill level. For example, the coefficient for shots on target from outside the box being higher than for shots on target from inside the box doesn't mean that a player is more likely to score if shooting from a longer distance. It just means that knowing a player has a lot of shots on target from outside the box is a better predictor of the player being a good goal scorer than shots on target from inside the box. This may be because putting a shot on target from a longer distance is more difficult than doing so from close range. It may also be partly because the best goal scorers are more likely to try shots from distance, as their success rate on these long shots has been higher than the average player's. So top forwards take more long-range shots and end up with more being on goal. Another example: as you note, it doesn't make sense to interpret the Head Shots on Target coefficient of 1.2 as meaning that the player will score 1.2 goals per header on target! It just means that getting a lot of headers on target is a very strong predictor of a player being a top goal scorer. Another thought: it may also be a strong predictor of being a forward! So doing the regressions separately by position would be a good test of this. The other thing to realize is that coefficients interact in a regression analysis. The 1.2 coefficient for Head Shots is very high no matter how you interpret it. I wonder whether it would stay that high if you ran analyses with only 1 or 2 other predictors?
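A contrived Python illustration of the proxy point above. In this made-up data every player converts exactly 30% of all shots on target and 20% of their shots on target are headers, so no header is worth more than any other shot, yet regressing goals on headers alone still gives a coefficient above 1. All numbers and the `ols_slope` helper are invented for the example.

```python
def ols_slope(xs, ys):
    """Slope of the least squares line of ys on xs (fitted with an intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up season totals for five hypothetical players.
shots_on_target = [10, 20, 30, 40, 50]
headers_on_target = [0.2 * s for s in shots_on_target]   # 20% of shots are headers
goals = [0.3 * s for s in shots_on_target]                # 30% conversion on every shot

slope_all_shots = ols_slope(shots_on_target, goals)
slope_headers = ols_slope(headers_on_target, goals)
print(round(slope_all_shots, 2), round(slope_headers, 2))
```

Here the headers coefficient comes out at 1.5 purely because headers track total shot volume, not because a header on target yields 1.5 goals. Real data is messier, but the same mechanism could inflate a single-variable coefficient.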
Cool post, but there are some things that I agree with and some I don't totally agree with. Again, this may be my misunderstanding or lack of knowledge on the subject, as this was never my thing when I was at school, or in one case I'm just reading this a different way. Re: the proxy for skill, I hadn't considered that and would totally agree with it if we were talking about a small group of players, but we are talking about 530 players over a full season. Admittedly some players are playing every game, some are playing for less than a game, some players are internationals, some aren't, some you would regard as skillful, some you wouldn't, and they come from all outfield positions. What I am getting at is: given the sample size, shouldn't this essentially rule out the proxy for skill? I'm sure you could try and fit it to a certain kind of player as I did originally (e.g. we could isolate players with over 30% of their shots coming from outside of the box and, say, more than 30 such shots) and that would result in different coefficients. I totally realised the point about the shots outside the box example, even if I didn't exactly make it clear, but I didn't like the R2 value. The higher coefficient for shots from outside the box suggests it is potentially a better indicator of someone who will be a good goalscorer, but it appears to be a substantially weaker indicator than shots from inside the box because of the lower R2. Re: the headers and all of the other variables, I was only doing one variable at a time (to try and get the value of each individual variable), but isolating it by position would be possible (it's been a while since I've looked at these, as the forum seemed a touch dead and I'd moved on to other things). I don't like headers because of the R2 value and the coefficient.
The R2 of 0.49 for headers on target suggests that it only accounts for 49% of the variation in goals scored. That, together with the coefficient of 1.2, which in principle is nuts (I will post up the residuals if I still have them to illustrate what I mean), means the variable doesn't sit easily with me.