The door to linear weights starts to creak open.....

Discussion in 'Statistics and Analysis' started by mrvp, Aug 17, 2012.

  1. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Had to use the title as it sums up what I'll be trying to do as soon as I get the data. Manchester City, in partnership with Opta, have decided to release their Lite data for the 11/12 Premiership season for all of the teams. We'll hold judgment until we see the actual data (I've already registered) but it looks promising.

    Amount of data is described here:

    http://www.mcfc.co.uk/The-Club/MCFC-Analytics/Data-available

    Data on 185 events will be available, so we should be able to get something out of that for 20 teams over 38 games.

    You can register here:

    http://www.mcfc.co.uk/Home/The Club/MCFC Analytics

    As someone tweeted elsewhere, linear weights (for outcomes or dependent variables) should be there for the taking. One season of data proves nothing but it should provide a start and at least a hypothesis or two on what things should look like. X&Y data may be available upon request for the hardcore (so you can find out how inaccurate your favourite striker was, from any distance at any time in a game).

    Nice article about the process here:

    http://www.forbes.com/sites/zachsla...ases-full-season-of-opta-data-for-public-use/

    For football/soccer it should be the start of a game changer.........

    Enjoy,
     
  2. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    More information about what you may be getting if you have registered is on Howard Hamilton's blog (he interviewed Manchester City's performance director Gavin Fleig a couple of months ago, having met him at the Sloan Sports Conference a couple of years ago). Some of the events are listed here:

    http://www.soccermetrics.net/match-data-collection/the-opta-data-schema-an-introduction

    They are also offering to make the data set more friendly once you get it e-mailed to you (personally I want to take a look in the csv file first and see what's inside and what I think I can do with it)

    http://www.soccermetrics.net/match-...a-management-for-manchester-city-opta-release
     
  3. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Just got hold of the data in the last hour. From a quick look it looks solid. You get an Excel spreadsheet plus the Opta definitions (not all of them are clear so some may take some figuring out).

    In terms of data and variables you will definitely need to do some work to group certain things together:

    Columns P-FN in the xls contain variables of use (before column P it is mostly ID tags for the player, team and particular game). Most of the columns after FN relate to the keeper (I haven't looked at this thoroughly yet so I am assuming it is an action from the player that resulted in that action from the opposition keeper). There are about a dozen extra player variables on the end of the sheet. Over 10000 player games are logged, so you may want to look at grouping stuff together (e.g. grouping/summing all of the games of individual players etc., which is easily do-able in Excel with a bit of effort).
     
  4. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    List of events for anyone interested (data set covers 10000+ player games and 538 individual players are logged):

    Date
    Player ID
    Player Surname
    Player Forename
    Team
    Team Id
    Opposition
    Opposition id
    Venue
    Position Id
    Appearances
    Time Played
    Starts
    Substitute On
    Substitute Off
    Goals
    First Goal
    Winning Goal
    Shots On Target inc goals
    Shots Off Target inc woodwork
    Blocked Shots
    Penalties Taken
    Penalty Goals
    Penalties Saved
    Penalties Off Target
    Penalties Not Scored
    Direct Free-kick Goals
    Direct Free-kick On Target
    Direct Free-kick Off Target
    Blocked Direct Free-kick
    Goals from Inside Box
    Shots On from Inside Box
    Shots Off from Inside Box
    Blocked Shots from Inside Box
    Goals from Outside Box
    Shots On Target Outside Box
    Shots Off Target Outside Box
    Blocked Shots Outside Box
    Headed Goals
    Headed Shots On Target
    Headed Shots Off Target
    Headed Blocked Shots
    Left Foot Goals
    Left Foot Shots On Target
    Left Foot Shots Off Target
    Left Foot Blocked Shots
    Right Foot Goals
    Right Foot Shots On Target
    Right Foot Shots Off Target
    Right Foot Blocked Shots
    Other Goals
    Other Shots On Target
    Other Shots Off Target
    Other Blocked Shots
    Shots Cleared off Line
    Shots Cleared off Line Inside Area
    Shots Cleared off Line Outside Area
    Goals Open Play
    Goals from Corners
    Goals from Throws
    Goals from Direct Free Kick
    Goals from Set Play
    Goals from penalties
    Attempts Open Play on target
    Attempts from Corners on target
    Attempts from Throws on target
    Attempts from Direct Free Kick on target
    Attempts from Set Play on target
    Attempts from Penalties on target
    Attempts Open Play off target
    Attempts from Corners off target
    Attempts from Throws off target
    Attempts from Direct Free Kick off target
    Attempts from Set Play off target
    Attempts from Penalties off target
    Goals as a substitute
    Total Successful Passes All
    Total Unsuccessful Passes All
    Assists
    Key Passes
    Total Successful Passes Excl Crosses Corners
    Total Unsuccessful Passes Excl Crosses Corners
    Successful Passes Own Half
    Unsuccessful Passes Own Half
    Successful Passes Opposition Half
    Unsuccessful Passes Opposition Half
    Successful Passes Defensive third
    Unsuccessful Passes Defensive third
    Successful Passes Middle third
    Unsuccessful Passes Middle third
    Successful Passes Final third
    Unsuccessful Passes Final third
    Successful Short Passes
    Unsuccessful Short Passes
    Successful Long Passes
    Unsuccessful Long Passes
    Successful Flick-Ons
    Unsuccessful Flick-Ons
    Successful Crosses Corners
    Unsuccessful Crosses Corners
    Corners Taken incl short corners
    Corners Conceded
    Successful Corners into Box
    Unsuccessful Corners into Box
    Short Corners
    Throw Ins to Own Player
    Throw Ins to Opposition Player
    Successful Dribbles
    Unsuccessful Dribbles
    Successful Crosses Corners Left
    Unsuccessful Crosses Corners Left
    Successful Crosses Left
    Unsuccessful Crosses Left
    Successful Corners Left
    Unsuccessful Corners Left
    Successful Crosses Corners Right
    Unsuccessful Crosses Corners Right
    Successful Crosses Right
    Unsuccessful Crosses Right
    Successful Corners Right
    Unsuccessful Corners Right
    Successful Long Balls
    Unsuccessful Long Balls
    Successful Lay-Offs
    Unsuccessful Lay-Offs
    Through Ball
    Successful Crosses Corners in the air
    Unsuccessful Crosses Corners in the air
    Successful crosses in the air
    Unsuccessful crosses in the air
    Successful open play crosses
    Unsuccessful open play crosses
    Touches
    Goal Assist Corner
    Goal Assist Free Kick
    Goal Assist Throw In
    Goal Assist Goal Kick
    Goal Assist Set Piece
    Key Corner
    Key Free Kick
    Key Throw In
    Key Goal Kick
    Key Set Pieces
    Duels won
    Duels lost
    Aerial Duels won
    Aerial Duels lost
    Ground Duels won
    Ground Duels lost
    Tackles Won
    Tackles Lost
    Last Man Tackle
    Total Clearances
    Headed Clearances
    Other Clearances
    Clearances Off the Line
    Blocks
    Interceptions
    Recoveries
    Total Fouls Conceded
    Fouls Conceded exc handballs pens
    Total Fouls Won
    Fouls Won in Danger Area inc pens
    Fouls Won not in danger area
    Foul Won Penalty
    Handballs Conceded
    Penalties Conceded
    Offsides
    Yellow Cards
    Red Cards
    Goals Conceded
    Goals Conceded Inside Box
    Goals Conceded Outside Box
    Saves Made
    Saves Made from Inside Box
    Saves Made from Outside Box
    Saves from Penalty
    Catches
    Punches
    Drops
    Crosses not Claimed
    GK Distribution
    GK Successful Distribution
    GK Unsuccessful Distribution
    Clean Sheets
    Team Clean sheet
    Error leading to Goal
    Error leading to Attempt
    Challenge Lost
    Shots On Conceded
    Shots On Conceded Inside Box
    Shots On Conceded Outside Box
    Team Formation
    Position in Formation
    Turnovers
    Dispossessed
    Big Chances
    Big Chances Faced
    Pass Forward
    Pass Backward
    Pass Left
    Pass Right
    Unsuccessful Ball Touch
    Successful Ball Touch
    Take-Ons Overrun
    CompId
    SeasId
    Touches open play final third
    Touches open play opp box
    Touches open play opp six yards
     
  5. TheAnswer1313

    TheAnswer1313 Member+

    Dec 12, 2007
    Charleston, WV
    Club:
    Arsenal FC
    Nat'l Team:
    Italy
    I downloaded it too
    Massive amount of data.
     
  6. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Nice to know I'm not the only one. I've just converted the whole data set into values for the season by summing all of the variables using a sumif formula (with the criteria being the player ID - some variables obviously won't be suitable for this or relevant when totaled up). I'll start trying to group the variables together in the week (e.g. probably colour code them for ease of use - for instance all of the penalty related variables in yellow) and then start running some regressions and see if we can get something out of it. I'm just figuring there are enough sharp people on here (at a variety of things) I'd be surprised if we can't get something out of this.
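
    For anyone who would rather do the SUMIF step outside Excel, here is a minimal Python sketch of the same idea: sum every counting column per Player ID. The column names come from the event list earlier in the thread; the sample rows, the choice of which columns count as identifiers, and the function name season_totals are assumptions for illustration only, not the actual contents of the release.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape of the player-game sheet.
# Column names are from the event list in this thread; values are made up.
SAMPLE = """Player ID,Player Surname,Time Played,Goals,Shots On Target inc goals
1,Alpha,90,1,3
2,Beta,45,0,1
1,Alpha,78,2,4
2,Beta,90,1,2
"""

# Columns that identify a row rather than count something, so they are not summed.
ID_COLUMNS = {"Player ID", "Player Surname"}


def season_totals(rows):
    """Sum every counting column per Player ID (the SUMIF step)."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        player = row["Player ID"]
        for column, value in row.items():
            if column not in ID_COLUMNS:
                totals[player][column] += int(value)
    return {player: dict(stats) for player, stats in totals.items()}


totals = season_totals(csv.DictReader(io.StringIO(SAMPLE)))
print(totals["1"])
```

    As noted in the post, some columns (formation, position, venue and the like) make no sense when totalled, so in practice the ID_COLUMNS set would need extending before running this on the real sheet.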
     
  7. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Okay I'm going to take a tentative stab (I'm completely open to constructive criticism) at linear weights for goals scored - I've added some caveats in along the way to try and get worthwhile results:

    I grouped the data so it gave season long data for all 538 players who played in the 2011/12 season (instead of by individual game). I then cut out all players who played fewer than 1000 minutes (a little under 30% of a 3,420-minute league season), and scored fewer than 5 goals (to try and remove potential outliers). This reduced the data set to only 65 players.

    I then went through the variables and removed any variables that were not relevant (e.g. number of goalkeeper punches wouldn't govern how many goals a player would score, as the player would have a value of 0 for this as he is an outfield player and not a goalkeeper). I then started running regressions and cut out every variable with a p value greater than 0.05 (i.e. statistically insignificant). This has then left me with a variety of different options with R2 values between 0.84 and 0.89 (depending on slight changes to variables). In the odd case I've added a variable back in where it previously had a significant p-value - here are two of the possible options I have come up with:

    Goals = 1.6 (Intercept) + 0.37*Shots on Target Inc. Goals - 1.14*Foul Won Penalty + 0.24*Touches Open Play in Opp. 6 Yard Box - 0.17*Attempts in Open Play on Target

    This gives an R-Squared of 0.847 and a Standard Error of 2.04 - with all variables with a p-value of less than 0.05

    Adding "Big Chances" back in, which was marginally insignificant, gives:

    Goals = 2.166 (Intercept) + 0.31*Shots on Target Inc. Goals - 1.15*Foul Won Penalty + 0.19*Big Chances - 0.18*Attempts in Open Play on Target + 0.11*Touches Open Play in Opp. 6 Yard Box

    This gives an R-Squared of 0.868 and a Standard Error of 1.92 - with all variables with a p-value of less than 0.05 except Touches Open Play in Opp.6 Yard Box (which was marginally insignificant at 0.107 with a low co-efficient value).
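
    To make the two sets of weights concrete, here is a small Python sketch that applies them to a season stat line. The coefficients are copied from the two regressions quoted above; the player stat line is entirely made up, and predict_goals is just a hypothetical helper name.

```python
# Coefficients copied from the two regressions quoted in this post.
MODEL_1 = {
    "intercept": 1.6,
    "Shots On Target inc goals": 0.37,
    "Foul Won Penalty": -1.14,
    "Touches open play opp six yards": 0.24,
    "Attempts Open Play on target": -0.17,
}
MODEL_2 = {
    "intercept": 2.166,
    "Shots On Target inc goals": 0.31,
    "Foul Won Penalty": -1.15,
    "Big Chances": 0.19,
    "Attempts Open Play on target": -0.18,
    "Touches open play opp six yards": 0.11,
}


def predict_goals(weights, stats):
    """Intercept plus the weighted sum of the season totals."""
    total = weights["intercept"]
    for name, weight in weights.items():
        if name != "intercept":
            total += weight * stats.get(name, 0)
    return total


# Hypothetical season totals for one player (made-up numbers).
player = {
    "Shots On Target inc goals": 40,
    "Foul Won Penalty": 1,
    "Touches open play opp six yards": 20,
    "Attempts Open Play on target": 30,
    "Big Chances": 15,
}

print(round(predict_goals(MODEL_1, player), 2))
print(round(predict_goals(MODEL_2, player), 2))
```

    Bear in mind the weights were fitted on only 65 players, so they describe that filtered group rather than every player in the league.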

    Looking at the values I think you can draw a variety of conclusions:

    The co-efficient value for shots on target including goals is around 0.3 to 0.4 in relation to goals (therefore shots on target are worth significantly more to a team than shots off target).
    The Foul Won Penalty having a large negative co-efficient is probably reasonable on the basis that the player fouled is not always the penalty taker, and giving away a penalty is usually a last ditch attempt at trying to avoid conceding a goal (therefore the negative co-efficient).
    Touches Open Play in Opponents 6 Yards box having a positive co-efficient and being significant again seems reasonable on the basis that the closer you are to the goal the easier, you would assume, it is to score?
    Big Chances being significant and having a positive co-efficient would seem obvious?
    Attempts in Open Play on Target having a negative co-efficient in both cases is one I am struggling with. My theory is that this probably accounts for shots from the edge of the box and outside of the box (only approximately 2% of goals came from outside of the box)?

    Does anyone have any thoughts on this?
     
  8. kinznk

    kinznk Member

    Feb 11, 2007
    I wish they had a column for own goals. I wanted to do some team things so I made a pivot table to make totals for each team and used goals versus goals conceded so I could get a W, D, or L but the numbers didn't come out right. I can only assume that I did not account for own goals.
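
    One possible way around the missing own-goals column, sketched in Python: take each side's score from what the other side conceded rather than from what its own players scored, and flag any match where the two disagree. This assumes you can get a reliable team-level goals-conceded figure per match (for example from a goalkeeper who played the whole game), which is an assumption about the data rather than something confirmed here; the match rows and function names below are hypothetical.

```python
def result(goals_for, goals_against):
    """Return W, D or L from one side's point of view."""
    if goals_for > goals_against:
        return "W"
    if goals_for < goals_against:
        return "L"
    return "D"


def team_results(home, away):
    """Score each side by what the other side conceded.

    Returns (home result, away result, own_goal_suspected).
    """
    home_goals = away["goals_conceded"]
    away_goals = home["goals_conceded"]
    own_goal_suspected = (
        home_goals != home["player_goals"] or away_goals != away["player_goals"]
    )
    return result(home_goals, away_goals), result(away_goals, home_goals), own_goal_suspected


# Hypothetical match: home players scored 1, away players scored 1,
# but the away side conceded 2 (one of them an own goal).
home = {"team": "Home FC", "player_goals": 1, "goals_conceded": 1}
away = {"team": "Away FC", "player_goals": 1, "goals_conceded": 2}

print(result(home["player_goals"], away["player_goals"]))  # naive pivot-table view
print(team_results(home, away))
```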
     
  9. Bulgarian_Football_Fan

    Sep 3, 2012
    Club:
    Swansea City AFC
    Yeah, I got the data too.

    It's ********ing huge... I have to get a day-off to look at everything...
     
  10. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Yeah still trying to get my head around it - have now started to look at the full set of 530 odd players instead (based on season data for each player rather than individual games) and have now done regressions on various variables individually against goals scored (rather than trying to fit it to suit a certain type of player):

    All p values for the following variables are less than 0.001 (so very significant)
    ** ones are things I like;

    **Shots on Target including Goals - coefficient: 0.312, R2=0.871 Adj R2=0.869 (this seems a reasonable predictor right there at 0.87 and a 31%ish strike rate?)
    **Shots on Target from Inside the Box - coefficient: 0.441, R2=0.862 Adj R2=0.860 (this would include some of the shots on target including goals statistic and at 0.86 is reasonable)
    Shots on Target from Outside the Box - coefficient: 0.672, R2=0.510 Adj R2=0.508 (don't like this as the lower R2 puts less value behind it and the higher coefficient than the shots from inside the box to me doesn't make sense as it implies that there is more value on shots taken from outside than inside the box?)
    Head shots on Target - coefficient: 1.231, R2=0.495 Adj R2=0.493 (discounting this for somewhat the same reasons as above - the low R2 value and the co-efficient doesn't make sense e.g. 1.2 goals from every shot on target from a header?)
    **Attempts from Open Play on Target - coefficient=0.414, R2=0.811 Adj R2=0.809
    **Big Chances - coefficient=0.646, R2=0.844 Adj R2=0.842 (like this as even if we don't have a definition of what a big chance is, to me it would imply that it is better than a shot on target from inside the box or a shot on target including a goals and this to me justifies the higher co-efficient with an equivalent R2 value).
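
    For reference, a one-variable regression like the ones listed above can be sketched in plain Python as follows. This is ordinary least squares with an intercept, returning the slope, intercept, R2 and adjusted R2; whether the Excel runs above included an intercept is not stated, so treat that as an assumption. The sample numbers are made up and simple_ols is a hypothetical name.

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 for one predictor: penalises by (n - 1) / (n - 2).
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adj_r2


# Made-up season totals: shots on target (x) against goals (y).
shots_on_target = [0, 5, 10, 20, 40, 82]
goals = [0, 1, 4, 6, 12, 30]

slope, intercept, r2, adj_r2 = simple_ols(shots_on_target, goals)
print(round(slope, 3), round(r2, 3), round(adj_r2, 3))
```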

    Incidentally, using the 0.312 coefficient for shots on target including goals above, and then simulating 5000 games (using data tables in Excel and a Poisson random variable with the average number of goals per game as the mean) and taking an average (for 38 games): based on 82 shots on target including goals, Robin van Persie was predicted to score 25-26 goals last season (0.312 * 82 is about 25.6), so you could argue that by scoring 30 goals he outperformed expectations and it was foolish of Arsenal to sell him in that respect (couldn't happen to a nicer team :) !).
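
    The simulation described above can be sketched in Python along these lines. This is one reading of the Excel setup, not a reproduction of it: each of 38 games draws a Poisson number of goals with mean 0.312 * 82 / 38, and the season total is averaged over 5000 simulated seasons. The Poisson draw uses Knuth's multiplication method because the standard library has no Poisson sampler; the function names and the seed are arbitrary choices.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_season_goals(shots_on_target, coefficient, games=38, runs=5000, seed=2012):
    """Average season goal total over many simulated seasons."""
    rng = random.Random(seed)
    mean_per_game = coefficient * shots_on_target / games
    season_totals = [
        sum(poisson_sample(rng, mean_per_game) for _ in range(games))
        for _ in range(runs)
    ]
    return sum(season_totals) / runs


average = simulate_season_goals(82, 0.312)
print(round(average, 1))
```

    The average will land close to 0.312 * 82 by construction, so the more informative output of a simulation like this is the spread of season totals, which shows how far a 30-goal season sits from the expected figure.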

    Hope this helps someone........ and am happy to bounce something around.
     
  11. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Following on from the above, in relation to RVP - if you take his performances over the last three games for 12/13 (he started every game except for one for Arsenal in 11/12 so it is virtually a fair comparison even though he only played 20 minutes or so of the first game this year) based on the above co-efficient that would project out to a 23-24 goal season (this may work better on a minutes played basis but I don't have that data for 12/13). Of course this assumes a lot of things (e.g. that Rooney doesn't come back and take chances off of him etc., after three games they haven't played a good variety of teams etc.)
     
  12. palynka

    palynka Member

    Jun 7, 2006
    Nat'l Team:
    Portugal
    A few technical questions/remarks, although I understand that having some results is better than no results. Still, it's good to understand if we might have biases and to understand in which direction we should expect them. Other issues you might not have enough data on yet to work with.

    1) Games often open up after the first goal, so the "independent" variables are not really independent. Won't the problem of reverse causality be there?

    2) Do you have individual fixed effects? For example, you want to abstract from quality difference of players taking a shot, to understand the true contribution of a shot on target.
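
    To illustrate what individual fixed effects would do, here is a minimal Python sketch of the "within" estimator: subtract each player's own averages from his game-level numbers before fitting the slope, so that differences in player quality drop out. It needs game-level rows rather than season totals, and the data below is made up; within_slope is a hypothetical helper name.

```python
from collections import defaultdict


def within_slope(rows):
    """Slope of goals on shots after demeaning both within each player.

    rows: iterable of (player_id, shots_on_target, goals), one per player game.
    """
    by_player = defaultdict(list)
    for player, shots, goals in rows:
        by_player[player].append((shots, goals))

    sxx = 0.0
    sxy = 0.0
    for games in by_player.values():
        mean_x = sum(x for x, _ in games) / len(games)
        mean_y = sum(y for _, y in games) / len(games)
        for x, y in games:
            sxx += (x - mean_x) ** 2
            sxy += (x - mean_x) * (y - mean_y)
    return sxy / sxx


# Made-up games: player B both shoots more and converts from a higher baseline.
rows = [
    ("A", 1, 0), ("A", 3, 1), ("A", 5, 2),
    ("B", 4, 3), ("B", 6, 4), ("B", 8, 5),
]

fixed_effects = within_slope(rows)
# Treating everyone as a single group gives the ordinary pooled slope.
pooled = within_slope([("all", x, y) for _, x, y in rows])
print(round(fixed_effects, 3), round(pooled, 3))
```

    In this toy data the pooled slope comes out higher than the within-player slope, which is the direction of bias you would expect if better players both shoot more and convert more.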
     
  13. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Totally agreed, I think, on point 1. But wouldn't running a regression of goals on just shots on target including goals over several seasons (I could grab the data off of ESPN as they use the Opta data for the shots on target stat) cut this out, e.g. if the figure holds at around 0.31? Or would it just confirm that the reverse causality is consistent, or confirm both are consistent? I was just trying to put a general number against shots on target in terms of what it added regardless of the other effects. Really valid point though, but we are limited by the data set available (depends how far you want to go with it).

    Totally agreed on point 2, but we can only work on the data that we have got (e.g. shots on target), you can split it for inside the box and outside the box but you are still making assumptions on the quality of the shot from either position (e.g. the point I made above - I think the low R2 on shots on target from outside the box, makes it reasonable to discount them as a contributor?).
     
  14. snahdog

    snahdog Member

    Mar 31, 2006
    Atlanta
    Thanks for this thread. Makes me wish that I were still in school, I would definitely have done some sort of statistics project using this data. I will see if I can do something in my free time.
     
  15. skydog

    skydog Member+

    Aug 1, 1999
    Durham, NC
    Club:
    Los Angeles Galaxy
    I think you are misinterpreting the coefficients. The coefficients can be high for two different reasons. One is that the action itself is likely to produce a goal. A second is that the action is a proxy for skill. The coefficients will most likely represent a bit of both, some being more driven by the goals they create and others more determined by how much they reflect a player's skill level.
    For example, the coefficient for shots on target outside of the box being higher than for shots on target from inside the box doesn't mean that a player is more likely to score if shooting from a longer distance. It just means that knowing a player has a lot of shots on target from outside the box is a better predictor of the player being a good goal scorer than shots on target from inside the box. This may be due to the fact that putting a shot on target from a longer distance is more difficult than doing so from close range. It also may be partly due to the fact that the best goal scorers are more likely to try shots from distance because their success rate has been higher than the average player's for these long shots. So top forwards take more long range shots and end up with more being on goal.

    Another example of this is that, as you note, it doesn't make sense to interpret the Head Shots on Target coefficient of 1.2 as meaning that the player will score 1.2 goals per header on target! It just means that getting a lot of headers on target is a very strong predictor of a player being a top goal scorer. Another thought - it also may be a strong predictor of being a forward! So doing the regressions separately by position would be a good test of this.

    The other thing to realize is that coefficients interact in a regression analysis. The 1.2 coefficient for Head Shots is very high no matter how you interpret it. I wonder if you put in analyses with only 1 or 2 other predictors it would stay that high?
     
  16. Jeff FTBpro

    Jeff FTBpro New Member

    Nov 7, 2012
    London, United Kingdom
    Club:
    Liverpool FC
    Nat'l Team:
    England
    But does "Moneyball" really work in football?
     
  17. mrvp

    mrvp Member

    Jun 26, 2012
    South of the River
    Club:
    Chelsea FC
    Cool post but there are some things that I agree with and some I don't totally agree with - and again this may be my misunderstanding or lack of knowledge on the subject as this was never my thing when I was at school - or in one case I'm just reading this a different way. Re: the proxy for skill - I hadn't considered that and would totally agree with that if we were talking about a small group of players but we are talking about 530 players over a full season. Admittedly some players are playing every game, some are playing for less than a game, some players are internationals, some aren't, some you would regard as skillful, some you wouldn't - they come from all outfield positions. What I am getting at is: due to the sample size, shouldn't this essentially rule out the proxy for skill? I'm sure you could try and fit it to a certain kind of player as I did originally (e.g. we could isolate players with over 30% of their shots coming from outside of the box, and say this being over 30 shots) and that would result in a different co-efficient.

    I totally realised the point about the shots outside the box example - even if I didn't exactly make it clear - but I didn't like the R2 value. The higher coefficient for shots from outside the box suggests that it is potentially a better indicator of someone that will be a good goalscorer, but it appears to be a substantially weaker indicator than shots from inside the box due to the lower R2.

    Re: the headers, and all of the other variables: I was only doing one variable at a time (to try and get the value of each individual variable) but isolating it by position would be possible (it's been a while since I've looked at these as the forum seemed a touch dead and I'd moved onto other things). I don't like headers because of the R2 value and the coefficient. The R2 of 0.49 for headers on target suggests that it only accounts for 49% of the variation in goals scored. That, together with a coefficient of 1.2, which in principle is nuts (I will post up the residuals if I still have them to illustrate what I mean), means the variable doesn't sit easily with me.
     
