Goals Added Model Updates
Our Goals Added (g+) tables in the app have been updated with a few tweaks to the methodology, and because everyone loves methodology articles, I will share the details with you.
Model features
The g+ methodology includes a collection of (mostly) xgboost models. xgboost stands for eXtreme gradient boosting, but confusingly has the letters “xg” in it. There is no relationship to expected goals. Most of the models include predictor variables related to the “circumstances of the possession,” as I have vaguely noted below. That’s where most of the recent changes to modeling took place. Here’s a summary of the models for background:
Pre-shot expected goals (xG): this model looks at the details of the shot origin location and the circumstances of the possession to determine the probability such a shot will score. Essentially, this gives us the shot’s expected value at the time of shooting, measuring the quality of the opportunity.
Post-shot expected goals (PSxG): in addition to information about the shot origin and circumstances of the possession, we also use the shot’s placement in the goal mouth to determine the probability of a goal.
Expected rebound value (xRebound): given information about the shot, including pre-shot information and post-shot information, we calculate the expected value of remaining shots during the possession.
Expected parry value (xParry): given all the information for xRebound and information about where the keeper parried the ball, we estimate the expected value of the remaining shots during the possession.
Expected pass rate (xPass): given information about the pass’s origin location, direction of travel, and circumstances of the possession, we estimate the probability that the pass is completed.
Expected possession value: given information about the circumstances of the possession, we estimate the likelihood that this possession will end in a goal for the possessing team minus the likelihood that the ensuing possession will end in a goal for the now-defending team.
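To make the sign convention of expected possession value concrete, here is a minimal sketch of the subtraction described above. The function name and the probabilities are made up for illustration; in the real framework both numbers would come from model output, not hand-typed constants.

```python
def possession_value(p_score: float, p_concede_next: float) -> float:
    """P(this possession ends in a goal) minus
    P(the opponent's ensuing possession ends in a goal)."""
    return p_score - p_concede_next


# Illustrative numbers only (assumptions, not model output).
attacking_third = possession_value(p_score=0.03, p_concede_next=0.01)
own_box = possession_value(p_score=0.004, p_concede_next=0.02)
```

Under these made-up inputs the first value is positive and the second is negative, which is the point of the subtraction: having the ball in a dangerous spot near your own goal can be worth less than zero.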
That thrilling summary of our models serves as a reminder that the circumstances of the possession are central to most of the engines that drive Goals Added. We added some variables to our models to better capture all the different circumstances that can come up during play. One of the insights we’ve gleaned from Where Goals Come From--the work of Jamon and Carlon--is that where the ball comes from is a key driver of the likelihood a shot finds its way into the back of the net. As a simple example, think of a player who receives the ball at the top of the box and shoots. Whether they received the pass as a “cutback” (requiring no turn) or from the midfield (requiring a turn) ought to affect the probability of them scoring the ensuing shot. And it does. We took that idea and expanded it to many of our models, as one of the ways we now better describe the possession and how the ball got to where it was.
Here are details about how we better quantify the circumstances of the possession.
xVelocity. This is the velocity of the possession vertically up the field, and it was formerly captured in just one variable. We now use three distinct measures of vertical velocity over different look-back windows: the last two actions, the last four actions, and the whole possession so far. This allows the algorithm to better learn when the overall speed of play over many actions affects the probability of scoring, and when only the last few actions drive the possession result.
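The three look-back windows can be sketched as follows. This is an illustration under stated assumptions, not the production feature code: I am assuming each action is recorded as a (time in seconds, x position) pair, that x grows toward the opponent's goal, and that "looking back N actions" means comparing with the ball location N entries earlier. The function and feature names are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed record: (time_seconds, x_position), x increasing toward the opponent's goal.
Action = Tuple[float, float]


def vertical_velocity(actions: List[Action], lookback: Optional[int] = None) -> float:
    """Net movement up the field per second over a look-back window.

    lookback=None means "since the start of the possession".
    """
    if len(actions) < 2:
        return 0.0
    # Clamp the window so short possessions fall back to the possession start.
    start_idx = 0 if lookback is None else max(0, len(actions) - 1 - lookback)
    t0, x0 = actions[start_idx]
    t1, x1 = actions[-1]
    elapsed = t1 - t0
    return (x1 - x0) / elapsed if elapsed > 0 else 0.0


# Made-up possession: slow build-up, then a quick move forward.
possession = [(0.0, 20.0), (2.0, 25.0), (4.0, 24.0), (5.0, 40.0), (7.0, 60.0), (8.0, 64.0)]
features = {
    "x_velocity_last_2": vertical_velocity(possession, lookback=2),
    "x_velocity_last_4": vertical_velocity(possession, lookback=4),
    "x_velocity_possession": vertical_velocity(possession),
}
```

In this toy possession the two-action window reports a faster move up the field than the whole-possession window, which is exactly the kind of difference the three variables are meant to expose to the algorithm.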
yVelocity. This is the velocity of the possession horizontally, or side to side, and it was formerly captured in just one variable. For the physics nerds out there, side-to-side “velocity” was never strictly a velocity, but actually a speed--how fast the ball had moved horizontally, regardless of direction, over the past two actions. Thus a pass right back to the original source could add horizontal velocity. This was our way of capturing the ensuing defensive movement and potential for the defense to be out of shape. We kept that variable, on the assumption that skipping the ball back and forth could have some value in opening up the possession, and added two more horizontal velocity metrics (which are, actually, velocities by definition). We now include in the models the yVelocity of the last action, as well as the yVelocity of the last two actions combined (where passing back to the source would result in a yVelocity of zero). Now the algorithm has a lot more information about exactly how the ball has moved side to side in the last few actions.
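The difference between the original speed variable and the two new velocities is easiest to see with a pass that goes wide and comes straight back. The sketch below assumes each action is a (time in seconds, y position) pair and that speed is total horizontal distance over elapsed time; those details, and all names, are my assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Assumed record: (time_seconds, y_position), y measured across the width of the field.
Action = Tuple[float, float]


def horizontal_features(actions: List[Action]) -> Dict[str, float]:
    """Side-to-side speed and velocities from the last three ball locations."""
    if len(actions) < 3:
        raise ValueError("need at least three locations to look back two actions")
    (t0, y0), (t1, y1), (t2, y2) = actions[-3:]
    return {
        # Speed: total horizontal distance covered, regardless of direction.
        "y_speed_last_2": (abs(y1 - y0) + abs(y2 - y1)) / (t2 - t0),
        # Velocity of the last action only: signed.
        "y_velocity_last_1": (y2 - y1) / (t2 - t1),
        # Velocity of the last two actions combined: signed net movement.
        "y_velocity_last_2": (y2 - y0) / (t2 - t0),
    }


# A pass out wide and straight back to the source (made-up coordinates).
out_and_back = [(0.0, 30.0), (2.0, 50.0), (4.0, 30.0)]
feats = horizontal_features(out_and_back)
```

Here the speed is positive because the ball covered real ground, while the combined two-action velocity is zero because it ended up where it started.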
How did we get here? We have been using a few metrics to describe to the algorithm how the ball got to where it is. In addition to velocities, we feed the algorithm information about how long this possession has gone on since the last stoppage of play or turnover, and we tell it what the last action actually was (e.g., successful pass or dribble, turnover, clearance, etc.). We’ve now added three new descriptors:
Whether the prior action was a pass in the air.
The angle at which the ball arrived at its current location. This makes the most sense for completed passes, but also may add information about errant bounces.
The distance from which the ball arrived at its current location.
Instead of giving the algorithm the exact (x,y) location of the prior action, we try to help it out a bit by parameterizing that information into what we think is most relevant. It’s similar to how we parameterize the location of a shot as distance to goal and goal mouth available.
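As a rough sketch of that parameterization, the prior and current (x,y) locations can be turned into an arrival distance and angle like this. The angle convention (0 degrees for a ball traveling straight up the field, 180 for one traveling straight back) and the function name are assumptions for illustration; the article does not specify the convention actually used.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y), with x increasing toward the opponent's goal


def arrival_features(prev_xy: Point, curr_xy: Point) -> Tuple[float, float]:
    """Return (distance, angle_degrees) describing how the ball arrived.

    Assumed convention: 0 degrees means the ball traveled straight up the
    field; 180 degrees means it traveled straight back toward our own goal.
    """
    dx = curr_xy[0] - prev_xy[0]
    dy = curr_xy[1] - prev_xy[1]
    distance = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx))
    return distance, angle


# Made-up coordinates: a diagonal pass forward, and a cutback from the endline.
forward_distance, forward_angle = arrival_features((50.0, 30.0), (53.0, 34.0))
cutback_distance, cutback_angle = arrival_features((100.0, 30.0), (90.0, 30.0))
```

Under this convention the cutback arrives with an angle of 180 degrees, so the receiver is already facing goal, while the forward pass arrives at a much shallower angle. Two numbers like these tell the algorithm more directly what we think matters than a raw prior location would.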
The net effect of these changes is not huge in aggregate, and we didn’t expect it to be. So long as the g+ framework is based only on on-ball actions, I expect most changes to methodology will lead to just minor variations in player and team-level g+.
Enjoy the soccer!