Goals Added: Deep Dive Methodology

By Matthias Kullowatz

This is part two of our introductory series on Goals Added (g+). Here is part one, where John first introduced it.

Ever since we founded American Soccer Analysis in 2013, I’ve hoped to construct a metric that credits players for actions all over the field, not just for goals and assists. I’ve always wanted something that could be used to ascribe values to players in a currency all soccer fans could understand. In the pre-ASA era, I got my fix analyzing baseball statistics, where analysts are spoiled by tons of publicly available data and a sport that, by rule, creates distinct plays largely independent of all other plays. I wanted to create something like Wins Above Replacement (WAR) for soccer players, but there was a huge lack of data available for soccer compared to baseball, and I thought my dream had died before it ever got started.

With the advent of event-by-event data some years ago, I had my chance, and it only took me five years to put something together. This is it, people. I’m really happy with this first version, but maybe the best part is that it can only get better. Below is my methodological introduction to what we’re calling Goals Added (g+).

Overview

Our objective was straightforward. (I say “our” now because this is really the product of many great minds, and probably thousands of person-hours). We wanted to derive a metric that values every action in units of goals, the currency of soccer. So many actions go unnoticed and/or underappreciated in traditional soccer statistics. Let’s measure them all, we said. And that’s what we did—well, for on-ball actions anyway.

We go into great detail in this primer, but below is a brief outline of the methodology:

  1. Create distinct “plays” referred to as possessions or chains.

  2. Determine the expected goals earned over the rest of the possession at any stage of that possession.

  3. Use statistical modeling techniques to estimate that rest-of-possession value in #2 above, based on details of the “game situation.” A game situation is defined based on where the ball is, which team has it, and how it got there. We use that model to estimate the value of each possession, as of just before every action.

    If we didn’t use statistical models in this way, then most possessions would simply have zero value, and all xG earned would be credited to a small portion of possessions.

  4. Take the xG earned for the rest of this possession and subtract the xG earned for the opposition on the subsequent possession. Thus our possession value is actually a net two-possession value. More detail on this later.

  5. Use the difference in expected possession values (from #4 above) before and after an action occurs (e.g. a pass) to derive the value of that action.

  6. Assign those values to players involved in the action. We believe that in this step we’ve derived some value allocation methods to push these Goals Added methodologies forward, toward more accurately valuing all players at all positions.

Methodology

Variables of Interest

Game situation details: any piece of information about where the ball is and how it got there might be helpful in determining how likely a goal is to be scored on the current possession and how likely a goal is to be scored on the opponent’s next possession. For example, an intercepted pass in a defender’s own penalty box is much costlier to their team than an intercepted pass at midfield.

Target variable: since we want our possession values to be in units of goals, the natural target variable is goals scored. We have chosen to use a proxy for goals scored, which is xG earned (using ASA’s shot-based xG model). The advantage of using shot-xG is that there are many more instances of shots than of goals, which can help any mathematical model home in faster on game situation details that add value (i.e. add to the probability of scoring). In other words, the output for every single on-ball action in every game is a number indicating how much that action increased or decreased that player’s team’s likelihood of scoring a goal (minus the opponent’s likelihood of scoring on the ensuing possession).

Now we need to identify game situations that correspond to xG outcomes (even if those outcomes are zeroes).

Possessions

In order to determine a goals-based value of game situations, we felt like we needed some context in which to consider each game situation—a unit of time, you might say. We settled on “possessions.” We broke game situations down into possessions, defined as chains of actions that end in either a shot or a turnover. We considered:

Too big: A whole game. If our unit of time were a whole game, we would basically summarize how many of each type of game situation there were in each game, and how many goals or xG were scored that game. That method would require a lot of subjective decision making about how to cluster game situations into distinct types, and it wouldn’t be very precise.

Too small: A single action. If our unit of time were a single action, then the only actions with positive value would be shots. It would just be a shots model, but we already have one of those.

Just right: A possession chain. We’ve done lots of work with possession chains in the past, and they’re defined as a sequence of possessing, on-ball actions by a team without allowing the opponent a possessing action. An opponent’s interrupting action doesn’t end the possession, by our definition, unless it is accompanied by a pass attempt, dribble attempt, shot attempt, or foul earned. For example, a possession chain for Team A could begin (and a possession chain for Team B would end) when Team A intercepted Team B’s pass and then attempted a pass. It might further include Player A dribbling up field and then sending in a cross that a teammate heads on goal. A new possession chain for Team B would then begin when Goalkeeper B catches the headed shot by Team A and plays the ball out to a teammate.

Using a possession chain as our “distinct play” allows us to observe how each non-shot action might increase or decrease the probability of a goal a few actions before a shot is even taken. Using a statistical model, the details of each game situation are correlated with the xG earned in that particular possession (minus the xG earned by the opponent in the following possession). This allows us to assign a g+ value to each action, and thus the players taking those actions.

So we’re defining the start of a new possession chain as follows: When a team makes their first possessing on-ball action, thus ending the opposing team’s possession. Here are some possible actions that end one possession chain and start a new one for the opposing team:

  • Pass attempt

  • Dribble attempt - This is an attempt by a player in possession of the ball to beat an opponent. A successful dribble means the player beats the defender while retaining possession; an unsuccessful one means the dribbler is tackled.

  • Carry (derived) - When a player receives a completed pass or clearance, and either loses the ball, or produces another action at least 5 meters from the point of receipt

  • Foul suffered

  • Shot attempt

  • A keeper’s claim

  • A clearance, when it follows other possessing actions - Most clearances are considered defensive or “interrupting” actions

  • An errant touch, when it follows other possessing actions
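To make the chain definition concrete, here is a minimal sketch of how events might be grouped into possession chains. The `(team, action)` tuples and the action labels are our own illustrative assumptions, not the actual data format, and the sketch leaves out the conditional rules for clearances and errant touches.

```python
from typing import List, Tuple

# Hypothetical action labels; real event data will use its own taxonomy.
# Clearances and errant touches (possessing only in some contexts) are omitted.
POSSESSING = {"pass", "dribble", "carry", "foul_suffered", "shot", "keeper_claim"}

def assign_chains(events: List[Tuple[str, str]]) -> List[int]:
    """Return one chain id per event; events are (team, action) in match order."""
    chain_ids = []
    chain = 0
    owner = None  # team currently credited with the open chain
    for team, action in events:
        # A new chain starts only when the other team makes a possessing action.
        if action in POSSESSING and team != owner:
            chain += 1
            owner = team
        chain_ids.append(chain)
    return chain_ids

events = [
    ("A", "pass"), ("A", "pass"),
    ("B", "interception"),  # interrupting only: Team A's chain is still open
    ("B", "pass"),          # first possessing action by B: new chain starts
    ("B", "dribble"),
    ("B", "shot"),
    ("A", "keeper_claim"),  # keeper claims the shot: new chain for A
]
chain_ids = assign_chains(events)
print(chain_ids)
```

Note that the interception on its own stays attached to Team A's chain here; it is the pass that follows which opens Team B's chain.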

Building Context

Each on-ball action can help describe the possession chain in ways that determine possession value. Not only does the ball’s location matter, but also if the ball just arrived from a through ball or a clearance; if the ball was just won from the opposition, or if it’s the 27th pass of the chain; if the possession has been moving rapidly up the field in the past few actions, or stagnating. By stringing actions together, we can derive descriptive columns in the data that can be used to inform the rest-of-possession value of a given game situation.
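As a rough sketch of what "stringing actions together" could look like, the snippet below derives three descriptive columns from a single chain. The `Action` record, its field names, and the four-action window are assumptions for illustration, not the actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    x: float   # yards along the field length, toward the attacking goal
    y: float   # yards across the field width
    t: float   # seconds since the start of the match
    kind: str  # hypothetical label, e.g. "pass" or "recovery"

def context_rows(chain: List[Action], window: int = 4) -> List[dict]:
    """Derive descriptive columns for each action in one possession chain."""
    rows = []
    for i, action in enumerate(chain):
        start = chain[max(0, i - window)]  # shorter look-back early in the chain
        elapsed = action.t - start.t
        rows.append({
            "action_number": i + 1,
            "prior_kind": chain[i - 1].kind if i > 0 else None,
            "vertical_velocity": (action.x - start.x) / elapsed if elapsed > 0 else 0.0,
        })
    return rows

chain = [
    Action(40, 30, 0.0, "recovery"),
    Action(55, 35, 2.0, "pass"),
    Action(80, 40, 5.0, "pass"),
]
rows = context_rows(chain)
for row in rows:
    print(row)
```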

Deriving Possession and Action Values

If a particular game situation is valuable, then it should more often correspond to the team earning good shots in the rest of the possession. Consider a line-splitting pass into the attacking third. The player who receives this pass is likely in a dangerous position. To derive this game situation’s value, we want to try to calibrate a league-wide value of the remainder of the possession following that line-splitting pass. We’ll sift through the data to find the league’s average xG value during the remainder of possessions where players find themselves on the ball in the middle of the pitch in the attacking third, following a quick vertical advancement of the ball. We refer to this value as a “pre-action” or rest-of-possession value. It’s the value of the remainder of the possession before the player has done anything with the ball. It could just as easily be called a “post-action” value of the prior action, so don’t get too hung up on that.

The next step is converting these rest-of-possession values (these mid-possession snapshots) into action values: the value of a pass, dribble, clearance, etc. To value a specific action, like that line-splitting pass, we simply subtract the pre-action value before the pass from the pre-action value of the next game situation after the pass. We refer to that difference as the “action value,” the change in successive possession values.
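The subtraction is simple enough to sketch in a few lines. The function name and the numbers are made up for illustration; `terminal_value` stands for the value of whatever situation follows the chain's final action, expressed from the same team's perspective.

```python
from typing import List

def action_values(pre_action_values: List[float], terminal_value: float) -> List[float]:
    """Each action's value = next situation's value minus this situation's value."""
    following = pre_action_values[1:] + [terminal_value]
    return [after - before for before, after in zip(pre_action_values, following)]

# Illustrative pre-action values for three successive actions in one chain.
values = action_values([0.01, 0.02, 0.05], terminal_value=0.0)
print([round(v, 3) for v in values])
```

One nice property of this construction is that the action values in a chain telescope: they sum to the final situation's value minus the first one, so no value is created or lost along the way.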

(Note: See Appendix for a detailed guide of how pre-action values are calculated)

Why Two Possessions?

We determined it was important to track not only the probability of scoring on the current possession, but also the probability of conceding on the subsequent possession (the one in which your opponent is trying to score).

The reasoning for this should be intuitive. Soccer is a continuously flowing sport in which possession shifts between teams throughout the game, and the location of the ball, or the field position of play, impacts these probabilities. As the ball approaches one goal in a possession, not only does the probability of the ball ending up in that goal increase on the current possession, but the probability of the ball ending up in the other goal decreases for the subsequent possession. Accordingly, each game situation is calibrated with both probabilities recorded so that the value of changes in both probabilities can be assigned to the actions which change the game situations, and then ultimately if one wishes, to the players involved in the actions.

For a more tangible example, consider the valuation of a clearance. In a one-possession valuation methodology, all game situations have a positive value because a team cannot score negative goals—or earn negative xG—during a possession. Thus the possession value before the clearance would be positive. Because the vast majority of clearances end up with the opposing teams, the ensuing possession value would be positive for the opposing team—i.e. negative for the clearing team—because they just gained possession. Translation: these clearances would all be scored with negative values. Because it’s fairly obvious that at least some clearances are necessary and valuable, this one-possession framework would be unable to properly value such actions. In fact, our two-possession methodology scores about 50% of all clearances as having positive value for the clearing team because it relatively improves their position.
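Here is a small sketch contrasting the two frameworks for a single clearance from a team's own box. All four input numbers are invented for illustration and are not model outputs; the function names are ours.

```python
def one_possession_change(own_xg_before: float, opp_xg_after: float) -> float:
    """Clearing team's change in value when only the current possession counts."""
    before = own_xg_before   # can never be negative in this framing
    after = -opp_xg_after    # the opponent now has the ball
    return after - before

def two_possession_change(own_xg_before: float, opp_next_xg_before: float,
                          opp_xg_after: float, own_next_xg_after: float) -> float:
    """Same change, but each value nets out the other team's next possession."""
    before = own_xg_before - opp_next_xg_before
    after = -(opp_xg_after - own_next_xg_after)
    return after - before

# Ball in the clearing team's own box: little chance to score, real risk of conceding.
one = one_possession_change(own_xg_before=0.002, opp_xg_after=0.012)
two = two_possession_change(own_xg_before=0.002, opp_next_xg_before=0.040,
                            opp_xg_after=0.012, own_next_xg_after=0.003)
print(round(one, 3), round(two, 3))
```

With these made-up inputs the one-possession change is negative and the two-possession change is positive, because the starting situation in the second framing already carries the danger of conceding.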

Applications

Now, at the start of every action, our model has calibrated an expected remaining possession value, the pre-action value. If a player starts his action by receiving the ball in a zone worth 0.02 xG, and then makes a line-splitting pass into a region that is calibrated to be worth 0.05 xG, this player has just completed a pass which added 0.03 xG of value (0.02 + 0.03 = 0.05 xG), even though a shot may never get taken (note again that these possession values are derived as the probability of the possessing team scoring minus the probability of the defending team scoring on its next possession). This is perhaps the most obvious application of such a model: players can earn value for good plays even if their teammates later screw up the possession! We were certainly not the first people to think of this, and later this week we’ll publish additional articles that show how this model fits into the collection of existing methodologies that dive into non-shot actions.

The table below shows the top 10 players in 2019 in goals added (g+), ranked by one version of our g+ metrics, value above average. As we mention later in this section, we control to some extent for a player’s position. We calculate a unique average positional value against which to compare each player, based on that player’s distribution of minutes played at various positions. It’s good to see two expected faces up on top, but also encouraging that we see three players on teams with negative overall expected goal differentials (Elis, Blanco, Gil).

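Rank | Player | Team | Position | Minutes | g+ Value
1 | Carlos Vela | LAFC | W | 2860 | 9
2 | Zlatan Ibrahimovic | LAG | F | 2745 | 8.8
3 | Alberth Elis | HOU | W | 2047 | 4.2
4 | Sebastian Blanco | POR | W | 2643 | 4
5 | Jack Elliott | PHI | CB | 3218 | 3.8
6 | Josef Martinez | ATL | F | 2696 | 3.8
7 | Eduard Atuesta | LAFC | DM | 2757 | 3.8
8 | Carles Gil | NER | W | 3208 | 3.5
9 | Mark-Anthony Kaye | LAFC | CM | 2618 | 3.5
10 | Anton Tinnerholm | NYC | FB | 2786 | 3.3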

By valuing specific actions, we can potentially evaluate the pass distribution of goalkeepers and center backs, perform a cost-benefit analysis on the dribbling attempts of a particular player, or determine value lost due to turnovers. For example, we could evaluate crossing actions from various locations or we could compare short corners to long corners. We’ll likely explore these topics and many more in the coming months.

We could sum action values up to the team level, and validate the method to some degree by correlating a team’s aggregate action values to its success, that is, its actual goal differential (GD) and shot-based expected goal differential (xGD). In fact, we’ve done that below. The table shows team differentials in 2019 for g+, total expected goals, and total goals (for minus against).

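Rank | Team | g+ | xGD | GD
1 | LAFC | 35 | 43 | 50
2 | SJE | 11 | 6 | -3
3 | ATL | 8 | 11 | 14
4 | FCD | 7 | 3 | 5
5 | PHI | 7 | 9 | 8
6 | MIN | 7 | 9 | 7
7 | NYC | 6 | 12 | 24
8 | CHI | 6 | 13 | 6
9 | LAG | 3 | 1 | 2
10 | SKC | 3 | 3 | -21
11 | TOR | 1 | 0 | 7
12 | SEA | 1 | -3 | 3
13 | RSL | -2 | -4 | -6
14 | RBNY | -3 | 0 | 1
15 | HOU | -3 | -3 | -14
16 | MTL | -4 | -5 | -15
17 | POR | -5 | -1 | 2
18 | DCU | -6 | -12 | 4
19 | ORL | -6 | 1 | -6
20 | CLB | -7 | -6 | -10
21 | COL | -8 | -8 | 1
22 | NER | -16 | -12 | -8
23 | CIN | -17 | -29 | -45
24 | VAN | -19 | -28 | -18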

Because Goals Added (g+) values were guided by expected goals, there is a very strong correlation between a team’s accumulations of the two in any given season (0.93 across team-seasons in the dataset). Now, that doesn’t imply that we’ve created xG over again. We have now allocated that xG credit to every player on the team, rather than simply those who shoot, and to all regions of the field, rather than just the attacking third. For example, we can use g+ to measure how well a team controls the middle of the field by simply adding up the team’s g+ for and subtracting g+ against for actions that occur in the middle third of the field.

We did just that, calculating each team’s g+ differential in the middle third of the field over the first 17 games of the season. Then we calculated each team’s corresponding goal differential in the second 17 games of the same season. The correlation between the two across 147 team-seasons was 0.39. Though not statistically significantly greater, it is greater than the correlation between past team xG and future GD (0.33), and that was just one slice of the data. The point is, we have added flexibility now in both describing how and where the game is played with a metric calibrated on the scale of goals, and we can use all those new slices to predict future performance.
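For readers who want to try this kind of split-season check themselves, here is a minimal sketch of the correlation step. The six team-seasons are invented numbers, not our data, and `pearson` is a hand-rolled helper rather than anything from our pipeline.

```python
from math import sqrt
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up team-seasons: middle-third g+ differential over games 1-17,
# paired with goal differential over games 18-34 of the same season.
first_half_mid_gplus = [2.1, -0.8, 0.5, -1.9, 1.2, -0.3]
second_half_gd = [6, -2, -1, -7, 3, 4]

r = pearson(first_half_mid_gplus, second_half_gd)
print(round(r, 2))
```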

Defensive Actions

Our methodology naturally lends itself to valuing defensive, or interrupting, actions. We sometimes use the word “interrupting” to imply that the defensive action may not win the ball, but it always interrupts play, which can be valuable. Let’s use two common examples to discuss how we assign value to defensive and interrupting actions.

Tackle: A player for Team A receives a pass on the wing near the attacking third, but he is subsequently tackled. Then Team B corrals the loose ball and gains possession.

Our possession model scores the expected possession value just before the tackle at around +0.005 expected goals. It’s not a particularly dangerous position, but it’s better than Team B’s predicament, and thus positive. We then score the subsequent possession value for Team B, the new possessing team, at the point of ball recovery. It’s about +0.000 net possession value for them—better than when they didn’t have the ball just above, and were in the hole 0.005 goals, but not quite what you’d call an advantageous position either.

Team A’s player lost 0.005 expected goal value by losing possession, so naturally we debit the player on Team A and give that value to the tackler from Team B (and, if applicable, we share credit with the player who recovers the ball). In general, a defensive action gets the difference between the possession value just before the action and the possession value just after, which is to say, the defensive action is credited with what the failed offensive action was debited. Accountants rejoice. In this example, it’s the 0.005 value of the pass received before the tackle minus the 0.000 value of the subsequent possession, meaning the tackle is worth +0.005 g+.

Clearance: We can use this very same methodology to score any defensive action, even one that does not lead to a turnover (an “interrupting” action, if you will). Consider a situation where the attacking team whips in a cross from the wing, and the defensive player clears it from immediate danger, back to the attacking team in the middle third of the field.

Assuming the cross came in from the wing, about where the box starts, our possession model scores such a possession on the wings at about +0.015. Following the clearance, the attacking team is now back about 50 yards from goal, and its expected possession value drops to +0.005. The failed cross loses 0.010 possession value, and the clearer gets credited with the same 0.010 for effectively interrupting the possession.
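Both examples follow the same bookkeeping, which can be sketched as one small function. The function name is ours; the inputs are the values quoted in the tackle and clearance examples, all expressed from the attacking team's perspective.

```python
def defensive_credit(value_before: float, value_after: float) -> float:
    """Credit to the defender equals the value the attacking team lost.

    Both arguments are possession values from the attacking team's perspective.
    """
    return value_before - value_after

# Tackle: Team B recovers at about +0.000 for B, which is -0.000 for Team A.
team_b_value_after_recovery = 0.000
tackle_credit = defensive_credit(0.005, -team_b_value_after_recovery)

# Clearance: the attacking team keeps the ball, but in a worse spot.
clearance_credit = defensive_credit(0.015, 0.005)

print(round(tackle_credit, 3), round(clearance_credit, 3))
```

The attacker's debit is just the negative of the same number, which is why the books balance.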

Edge Cases and Manual Adjustments

The steps described above produce our conceptual underlying action value model, but in evaluating this evaluation tool, we concluded that some adjustments were needed.

Pass-receipt allocation

Virtually all soccer actions, offensive and defensive, are functions of the efforts of multiple players. However, the value of many of these actions can still reasonably be attributed to a single person, both because that player is responsible for the bulk of the credit and because, well, the action only comes attached to one player in the dataset. Perhaps the most obvious example where value should be attributed to more than one player is on completed passes; every successful pass needs a receiver, someone to get available and control the pass.

Reasonable people can disagree about the responsibilities of passers and receivers in various zones on the field.

But we feel safe in eliminating a few options. Giving all 100% of completed pass values to either the passer or recipient seems too extreme, and in both cases leads to noticeable over- and under-valuations of certain positions. Additionally, thinking about the actual game of soccer, a playmaker cannot feed a through ball into a dangerous area if another attacker doesn’t provide a dangerous run in the first place.

Perhaps the most moving argument for such allocation comes from the author of the Twitter poll above, Atlanta’s own Tiotal Football.

If your model for how to evaluate a forward treats him as the final stationary machine in the assembly line of a conversion chain, his programmed task to turn opportunities into goals, or even to turn opportunities into shots, then your model is not describing soccer. In soccer the attacking player is himself a creator of the opportunity, not just the final execution or failure to capitalize on the opportunity another player has created.

Conveniently, we already have a model for evaluating the likelihood of an individual pass being successfully completed to a teammate, our xPass model, and we decided that allocating to the passer the share of the value that corresponds with the expected pass completion probability (xPass%) was fair, with the receiving player receiving the difference (100%-xPass%). If you think about the difficulty of a pass, the xPass score negatively correlates with the expected goals value (“action value”) of a completed pass. In other words, the easier the pass is to complete, the lower the g+ value.

Harder passes to complete tend to be those that are the most valuable (given they are completed; think about through balls and crosses) and tend to require more work and skill by the receiver. The most likely completed passes are attached to the least impactful changes in goal scoring probabilities from the passes themselves, and conversely passes into the penalty area (which on average are more likely than not to be disrupted by the defense) are allocated more equally between the passer and the receiver who has found the space in the dangerous area to receive and shoot, dribble, or pass (or be fouled).

Note that many passes in the run of play are backwards (e.g. from a central midfielder to a central defender), and in general the model would describe those events as decreasing the goal scoring probability in the possession. Such passes are also generally ascribed xPass completion scores closer to 100%, such that by allocating the negative action value of a completed pass backwards via xPass score, the receiver of the ball is not “penalized” much at all for doing their job, i.e. being available behind the ball, and instead the passer of the ball is assigned the lion’s share of the negative pass. If this seems unfair, consider that players have available to them a wide distribution of pass options throughout a game or season, from safe to dangerous, and that the best players will complete more dangerous passes, but also will choose the safe pass over the unnecessarily risky pass more often.

Above we described how we split up the value assigned to completed passes between passers and receivers. In doing so, we only give the passer a fraction of the credit for a completed pass. Thus, it only seems fair that we give the passer a fraction of the (usually negative) credit for an incomplete pass. A cross into the box typically has about a 30% completion rate, but it’s worth around 0.15 g+ when completed, a relatively high value compared to many soccer actions. If completed, the passer would earn 0.30 × 0.15 = 0.045 g+, while the receiver would earn the rest, 0.105 g+.

To summarize, we calculate the value of every pass using our basic underlying methodology—the change in rest-of-possession values from before and after the pass. Then we scale the passers’ credit and debits down, acknowledging that passer success and failure is also a function of their teammates. We use xPass% to scale all passes for the passer. In the case of an incomplete pass, the recipient g+ effectively goes into the ether, for now.
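The allocation rule can be sketched as follows. The function name is ours, the completed-cross numbers come from the example above, and the -0.02 value for the incomplete cross is an invented figure used only to show the sign.

```python
from typing import Tuple

def split_pass_value(action_value: float, xpass: float, completed: bool) -> Tuple[float, float]:
    """Return (passer g+, receiver g+) for one pass.

    action_value: change in rest-of-possession value caused by the pass.
    xpass: expected completion probability of the pass, between 0 and 1.
    """
    passer = xpass * action_value
    # On an incomplete pass, the receiver's share is not assigned to anyone.
    receiver = (1.0 - xpass) * action_value if completed else 0.0
    return passer, receiver

completed_cross = split_pass_value(0.15, xpass=0.30, completed=True)
failed_cross = split_pass_value(-0.02, xpass=0.30, completed=False)

print([round(v, 3) for v in completed_cross])
print([round(v, 3) for v in failed_cross])
```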

Scoring shots

Often forwards receive the ball in zones worth an expected 10% or 20% of a goal. When they score in such cases, it is tempting to give them all the remaining credit up to 100% (technically 99% once you’ve subtracted the 1% expected goals the opposing team has upon kickoff). On average, shooters would break even by design (that is, shooting g+ would average 0.00), but the g+ leaderboards would be dominated by those players who finished well in a given year. Considering finishing outcomes regress heavily toward league average year-over-year, it seems like it would undermine the whole metric if the leaderboard were constructed on the whims of chance.

So here’s what we do. The shooter is credited with the expected goal value of the shot, regardless of outcome, and debited the expected possession value before the shot—the opportunity cost of taking the shot in the first place. On average, this difference is about 0, as noted above. That's right, unlike in the solely xG models that the soccer analytics movement has relied on for years, shooting has almost no net value. Then we try to tease out what finishing skill a player may have by crediting players who get good shots on target (or off the post). Such shots produce significantly greater expected goals from secondary opportunities (rebounds and corners). This shot placement credit comes from our new xRebound model.
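In code, the shooter's ledger might look like the sketch below. We assume here that the placement credit is simply added on top; `placement_credit` is a stand-in for whatever the xRebound model produces, and the numbers are illustrative only.

```python
def shot_credit(shot_xg: float, pre_shot_value: float, placement_credit: float = 0.0) -> float:
    """g+ for taking a shot, independent of whether it goes in.

    shot_xg: expected goal value of the shot itself (credited).
    pre_shot_value: possession value just before the shot (debited).
    placement_credit: hypothetical stand-in for the xRebound-based credit.
    """
    return shot_xg - pre_shot_value + placement_credit

# A shot worth exactly what the possession was already worth nets zero...
neutral = shot_credit(shot_xg=0.10, pre_shot_value=0.10)
# ...unless the shot is placed well enough to earn some placement credit.
on_target = shot_credit(shot_xg=0.10, pre_shot_value=0.10, placement_credit=0.015)

print(round(neutral, 3), round(on_target, 3))
```

Notice that the outcome of the shot never enters the function, which is the whole point of the adjustment.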

The average shot that misses the goal and the posts altogether is worth -0.005 value by our method, while the average shot on target is worth +0.015. If you’re worried that forwards aren’t getting enough credit, recall Tiotal Football’s word art above. Forwards are getting a lot of credit for being available and receiving the ball (or winning headers) in dangerous places in front of goal. To some extent, we are imposing our belief on the methodology that the hard work is done in getting the ball in dangerous locations, and that the shot is a formality. Indeed, here is the 2019 leaderboard for g+ on pass receipts, through our aforementioned pass receiving allocation method. Maybe you’ve heard of some of these guys.

Rank | Player | Minutes | Receiving Value
1 | Carlos Vela | 2985 | 9.23
2 | Zlatan Ibrahimovic | 2862 | 9.03
3 | Alberth Elis | 2138 | 4.34
4 | Sebastian Blanco | 2768 | 4.24
5 | Josef Martinez | 2808 | 3.86
6 | Jack Elliott | 3374 | 3.8
7 | Eduard Atuesta | 2874 | 3.74
8 | Carles Gil | 3343 | 3.55
9 | Mark-Anthony Kaye | 2729 | 3.5
10 | Anton Tinnerholm | 2919 | 3.38

Penalties: earned and taken

Ah, penalties. The one great flaw of the beautiful game. Penalties are converted into a goal nearly 80% of the time in MLS, so earning one is probabilistically similar to scoring. You shouldn’t be surprised then that we have opted to reduce our methodology’s inherent action values of both earning a PK and finishing one.

A penalty is a rare event in our data in that the taker didn’t necessarily earn the kick (and if they did something to help earn the kick, they received credit for those specific actions earlier). But most game situations in which a penalty is awarded don’t have a rest-of-possession value of 0.80 just before the whistle. For the same reason we don’t want to give a shooter around 0.80 or 0.90 g+ when they score in the run of play, we don’t want to take 0.80 g+ away from a player when they miss a penalty (or give them 0.20 when they score, for that matter). In this case, because it’s hard to actually figure out which PK takers are truly good, we have opted to give PK takers 0.00 g+ credit, regardless of outcome.

Earning a PK, however, is something we want to reward. The typical value of earning a PK, using the underlying, unadjusted methodology, is about 0.60 g+. That is because the rest-of-possession value just before the foul is usually between 15% and 20% probability of scoring, while a PK is converted with between 75% and 80% probability. The typical difference is thus about 0.60 expected goals. We have opted instead to give PK earners 0.10 g+. Why so little? Basically, we believe that most PKs are largely incidental, in that the player earning the PK was mostly in the right place at the right time, and symmetrically the player conceding the penalty was mostly in the wrong place at the wrong time. Why 0.10 g+? We thought about the value of the unrealized open-play opportunity—if the player were not fouled—as similar to completing a successful dribble in the box to open up space. The average value of such a successful dribble in the middle of the box, inside the width of the six, is 10%.
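The penalty overrides amount to two constants plus the unadjusted arithmetic they replace. The sketch below uses the midpoints of the ranges quoted above; the function and constant names are ours.

```python
PK_TAKER_CREDIT = 0.00   # fixed g+ for the taker, regardless of outcome
PK_EARNER_CREDIT = 0.10  # fixed g+ for the player who draws the penalty

def unadjusted_pk_earned(pre_foul_value: float, pk_conversion: float) -> float:
    """What the underlying methodology would give for earning a PK."""
    return pk_conversion - pre_foul_value

def adjusted_pk_credit(role: str) -> float:
    """Manual override applied in place of the unadjusted values."""
    if role == "taker":
        return PK_TAKER_CREDIT
    if role == "earner":
        return PK_EARNER_CREDIT
    raise ValueError(f"unknown role: {role}")

# Midpoints of the quoted ranges: 15-20% before the foul, 75-80% conversion.
raw = unadjusted_pk_earned(pre_foul_value=0.175, pk_conversion=0.775)
print(round(raw, 2), adjusted_pk_credit("earner"), adjusted_pk_credit("taker"))
```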

Other set pieces

At this point, we have opted to allow the underlying methodology to handle all set pieces (other than PKs). The passer or shooter has a more balanced distribution of options and corresponding action values, which can’t be said about a PK. As shown in the table below, the action value generated from set pieces, on average, is in line with typical passes and shots.

Action | Count | Avg g+ value
Shot | 62,391 | 0.008
Corner | 36,763 | 0.006
FK (Shot) | 2,353 | 0.004
FK (Pass) | 112,099 | 0.002
Other Action | 2,423,480 | 0.001
Pass | 2,163,989 | 0.001
Throw in | 203,348 | 0.001

Wrapping Up

If you include the appendix, we’ve spilled 5,000 words on how we ascribe goal-equivalent values to on-ball actions. It’s a lot of words. I think we can summarize them like this.

There is a lot of statistical theory and logic that make the underlying methodology a solid foundation for assessing action values and player contributions. And then we’ve tweaked the methodology to handle some key edge cases, like receiving passes, taking shots, and earning and taking PKs. These tweaks are the product of long discussions among many ASA contributors to adjust the methodology to reflect what’s actually happening on the field. Finally, as you’ll see later this week, we’ve produced insightful, and beautiful, visuals to convey what this metric can tell us.

We hope, like us, you’ll learn something new about the game you love.

————————————————————————————————————————————————————————-

Appendix: Mathematical Models

There are many ways to calculate the possession pre-action expected goals values, or to “sift through the data,” as we suggested vaguely before. Here are some considerations.

  • Using the “line-splitting pass” example from before, we could segment out all line-splitting passes that ended up roughly in the location where the earlier one was received: middle of the field, just outside the box (zone 14). Then we could further filter down to such passes that also came from similar possessions to this one (e.g. same pace of play, pass origin). And then we could average out the amount of possession xG earned on all such possessions following such a line-splitting pass. In many cases, where such actions are frequent in the data, this method would work well. However, in some zones, such filtering might lead us to just a handful of possessions and a bad estimate of value.

  • Enter regression. This is what regression was made for, to identify details about an observation and to calibrate the relationship between those details and the target metric. In this way, you don’t have to find passes within a few feet of this one, and instead you can derive a basic rule for, say, how proximity to the goal affects possession value. A linear regression model will calibrate coefficients for each predictor variable you create. So “proximity to goal” could be one variable, and the regression will fit a negative coefficient for you—the shorter the distance from the ball to the goal, the greater the expected possession value. We could have a variable for horizontal distance to center of pitch, and indicators for whether the ball was received from a turnover, short pass, long ball, through ball, etc.

However, linear regression has its shortcomings. What if you have the ball on the attacking endline of the pitch just outside the side of the box? Your proximity to goal might be about the same as if you were straight on just outside the top of the box, but those two positions obviously have different values.

  • Let’s combine the concept of regression in option #2 with that of segmentation of game situations like the line-splitting pass in option #1. There might not be many on-ball actions following, for example, 1) a line-splitting pass, 2) received close to a specific location on the field, 3) following 5+ passes that don’t advance the ball closer to the goal, and 4) directly following a switch pass. But if you were to take subsets of three of those four conditions, then there would be more observations in each bucket. You could look at actions satisfying conditions 1-2-3, then 1-2-4, then 1-3-4, and then 2-3-4, and calculate four separate average xG values, one from each filtered bucket. Then you could weight those four averages together to get a more reliable estimate of the value of this particular game situation.
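A toy version of this three-of-four bucketing is sketched below. The condition names and xG numbers are invented, and weighting each bucket's average by its number of observations is our assumption; the text above doesn't commit to a particular weighting.

```python
from itertools import combinations
from typing import List, Set, Tuple

def partial_subset_estimate(rows: List[Tuple[Set[str], float]],
                            conditions: List[str], keep: int = 3) -> float:
    """Count-weighted average of bucket means over all `keep`-sized condition subsets.

    rows: (set of conditions the possession satisfies, possession xG earned).
    """
    weighted_sum, total_weight = 0.0, 0
    for subset in combinations(conditions, keep):
        bucket = [xg for flags, xg in rows if set(subset) <= flags]
        weighted_sum += sum(bucket)  # equals len(bucket) * mean(bucket)
        total_weight += len(bucket)
    if total_weight == 0:
        raise ValueError("no possessions match any subset")
    return weighted_sum / total_weight

conditions = ["line_splitting", "near_location", "slow_buildup", "after_switch"]
rows = [
    ({"line_splitting", "near_location", "slow_buildup"}, 0.06),
    ({"line_splitting", "near_location", "after_switch"}, 0.10),
    ({"line_splitting", "slow_buildup", "after_switch"}, 0.02),
    ({"near_location", "slow_buildup", "after_switch"}, 0.04),
    (set(conditions), 0.08),               # satisfies all four conditions
    ({"slow_buildup", "after_switch"}, 0.00),  # matches no three-condition bucket
]
estimate = partial_subset_estimate(rows, conditions)
print(round(estimate, 4))
```

A possession that satisfies all four conditions lands in every bucket, so it naturally carries more weight than one that satisfies only three.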

This “partial subsetting” described in option #3 is basically what a forest of decision trees does. At each “decision,” individual tree models look for places to split a predictor variable into two groups such that the average xG value earned in one split is significantly greater than in the other one. Then the tree looks for another split that makes a big difference. Then another one, etc. So maybe the first split is made at the 18-yard box (e.g. x > 97 for a 115-yard field length), and then the second split is made to include locations within 20 yards of the center of the pitch (e.g. |y - 40| < 20 for an 80-yard field width). So far in this example, the model has now defined four subsets of the data, combining vertical proximity to goal and horizontal proximity to the center of the pitch. We’ll keep growing trees with up to five or 10 splits each, allowing the algorithm to effectively make and combine thousands of segments, in order to calibrate the value of any number of situations using league average possession xG values earned across a few splits at a time.
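To show what "looking for a split" means mechanically, here is a minimal sketch of a single split search on one predictor. It picks the threshold that minimizes the squared error around each side's average, a standard regression-tree criterion that we are assuming for illustration. The data points are invented.

```python
from typing import List, Optional, Tuple

def best_split(xs: List[float], ys: List[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, sse) for the split `x > threshold` with the lowest
    total squared error, or None if xs has fewer than two distinct values."""
    def sse(values: List[float]) -> float:
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for threshold in sorted(set(xs))[:-1]:  # leave at least one point on the right
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Invented data: x is yards up a 115-yard field, y is possession xG earned.
x_location = [60, 75, 90, 98, 105, 110]
possession_xg = [0.00, 0.01, 0.01, 0.09, 0.11, 0.10]

split = best_split(x_location, possession_xg)
print(split[0])
```

On this toy data the search separates the three locations nearest the goal from the rest, which is the same flavor of split as the 18-yard-box example above.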

We use XGBoost, a gradient-boosted forest of decision trees (don’t be confused by the name; it has nothing to do with expected goals, that’s just a coincidence). Instead of “growing” trees randomly, XGBoost grows them one at a time, taking all previous trees into account when growing the next one. The algorithm continues to build these trees, making more and more combinations of splits, until it reaches a stopping criterion, typically a point at which it can’t identify any more splits that would significantly improve the model. The predictor variables we feed into the model are as follows:

  • (x,y) location of the game situation

    • Additional penalty box indicator to home in on that important region more easily

  • Action number: number of actions in the possession chain to this point (pre-dead ball)

  • Prior action type that led to this game situation

    • Turnover, pass, dribble, carry, shot, foul suffered, ball recovery, defensive action

  • Longball indicator: was the prior pass a longball?

  • Cross indicator: was the prior pass a cross?

  • Through ball indicator: was the prior pass a through ball?

  • Header indicator: For passes and shots, is the action a header? There is no explicit indicator for a ball in the air, which significantly reduces the value of the possession, especially near the goal. So we use the current action’s header indicator.

  • Corner indicator: is this action a corner kick?

  • Actions since corner kick: number of actions since a corner kick was taken

  • Throw in indicator: is this action a throw in?

  • Actions since throw in: number of actions since a throw in was taken

  • Free kick indicator: is this action a free kick?

  • Actions since free kick: number of actions since a free kick

  • Goal kick indicator: is this action a goal kick?

  • Actions since goal kick: number of actions since a goal kick

  • Vertical velocity of possession: the vertical (i.e. field length) distance covered in the last four actions (in the last two or three actions early on in possessions), divided by time

  • Horizontal velocity of possession: the horizontal (i.e. field width) distance covered in the last two actions, divided by time
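To give a feel for growing trees "one at a time, taking all previous trees into account," here is a bare-bones sketch of boosting with one-split trees on a single predictor. This is plain least-squares gradient boosting written from scratch, not the XGBoost library (which adds regularization and much more), and the data, round count, and learning rate are all invented for illustration.

```python
from typing import Callable, List, Tuple

def fit_stump(xs: List[float], targets: List[float]) -> Tuple[float, float, float]:
    """Best single split on x; returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def boost(xs: List[float], ys: List[float], rounds: int = 50,
          learning_rate: float = 0.3) -> Callable[[float], float]:
    """Fit each new stump to the residuals left by all previous stumps."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (right_mean if x > threshold else left_mean)
                 for x, p in zip(xs, preds)]

    def predict(x: float) -> float:
        return base + sum(learning_rate * (rm if x > t else lm) for t, lm, rm in stumps)
    return predict

# Invented data: distance to goal in yards vs. possession xG earned.
distance_to_goal = [10, 20, 30, 40, 50, 60]
possession_xg = [0.20, 0.12, 0.06, 0.03, 0.01, 0.00]
model = boost(distance_to_goal, possession_xg)
print(round(model(10), 3), round(model(60), 3))
```

The real model does the same thing across all of the predictors listed above at once, with deeper trees and a proper stopping rule.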