Model Update: Coefficient Blending
By Matthias Kullowatz (@mattyanselmo)
With our most recent app update, you might notice that some numbers in the xGoals tables have changed for past years, where it wouldn't normally make sense to see changes. As an example, Josef Martinez had 29.2 xG in 2018, but the updated app shows 28.7 (-1.7%). No, this is not an Atlanta effect, though I can understand why you might support such an effect. Gyasi Zardes lost 0.5 xG as well (-2.4%), and no one dislikes Columbus.
We have updated our xGoal models with the 2018 season's data, and that update is the culprit behind all the discrepancies since the last version of the app. I have already cited the two largest discrepancies by magnitude, so this isn't some major overhaul of the model. In fact, only 2018's xG values have been materially adjusted.* The new model estimates 35.6 fewer xGoals in 2018 than the old one did, equivalent to a 2.8% drop.
Before we get to the why, I’ve included a few tables showing the largest effects. Because we use a logistic regression model, and because coefficients in such a model are more multiplicative in impact than they are additive, it’s not surprising that the largest raw discrepancies are among those with the most xG in 2018.
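To make the multiplicative point concrete before the tables, here's a toy sketch in Python (made-up numbers, not our actual model). Shifting every shot's log-odds down by a small constant scales each shot's probability by a nearly constant factor, so players with the most xG lose the most in raw terms:

```python
import numpy as np

# Toy illustration: a small constant shift in the log-odds (what a tweak to
# a season coefficient does) scales every shot's probability by a nearly
# constant factor, so high-xG players lose the most raw xG.
p = np.array([0.03, 0.10, 0.30])              # per-shot xG before the tweak
log_odds = np.log(p / (1 - p))
p_new = 1 / (1 + np.exp(-(log_odds - 0.05)))  # season effect drops by 0.05
print(p_new / p)  # roughly [0.953, 0.956, 0.965]: each shot keeps ~95-97%
```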
| Player | Season | xG (before) | xG (now) | Difference |
|---|---|---|---|---|
The greatest ratio discrepancies are, conversely, among those with the fewest xG (table filtered to those with more than 1.0 xG so that you might actually recognize a few names).
| Player | Season | xG (before) | xG (now) | Ratio |
|---|---|---|---|---|
Because there is some averaging that goes on at the team level, most teams were impacted very similarly, losing between 1.0 and 2.0 expected goals (both for and against). Even the largest changes in expected goal differential (xGD) came out to less than 1.0.
| Team | Season | xGF (before) | xGF (now) | Difference |
|---|---|---|---|---|

| Team | Season | xGA (before) | xGA (now) | Difference |
|---|---|---|---|---|

| Team | Season | xGD (before) | xGD (now) | Difference |
|---|---|---|---|---|
As for the cause of this one-time discontinuity in xG, our model fits individual coefficients, or effects, to each season. This guarantees that the sum of goals is equal to the sum of xGoals in any given year, which helps to normalize things like xGD from season to season. Because expected goals are a key part of our predictive models, controlling for the goal scoring environment of a given season is important. Baseball fans out there may notice this is similar to “+” stats, like OPS+, which adjust for era (among other things). We are simply defining an era as a season.
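For the curious, here's a minimal sketch of how a season effect works in a shot-level logistic regression. The data, features, and column names are invented for illustration, and our actual model uses far more predictors:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a toy shot-level dataset with a slightly different scoring
# environment in each season (all numbers here are made up).
rng = np.random.default_rng(42)
n = 5000
shots = pd.DataFrame({
    "distance": rng.uniform(5, 35, n),        # yards from goal
    "header":   rng.integers(0, 2, n),        # 1 if headed shot
    "season":   rng.choice(["2017", "2018"], n),
})
log_odds = (0.8 - 0.16 * shots["distance"] - 0.5 * shots["header"]
            + 0.25 * (shots["season"] == "2018"))
shots["goal"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# C(season) adds one coefficient per season. Because an unpenalized logistic
# fit solves the maximum-likelihood score equations, each season's fitted
# probabilities sum to that season's observed goal total.
fit = smf.logit("goal ~ distance + header + C(season)", data=shots).fit(disp=0)
shots["xg"] = fit.predict(shots)

print(shots.groupby("season")[["goal", "xg"]].sum())  # columns match per season
```

That last property, which falls out of the math of maximum likelihood, is exactly the calibration described above: goals equal xGoals within every season.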
The issue arises when a new season starts and we don't have enough data to properly calibrate an effect for it. Last year, we simply applied the 2017 coefficient to all shots in 2018, and now we are seeing the sudden effect of giving 2018 its own coefficient. So what are we going to do about it?
Well, it's not a huge problem. The rankings on our app tabs aren't going to change noticeably, and our predictive models won't produce materially different probabilities for game outcomes. But we will ensure that the discrepancy is less sudden going forward. Every match week in which we integrate new data into the app, we will refit the model with a new-season coefficient. We will then weight the prior week's model predictions against the new week's predictions to produce xG estimates for the new season, with the weights determined by how far through the season we are at that time: early on, the blend leans on the established coefficients, and by season's end it relies almost entirely on the new season's own coefficient. Thus, each week there will be tiny changes to the xG calculations, but you won't notice them; it won't take half an expected goal away from Josef or anything like that.
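Concretely, the weekly blend might look something like the sketch below. This is illustrative only: the function name, the linear weighting, and the 34-week season length are assumptions for the example, not our exact implementation.

```python
def blend_xg(xg_prev, xg_refit, weeks_played, season_weeks=34):
    """Blend the previous update's xG estimate for a shot with the estimate
    from the freshly refit model that includes a new-season coefficient.

    Early in the season the blend leans on the existing estimate (which
    carries the prior season's scoring environment); by season's end it
    relies almost entirely on the new season's own coefficient.
    """
    w = min(weeks_played / season_weeks, 1.0)  # 0 early, 1 by season's end
    return (1.0 - w) * xg_prev + w * xg_refit

# e.g. six weeks in, a shot the refit rates at 0.20 xG but the previous
# model rated at 0.22 blends to ~0.216
print(blend_xg(0.22, 0.20, weeks_played=6))
```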
*In aggregate, no season other than 2018 saw its total xGoals change by more than 0.1…TOTAL. Only a handful of players in seasons before 2018 will see their xG figures change by a single tenth of a goal. The rest won't change at all.