Signal and Noise in MLS
Some Nate Silver guy wrote a whole book about "signal" and "noise" in data, so it must be important, right? Sports produce a lot of statistics, and it turns out that some of those statistics are pretty meaningless; that is, pretty noisy. Say a pitcher's ERA is sitting below 3.00 after eight starts, but he has more walks than strikeouts. Baseball sabermetricians will tell you that the low ERA is mostly noise, but that the high walk rate is a signal for impending doom. Or say an MLS team leads the league in points per match, but it allows more shots than it earns for itself (note: this team is called the Montreal Impact). Soccer nerds like me will tell you that its position in the standings is mostly noise, and that its low shots ratio is a signal for impending doom, or for something worse than first place, anyway.
The reasoning behind both examples above is basically the same. Pitchers' ERAs, like soccer teams' points earned, are highly variable and unpredictable, while strikeout-to-walk ratios and shots ratios are more consistent. It's better to put your money on something consistent and easy to predict, rather than something variable and hard to predict. Duh, right?
So here's why we like shots data 'round these parts. Below I have provided two tables of MLS data, one from 2012 and one from 2013. I split each season into two halves and then measured how well each stat's first-half total linearly predicted its own second-half total. Did teams that scored lots of goals early in the season also score lots of goals later in the season? That's the kind of question answered here.
| 2012 MLS Stat | R² | P-value | 2013 MLS Stat | R² | P-value |
|---|---|---|---|---|---|
| Blocked Shots | 37.1% | 0.6% | Shots off Goal | 34.8% | 0.8% |
| Total Attempts | 26.1% | 2.5% | Total Attempts | 34.5% | 0.8% |
| Goals | 20.3% | 5.3% | Shots on Goal | 29.4% | 1.7% |
| Points | 20.1% | 5.5% | Points | 4.1% | 40.7% |
| Shots on Goal | 18.2% | 6.9% | Blocked Shots | 1.7% | 60.0% |
| Shots off Goal | 3.6% | 43.7% | Goals | 1.5% | 61.6% |
As an example of what this means, let's consider the attempts stat. Remember that an attempt is any effort in the direction of the goal, so basically an attempt is any shot: on target, off target, or blocked. In each of the past two seasons, MLS teams' attempts totals in the first half of the season helped predict their attempts totals in the second half, explaining 26.1% and 34.5% of the variability in second-half attempts, respectively. Those percentages might not seem high, but the MLS season is short, and statistically significant predictors are hard to find.
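For anyone who wants to try this at home, here is a minimal sketch of that kind of split-half check. It assumes a plain least-squares fit of second-half totals on first-half totals, which is my reading of "linear predictive power" rather than a documented recipe. The function names and the six-team attempts totals are made up for illustration; they are not real MLS data. The p-value comes from the usual t-test on the correlation, with the t density integrated numerically so that nothing beyond the standard library is needed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length sequences of at least 3 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(r_squared, n, steps=2000):
    """Two-sided p-value for a one-predictor regression on n points, given its R^2."""
    if r_squared >= 1:
        return 0.0
    df = n - 2
    t = math.sqrt(r_squared * df / (1 - r_squared))
    # Simpson's rule (steps must be even) for the area under the t density on [0, t].
    h = t / steps
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # Whatever lies outside [-t, t] is the two-sided tail probability.
    return max(0.0, 1 - 2 * area)

def split_half_r2(first_half, second_half):
    """Return (R^2, p-value) for predicting second-half totals from first-half totals."""
    r = pearson_r(first_half, second_half)
    return r * r, two_sided_p(r * r, len(first_half))

# Hypothetical attempts totals for six made-up teams (NOT real MLS numbers).
first_half_attempts = [210, 185, 240, 198, 225, 170]
second_half_attempts = [205, 190, 228, 215, 210, 182]

r2, p = split_half_r2(first_half_attempts, second_half_attempts)
print(f"R^2 = {r2:.1%}, p-value = {p:.1%}")
```

As a sanity check, `two_sided_p(0.371, 19)` comes out at roughly 0.6%, in line with the 2012 blocked-shots row, if you assume one data point for each of the 19 teams in the league that year. The table does not state its sample size, so treat that as my assumption.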
In baseball, this kind of self-prediction has been referred to as "stabilization." Stabilization is important because, as mentioned above, it means that a stat is consistent, and that a team is likely to replicate its results in the future. This MLS season, points earned during the first 10 matches were essentially worthless at predicting points earned in the next 10 matches. Even over the 34 matches each team played in 2012, the stabilization for points earned was not as strong as that of attempts or goals scored.*
The next step is figuring out what predicts future points earned, since points do a pretty lame job of predicting themselves. But I'll leave that for another post after I have gathered data going back a few more seasons. The number one takeaway here is that some stats can only tell us what happened, but not what will happen. There is another group of stats that are doubly important because they also stabilize, predicting themselves even from smaller sample sizes. Those stabilizing stats (like shot attempts) are the signal amid the sea of noise known most places as "football."
*Seattle has only played 21 games, so I cannot do 11-and-11 splits yet. Also, as for why shots off goal and blocked shots have essentially switched places, I would wager that's more due to how they are (somewhat) subjectively categorized, but who knows.