Ranking NFL Teams Using Maximum Likelihood Estimation

Written by AakashJapi | Published 2018/01/18


The top four teams, by win-loss record, of the 2017 NFL Season were:

  1. New England Patriots, 13–3
  2. Pittsburgh Steelers, 13–3
  3. Philadelphia Eagles, 13–3
  4. Minnesota Vikings, 13–3

So which team was the best?

At first glance, this question sounds unanswerable. If win-loss records are a complete evaluation of a team’s performance over the course of the season, then all these teams are equally good, and there’s no differentiation to be had.

But as anyone who watches any sports knows, this isn’t true. Even disregarding that the Eagles lost their starting quarterback, Carson Wentz, in week 14, you’d still see most fans place the Patriots and the Vikings above the Eagles and the Steelers. Why? Because there are clear differences in the way teams play: in their strategy, in their strengths and weaknesses, and their focuses throughout the game.

In short, every win isn’t created equal.

And so, people come up with ranking systems. The Ringer, FiveThirtyEight, those dumbasses who make a job out of yelling on First Take, all have developed a system to evaluate and rank NFL teams based on data outside their records.

These vary in quality. As anyone with half a brain could imagine, Skip Bayless’s reasoning is god-awful. ESPN’s are pretty good, in my opinion, and I think FiveThirtyEight’s Elo-based system is both really interesting and gives consistently reasonable predictions.

Today, I’m going to ignore all of those systems and make my own, based on fundamentally different reasoning. Why? Because I can.

Let’s start with an example. Imagine a league with three teams: X, Y, and Z. And let’s say this series of results happened:

  1. X beat Y.
  2. Y beat Z.
  3. Z beat X.
  4. Y beat X.

Then, with four games played, our three teams have the following records:

  • X: 1–2
  • Y: 2–1
  • Z: 1–1

Our objective, then, is to assign each team a rating, R, that represents the true strength of that team.

First, let’s impose a constraint on every team’s rating:

Pr(X beats Y) = R_X / (R_X + R_Y)

That is, the probability that team X beats team Y is equal to its share of the total rating between those two teams. Then, for our four observed games, we can arrive at the following:

Pr(schedule) = (R_X / (R_X + R_Y)) * (R_Y / (R_Y + R_Z)) * (R_Z / (R_Z + R_X)) * (R_Y / (R_Y + R_X))

The above states that the total probability of this season’s results is equal to the product of the probability of the outcome of each individual game. But, we know that these games resulted in the given outcomes: we observed them ourselves! So, the ratings that maximize the probability of our observed outcomes are the “true” ratings of each team.

That is, we want to find values for each R such that Pr(schedule) is maximized: the maximum likelihood estimate for the team ratings.

Actually finding this takes some math, which we’ll go through in the next section, but first, let’s think about this optimization problem intuitively. Imagine that you have three levers, each representing a team’s rating. We start with all three levers at 1, and then slowly start increasing X’s lever. You can see that X is both in the numerator and the denominator of the fraction we’re optimizing, so increasing it only increases Pr(schedule) to a point. We then jump to Y’s lever, and start increasing it until the value again begins to decrease. We then jump back to X’s, or move forward to Z’s, and so on, until we find the perfect balance between the three that maximizes the equation.
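To make that lever intuition concrete, here’s a tiny sketch that evaluates Pr(schedule) for our three-team example while sweeping X’s rating and holding Y and Z at 1:

```python
def pr_schedule(rx, ry, rz):
    """Probability of our observed four-game season under ratings rx, ry, rz."""
    return (rx / (rx + ry)) * (ry / (ry + rz)) * (rz / (rz + rx)) * (ry / (ry + rx))

# The probability peaks at one particular setting of X's lever (around 0.5 here)
# and falls off on either side of that peak.
for rx in [0.25, 0.5, 0.75, 1.0, 1.5, 2.0]:
    print(rx, round(pr_schedule(rx, 1.0, 1.0), 4))
```

The full optimization just finds that balance point for all three levers at once.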

But of course, reality has no levers, and calculating our maximum takes math. We’re essentially going to derive gradient descent for this equation (and eventually, program it and run it!), which starts with finding a gradient.

Again, this is our equation:

Pr(schedule) = (R_X / (R_X + R_Y)) * (R_Y / (R_Y + R_Z)) * (R_Z / (R_Z + R_X)) * (R_Y / (R_Y + R_X))

The first thing you’ll notice is that this would be a frustrating expression to take the derivative of. We have division and multiplication, which will lead to some painful algebra. But, as it turns out, we don’t have any particular allegiance to this specific equation: we’d be fine optimizing an equivalent equation that’s easier to derive but behaves the same way as this one (that is, has the same extrema as this one).

I’m arbitrarily choosing the natural log, because it’s a monotonically increasing function that preserves the global extrema of the original equation, and it turns this complex blob of math into easier-to-handle additions and subtractions. Then:

ln Pr(schedule) = ln(R_X / (R_X + R_Y)) + ln(R_Y / (R_Y + R_Z)) + ln(R_Z / (R_Z + R_X)) + ln(R_Y / (R_Y + R_X))

With some algebra, this becomes:

ln Pr(schedule) = ln R_X + 2 ln R_Y + ln R_Z - 2 ln(R_X + R_Y) - ln(R_Y + R_Z) - ln(R_Z + R_X)

The next step is to take the partial derivative with respect to each R in turn. I’m just going to give you these equations; verify my math if you’d like!

∂/∂R_X: 1/R_X - 2/(R_X + R_Y) - 1/(R_Z + R_X)
∂/∂R_Y: 2/R_Y - 2/(R_X + R_Y) - 1/(R_Y + R_Z)
∂/∂R_Z: 1/R_Z - 1/(R_Y + R_Z) - 1/(R_Z + R_X)

Now, we set each derivative to 0 (because we want to find the point at which the function is not changing, which for this function is the maximum), and shuffle our equations into better forms:

R_X = 1 / (2/(R_X + R_Y) + 1/(R_Z + R_X))
R_Y = 2 / (2/(R_X + R_Y) + 1/(R_Y + R_Z))
R_Z = 1 / (1/(R_Y + R_Z) + 1/(R_Z + R_X))

And now we’re left with a system of three equations that represent the ratings for each of our teams. The obvious next step is solving this system, but even a cursory glance tells you that an algebraic solution would be quite difficult to arrive at, if not impossible. Matrix methods also won’t work here, as this is a nonlinear system that can’t be massaged into a better form.

So let’s step back for a second, and look at this differently. What if we just arbitrarily chose values for each R as an initial estimate? For example, let each rating be 1.

Then let’s plug those values back into each equation. That gives you:

R_X = 1 / (2/2 + 1/2) = 2/3 ≈ 0.67
R_Y = 2 / (2/2 + 1/2) = 4/3 ≈ 1.33
R_Z = 1 / (1/2 + 1/2) = 1

What does this tell us? Well, it tells us that after one iteration of the above equations, this is the current estimate of the maximum likelihood. And though it’s not a particularly good estimate yet, it already expresses a lot of the intuitive ideas we discussed earlier. Y won against both X and Z, so it has the highest rating. X beat Y, but it also lost to Y, and it lost to Z, a team that Y beat, so it’s below Y. And Z is above X because it beat X and lost only to Y, which X also lost to.

Now, we could run this same process again, plugging in these new R values and finding an updated estimate. There’s a whole century of academic literature, dating back to the 1920s, that states that eventually, these values will converge and you’ll find your maximum likelihood estimate.

Congratulations! You just ran gradient descent. Now, do the same thing, but for a full NFL season of 32 teams and 256 regular-season games.

Our math above solved for a crippled and abridged NFL season, but it’s easily extensible to a full season. Here are our final equations again:

R_X = 1 / (2/(R_X + R_Y) + 1/(R_Z + R_X))
R_Y = 2 / (2/(R_X + R_Y) + 1/(R_Y + R_Z))
R_Z = 1 / (1/(R_Y + R_Z) + 1/(R_Z + R_X))

Now, if you stare at this for a bit, you’ll be able to convince yourself that the general form of the above equations is:

R_A = W_A / Σ_B (G_AB / (R_A + R_B))

where R_A is the rating of team A, W_A is the total number of wins of team A, G_AB is the number of games A played against team B, and the sum runs over every opponent B.

With our math above solved, we can finally write code to find maximum likelihood estimators for some real data. I’m going to use the 2017 NFL season as our example, and I’m going to write everything from scratch because again, I can.

Let’s start by first implementing the generalized rating equation that we want to optimize:
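Here’s a minimal sketch of that update step, assuming NumPy arrays for the ratings; it implements the fixed-point form R_i = W_i / Σ_j (G_ij / (R_i + R_j)) from above (the original snippet isn’t reproduced here, so treat this as an illustration rather than the exact code):

```python
import numpy as np

def optimization_function(current_weights, games_matrix, wins_array):
    """One pass of the update R_i = W_i / sum_j(G_ij / (R_i + R_j))."""
    new_weights = np.zeros(len(current_weights), dtype=float)
    for i in range(len(current_weights)):
        # Sum G_ij / (R_i + R_j) over every opponent j
        denominator = sum(
            games_matrix[i][j] / (current_weights[i] + current_weights[j])
            for j in range(len(current_weights))
            if j != i
        )
        new_weights[i] = wins_array[i] / denominator
    return new_weights
```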

Here, current_weights is an array of length 32, initialized to 1 for our first iteration, where current_weights[i] is the current estimate for the rating of team i, games_matrix is a two-dimensional array such that games_matrix[i][j] is the number of times team i played team j, and wins_array is a one-dimensional array where wins_array[i] is the number of wins of team i.

Then, this is exactly the same as the equation we derived above. Now, to iterate it, we can do the following:
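A sketch of that iteration loop (the find_ratings name and the convergence tolerance here are my own choices, not the original code):

```python
def find_ratings(games_matrix, wins_array, tolerance=1e-6):
    """Iterate the update until the ratings stop changing between passes."""
    current_weights = np.ones(len(wins_array))
    while True:
        new_weights = optimization_function(current_weights, games_matrix, wins_array)
        # Stop once the largest change between iterations is negligible
        if np.max(np.abs(new_weights - current_weights)) < tolerance:
            return new_weights
        current_weights = new_weights
```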

This function is fairly self-explanatory: it takes the games_matrix and wins_array params from above, initializes a current_weights array, and then, while there’s still change between iterations, it travels along the gradient curve (that is, our optimization_function) towards a local maximum.

Now, all we need is sample data! To achieve that, I used the incredibly helpful nflgame module, which pulls data from NFL.com’s GameCenter JSON feeds. Here’s the code below:
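A rough sketch of that data-loading step (the nflgame call and game attributes used here, games(), home, away, score_home, and score_away, are assumptions about that module’s API):

```python
import nflgame

def generate_matrices(year=2017):
    """Build the games matrix and wins array from nflgame regular-season data."""
    games = nflgame.games(year, kind='REG')
    teams = sorted({g.home for g in games} | {g.away for g in games})
    index = {team: i for i, team in enumerate(teams)}

    games_matrix = [[0] * len(teams) for _ in range(len(teams))]
    wins_array = [0.0] * len(teams)

    for g in games:
        home, away = index[g.home], index[g.away]
        games_matrix[home][away] += 1
        games_matrix[away][home] += 1
        if g.score_home > g.score_away:
            wins_array[home] += 1
        elif g.score_away > g.score_home:
            wins_array[away] += 1
        else:
            # A tie counts as half a win for each side
            wins_array[home] += 0.5
            wins_array[away] += 0.5

    return teams, games_matrix, wins_array
```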

Now we can wire it all together:
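Roughly, using the sketches above:

```python
teams, games_matrix, wins_array = generate_matrices(2017)
ratings = find_ratings(games_matrix, wins_array)

# Print the teams from highest to lowest rating
for team, rating in sorted(zip(teams, ratings), key=lambda pair: -pair[1]):
    print("{}: {:.3f}".format(team, rating))
```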

Here are the results I got with the above code run on the 2017 NFL Season:

Pretty solid, I think. You can see ESPN’s final regular season power rankings here. Their top five were the New England Patriots, Pittsburgh Steelers, Minnesota Vikings, New Orleans Saints, and the Los Angeles Rams, which is pretty close to mine, especially as we didn’t use any information beyond game results. (For example, I’m sure Carson Wentz tearing his ACL had a large bearing on the Eagles falling out of their top five.)

The central problem with this method thus far is that it treats every victory as equal, and that’s simply incorrect. Should the Eagles’ narrow 15–10 win in the divisional round against the Falcons be treated the same as their 38–7 thrashing of the Vikings in the conference championship this past week? Obviously, it shouldn’t. The Falcons lost a narrowly contested game, while the Vikings were completely destroyed. Indeed, it seems like the Falcons’ rating should rise with these two results, as they lost narrowly to a team that annihilated the highly-rated Vikings.

The simplest metric to express this is margin of victory: the point differential between the winning and losing teams. And an easy way to incorporate that into our method would be, when calculating wins for each team, to weight larger wins more and narrower wins less.

After some trial and error, I came up with this formula:

win weight = 1/2 + D / (2S)

where D is the score differential and S is the total number of points scored. What this essentially says is that every win is automatically worth 1/2, and then we add a factor proportional to the difference in points. If D is close to S, meaning the winning team won by a lot, that factor is close to 1/2, which pushes the total win weight towards 1. If D is much smaller than S, however, we end up with a win weight close to 1/2.

Every win will end up somewhere on this spectrum, but its worth will be proportional to the amount the winner won by. Here’s how it looks in code:
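A sketch of that weighting function, using the 1/2 + D/(2S) formula above (score_home and score_away are again assumptions about the nflgame game object):

```python
def wins_update_formula(game):
    """Weight a win by its margin of victory: 1/2 + D / (2S)."""
    differential = abs(game.score_home - game.score_away)
    total_points = game.score_home + game.score_away
    if total_points == 0:
        return 0.5  # degenerate 0-0 game; treat it as an average win
    return 0.5 + differential / (2.0 * total_points)
```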

Here, game is an object from the nflgame module, which contains data about a particular NFL game.

If you remember, we counted our wins in the generate_matrices function. Here it is again, but calling the wins_update_formula function instead of simply adding 1 for each win:
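In terms of the sketch above, only the win-counting loop inside generate_matrices changes:

```python
# Inside generate_matrices, replace the "add 1 for each win" step with:
for g in games:
    home, away = index[g.home], index[g.away]
    games_matrix[home][away] += 1
    games_matrix[away][home] += 1
    # Weight each win by its margin of victory instead of adding 1
    if g.score_home > g.score_away:
        wins_array[home] += wins_update_formula(g)
    elif g.score_away > g.score_home:
        wins_array[away] += wins_update_formula(g)
```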

And of course, here are results:

They changed a bit. I’ll leave a full comparison to you, but New Orleans hopped up a spot, while Carolina fell. Minnesota is still our regular season favorite, but Philadelphia rose one, and Atlanta dropped a bit.

There’s a thousand things more you can do. Find a way to take into account home field advantage, or more granular offensive/defensive metrics. Find a more intelligent formula for weighting wins. Run this on a larger dataset and maybe find the best teams of all time. Compare a team’s average regular season rating to its postseason ratings (the only teams that probably have enough data for this are the Steelers and the Patriots). Fix my shitty code to run faster on larger amounts of data.

But before you do that, there’s a problem with my current method. My more statistically-knowledgeable readers probably realized already that there’s an autocorrelation issue with incorporating margin of victory: the better teams will both win more often and win by larger margins more often, which will inflate our ratings over time. A solution to this would be to find a win update formula that weighs margin of victory more when a lower-rated team wins and less when a higher-rated team wins. I’ll channel the stodgy professor in me and leave that as an exercise (hint: use the natural log, Luke!).

Here’s the repository containing all the code used in this article. Otherwise, you can find me on Twitter at @AakashJapi, email at aakashjapi at gmail dot com, and Facebook at the above name. Feel free to contact me with ideas/thoughts/nfl memes/Patriots trash talk, or anything at all.

