Ranking NFL Teams Using Maximum Likelihood Estimation

by @AakashJapi

January 18th, 2018

The top four teams, by win-loss record, of the 2017 NFL Season were:

- **New England Patriots**, 13–3
- **Pittsburgh Steelers**, 13–3
- **Philadelphia Eagles**, 13–3
- **Minnesota Vikings**, 13–3

So which team was the best?

At first glance, this question sounds unanswerable. If win-loss records are a complete evaluation of a team's performance over the course of the season, then all these teams are equally good, and there's no differentiation to be had.

But as anyone who watches any sports knows, this isn't true. Even disregarding that the Eagles lost their starting quarterback, Carson Wentz, in week 14, you'd still see most fans place the Patriots and the Vikings above the Eagles and the Steelers. Why? Because there are clear differences in the way teams play: in their strategy, in their strengths and weaknesses, and in their focus throughout the game.

In short, every win isn't created equal.

And so, people come up with ranking systems. The Ringer, FiveThirtyEight, and those dumbasses who make a job out of yelling on *First Take* have all developed systems to evaluate and rank NFL teams based on data beyond their records.

These vary in quality. As anyone with half a brain could imagine, Skip Bayless's reasoning is god-awful. ESPN's rankings are pretty good, in my opinion, and I think FiveThirtyEight's Elo-based system is both really interesting and gives consistently reasonable predictions.

Today, I'm going to ignore all of those systems and make my own, based on fundamentally different reasoning. Why? Because I can.

Let's start with an example. Imagine a league with three teams: **X, Y, and Z**. And let's say this series of results happened:

- **X** beat **Y**
- **Y** beat **Z**
- **Z** beat **X**
- **Y** beat **X**

Then, with four games played, our three teams have the following records:

- **X**: 1–2
- **Y**: 2–1
- **Z**: 1–1

Our objective, then, is to assign each team a rating, *R*, that represents the *true* strength of that team.

First, let's impose a constraint on every team's rating:
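Writing the win probability in terms of the two teams' ratings:

```latex
\Pr(X \text{ beats } Y) = \frac{R_X}{R_X + R_Y}
```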

That is, the probability that team X beats team Y is equal to its share of the total rating between those two teams. Then, we can arrive at the following:
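Multiplying together the probabilities of our four observed game outcomes:

```latex
\Pr(\text{schedule}) = \frac{R_X}{R_X + R_Y} \cdot \frac{R_Y}{R_Y + R_Z} \cdot \frac{R_Z}{R_Z + R_X} \cdot \frac{R_Y}{R_X + R_Y}
```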

The above states that the total probability of this season's results is equal to the product of the probability of the outcome of each individual game. But, we know that these games resulted in the given outcomes: we observed them ourselves! So, the ratings that maximize the probability of our observed outcomes are the “true” ratings of each team.

That is, we want to find values for each *R* such that *Pr(schedule)* is maximized: the **maximum likelihood estimate** for the team ratings.

Actually finding this takes some math, which we'll go through in the next section, but first, let's think about this optimization problem intuitively. Imagine that you have three levers, each representing a team's rating. We start with all three levers at 1, and then slowly start increasing X's lever. You can see that X is in both the numerator and the denominator of the fraction we're optimizing, so increasing it only increases *Pr(schedule)* up to a point. We then jump to Y's lever, and start increasing it until the value again begins to decrease. We then jump back to X's, or move forward to Z's, and so on, until we find the perfect balance between the three that maximizes the equation.

But of course, reality has no levers, and calculating our maximum takes math. We're essentially going to derive an iterative optimization scheme for this equation (and eventually, program it and run it!), which starts with finding the gradient.

Again, this is our equation:
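With the four game outcomes from our toy league:

```latex
\Pr(\text{schedule}) = \frac{R_X}{R_X + R_Y} \cdot \frac{R_Y}{R_Y + R_Z} \cdot \frac{R_Z}{R_Z + R_X} \cdot \frac{R_Y}{R_X + R_Y}
```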

The first thing you'll notice is that this would be a frustrating expression to take the derivative of. We have division and multiplication, which will lead to some painful algebra. But, as it turns out, we don't have any particular allegiance to this specific equation: we'd be fine optimizing an equivalent equation that's easier to differentiate but behaves the same way as this one (that is, has the same extrema as this one).

I'm choosing the natural log, because it's a monotonically increasing function that preserves the global extrema of the original equation, and it turns this complex blob of math into easier-to-handle additions and subtractions. Then:
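Applying the log to each factor of the likelihood:

```latex
\ln \Pr(\text{schedule}) = \ln\frac{R_X}{R_X + R_Y} + \ln\frac{R_Y}{R_Y + R_Z} + \ln\frac{R_Z}{R_Z + R_X} + \ln\frac{R_Y}{R_X + R_Y}
```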

With some algebra, this becomes:
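Expanding each quotient and collecting like terms:

```latex
\ln \Pr(\text{schedule}) = \ln R_X + 2\ln R_Y + \ln R_Z - 2\ln(R_X + R_Y) - \ln(R_Y + R_Z) - \ln(R_Z + R_X)
```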

The next step is to take the partial derivative with respect to each *R* in turn. I'm just going to give you these equations; verify my math if you'd like!
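Differentiating the log-likelihood with respect to each rating:

```latex
\begin{aligned}
\frac{\partial \ln \Pr(\text{schedule})}{\partial R_X} &= \frac{1}{R_X} - \frac{2}{R_X + R_Y} - \frac{1}{R_Z + R_X} \\
\frac{\partial \ln \Pr(\text{schedule})}{\partial R_Y} &= \frac{2}{R_Y} - \frac{2}{R_X + R_Y} - \frac{1}{R_Y + R_Z} \\
\frac{\partial \ln \Pr(\text{schedule})}{\partial R_Z} &= \frac{1}{R_Z} - \frac{1}{R_Y + R_Z} - \frac{1}{R_Z + R_X}
\end{aligned}
```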

Now, we set each derivative to 0 (because we want to find the point at which the function is not changing, which for this function is the maximum), and shuffle our equations into better forms:
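Setting each derivative to zero and solving for the leading rating gives:

```latex
\begin{aligned}
R_X &= \frac{1}{\dfrac{2}{R_X + R_Y} + \dfrac{1}{R_Z + R_X}} \\
R_Y &= \frac{2}{\dfrac{2}{R_X + R_Y} + \dfrac{1}{R_Y + R_Z}} \\
R_Z &= \frac{1}{\dfrac{1}{R_Y + R_Z} + \dfrac{1}{R_Z + R_X}}
\end{aligned}
```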

And now we're left with a system of three equations that represent the ratings for each of our teams. The obvious next step is solving this system, but even a cursory glance tells you that an algebraic solution would be quite difficult to arrive at, if not impossible. Matrix methods also won't work here, as this is a nonlinear system that can't be massaged into a better form.

So let's step back for a second, and look at this differently. What if we just arbitrarily chose values for each *R* as an initial estimate? For example, let each rating be 1.

Then let's plug those values back into each equation. That gives you:
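With every rating set to 1:

```latex
\begin{aligned}
R_X &= \frac{1}{\frac{2}{1+1} + \frac{1}{1+1}} = \frac{2}{3} \approx 0.67 \\
R_Y &= \frac{2}{\frac{2}{1+1} + \frac{1}{1+1}} = \frac{4}{3} \approx 1.33 \\
R_Z &= \frac{1}{\frac{1}{1+1} + \frac{1}{1+1}} = 1
\end{aligned}
```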

What does this tell us? Well, it tells us our current estimate of the maximum likelihood ratings after one *iteration* of the above equations. And though it's not a particularly good estimate, it already expresses a lot of the intuitive ideas we discussed earlier. Y won against both X and Z, so it has the highest rating. X beat Y, but it also lost to Y, and it lost to Z, a team that Y beat, so it's below Y. And Z is above X because it beat X and lost only to Y, which X also lost to.

Now, we could run this same process again, plugging in these new *R* values and finding an updated estimate. There's nearly a century of academic literature, dating back to the 1920s, showing that eventually, these values will converge and you'll find your maximum likelihood estimate.

Congratulations! You just ran your first iteration of the optimization. Now, do the same thing, but for a full NFL season of 32 teams and 16 * 32 / 2 = 256 games.

Our math above solved a crippled and abridged NFL season, but it's easily extensible to a full one. Here are our final equations again:
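For the three-team example, these were:

```latex
\begin{aligned}
R_X &= \frac{1}{\dfrac{2}{R_X + R_Y} + \dfrac{1}{R_Z + R_X}} \\
R_Y &= \frac{2}{\dfrac{2}{R_X + R_Y} + \dfrac{1}{R_Y + R_Z}} \\
R_Z &= \frac{1}{\dfrac{1}{R_Y + R_Z} + \dfrac{1}{R_Z + R_X}}
\end{aligned}
```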

Now, if you stare at this for a bit, you'll be able to convince yourself that the general form of the above equations is:
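For any team *A*:

```latex
R_A = \frac{W_A}{\displaystyle\sum_{B \neq A} \frac{G_{A,B}}{R_A + R_B}}
```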

where *R_A* is the rating of team *A*, *W_A* is the total number of wins of team *A*, and *G_{A,B}* is the number of games *A* played against team *B*.

With our math above solved, we can finally write code to find maximum likelihood estimators for some real data. I'm going to use the 2017 NFL season as our example, and I'm going to write everything from scratch because again, *I can*.

Let's start by first implementing the generalized rating equation that we want to optimize:
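Here's a minimal sketch of that function; treat it as one reasonable implementation of the general update, with the array names matching the description below:

```python
def optimization_function(team, current_weights, games_matrix, wins_array):
    """One application of the per-team update:
    R_team = W_team / sum over opponents of G[team][opp] / (R_team + R_opp)."""
    denominator = sum(
        games_matrix[team][opponent] / (current_weights[team] + current_weights[opponent])
        for opponent in range(len(current_weights))
        if games_matrix[team][opponent]  # skip teams we never played
    )
    return wins_array[team] / denominator
```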

Here, `current_weights` is an array of length 32, initialized to 1 for our first iteration, where `current_weights[i]` is the current estimate for the rating of team *i*; `games_matrix` is a two-dimensional array such that `games_matrix[i][j]` is the number of times team *i* played team *j*; and `wins_array` is a one-dimensional array where `wins_array[i]` is the number of wins of team *i*.

Then, this is exactly the same as the equation we derived above. Now, to iterate it, we can do the following:
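A sketch of the iteration loop, with the per-team update folded inline (`find_ratings` is just an illustrative name):

```python
def find_ratings(games_matrix, wins_array, tolerance=1e-9, max_iterations=10000):
    """Repeatedly apply the per-team update until the ratings stop changing."""
    n = len(wins_array)
    current_weights = [1.0] * n
    for _ in range(max_iterations):
        new_weights = [
            wins_array[i] / sum(
                games_matrix[i][j] / (current_weights[i] + current_weights[j])
                for j in range(n) if games_matrix[i][j]
            )
            for i in range(n)
        ]
        # Stop once no rating moved by more than the tolerance.
        if max(abs(a - b) for a, b in zip(new_weights, current_weights)) < tolerance:
            return new_weights
        current_weights = new_weights
    return current_weights
```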

This function is fairly self-explanatory: it takes the `games_matrix` and `wins_array` params from above, initializes a `current_weights` array, and then, while there's still change between iterations, it travels along the gradient (that is, our `optimization_function`) towards a local maximum.

Now, all we need is sample data! To get it, I used the incredibly helpful nflgame module, which pulls data from NFL.com's GameCenter JSON feeds. Here's the code below:
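A sketch of `generate_matrices`, assuming each game object exposes `home`, `away`, `score_home`, and `score_away` the way nflgame's game objects do:

```python
def generate_matrices(games):
    """Build games_matrix and wins_array from a collection of game objects."""
    games = list(games)
    teams = sorted({g.home for g in games} | {g.away for g in games})
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    games_matrix = [[0] * n for _ in range(n)]
    wins_array = [0.0] * n
    for g in games:
        h, a = index[g.home], index[g.away]
        games_matrix[h][a] += 1
        games_matrix[a][h] += 1
        if g.score_home > g.score_away:
            wins_array[h] += 1
        elif g.score_away > g.score_home:
            wins_array[a] += 1
    return teams, games_matrix, wins_array
```

In use, this would be fed something like `nflgame.games(2017)`.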

Now we can wire it all together:

Here are the results I got with the above code run on the 2017 NFL Season:

Pretty solid, I think. You can see ESPN's final regular season power rankings here. Their top five were the New England Patriots, Pittsburgh Steelers, Minnesota Vikings, New Orleans Saints, and the Los Angeles Rams, which is pretty close to mine, especially as we didn't use any information beyond game results. (For example, I'm sure Carson Wentz tearing his ACL had a large bearing on the Eagles falling out of their top five.)

The central problem with this method so far is that it treats every victory as equal, and that's simply incorrect. Should the Eagles' narrow 15–10 win over the Falcons in the divisional round be treated the same as their 38–7 thrashing of the Vikings in the conference championship this past week? Obviously, it shouldn't. The Falcons lost a narrowly contested game, while the Vikings were completely destroyed. Indeed, it seems like the Falcons' rating should rise with these two results, as they lost closely to a team that annihilated the highly-rated Vikings.

The simplest metric to express this is margin of victory: the point differential between the winning and losing teams. And an easy way to incorporate it into our method is, when calculating wins for each team, to weight larger wins more and smaller wins less.

After some trial and error, I came up with this formula:
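One simple formula with exactly the behavior described below is:

```latex
w = \frac{1}{2} + \frac{D}{2S}
```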

where *D* is the score differential and *S* is the total number of points scored. What this essentially says is that every win automatically gets 1/2, and then we add a factor proportional to the difference in points. If *D* is close to *S*, meaning the winning team won by a lot, that factor is close to 1/2, which leads to a total win weight approaching 1. If *D* is much less than *S*, however, we end up with a win weight approaching 1/2.

Every win will end up somewhere on this spectrum, but its worth will be proportional to the amount the winner won by. Here's how it looks in code:
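A sketch of `wins_update_formula`, again assuming nflgame-style `score_home` and `score_away` attributes:

```python
def wins_update_formula(game):
    """Weight of a win: 1/2 plus a term proportional to the margin of victory.
    A blowout (margin close to the total points) approaches a weight of 1;
    a squeaker approaches a weight of 1/2."""
    differential = abs(game.score_home - game.score_away)
    total_points = game.score_home + game.score_away
    return 0.5 + differential / (2.0 * total_points)
```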

Here, `game` is an object from the nflgame module, which contains data about a particular NFL game.

If you remember, we counted our wins in the `generate_matrices` function. Here it is again, but calling the `wins_update_formula` function instead of simply adding 1 for each win:
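A sketch of the weighted version (the margin-of-victory formula is repeated here so the block stands on its own; game objects are assumed to expose nflgame-style attributes):

```python
def wins_update_formula(game):
    # Margin-of-victory weight from above: 1/2 + D / (2S).
    differential = abs(game.score_home - game.score_away)
    total_points = game.score_home + game.score_away
    return 0.5 + differential / (2.0 * total_points)

def generate_matrices(games):
    """Same as before, but each win adds wins_update_formula(game) instead of 1."""
    games = list(games)
    teams = sorted({g.home for g in games} | {g.away for g in games})
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    games_matrix = [[0] * n for _ in range(n)]
    wins_array = [0.0] * n
    for g in games:
        h, a = index[g.home], index[g.away]
        games_matrix[h][a] += 1
        games_matrix[a][h] += 1
        if g.score_home > g.score_away:
            wins_array[h] += wins_update_formula(g)   # weighted win, not +1
        elif g.score_away > g.score_home:
            wins_array[a] += wins_update_formula(g)
    return teams, games_matrix, wins_array
```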

And of course, here are the results:

They changed a bit. I'll leave a full comparison to you, but New Orleans hopped up a spot, while Carolina fell. Minnesota is still our regular season favorite, but Philadelphia rose one, and Atlanta dropped a bit.

There are a thousand more things you can do. Find a way to take into account home-field advantage, or more granular offensive/defensive metrics. Find a more intelligent formula for weighting wins. Run this on a larger dataset and maybe find the best teams of all time. Compare a team's average regular season rating to its postseason ratings (the only teams that probably have enough data for this are the Steelers and the Patriots). Fix my shitty code to run faster on larger amounts of data.

But before you do that, there's a problem with my current method. My more statistically-knowledgeable readers probably realized already that there's an autocorrelation issue with incorporating margin of victory: the better teams will both win more often and win by larger margins, which will inflate their ratings over time. A solution would be to find a win update formula that weighs margin of victory more when a lower-rated team wins and less when a higher-rated team wins. I'll channel the stodgy professor in me and leave that as an exercise (hint: *use the natural log, Luke!*).

Here's the repository containing all the code used in this article. Otherwise, you can find me on Twitter at @AakashJapi, email at *aakashjapi at gmail dot com*, and Facebook at the above name. Feel free to contact me with ideas/thoughts/NFL memes/Patriots trash talk, or anything at all.