An Attempt to Predict the NBA with a Machine Learning System Written in Python Part II

by Francisco Goitia, July 10th, 2017

I now want to talk about the model I discussed in the first piece (https://medium.com/@frgoitia/how-to-create-your-own-machine-learning-predictive-system-in-the-nba-using-python-7189d964a371) in more technical terms. Better a year late than never, I suppose.

To predict the outcome of a match I used a logistic regression model. I compared it against models based on naive Bayes, neural networks, random forests and support vector machines. Every model was cross-validated and its optimal hyperparameters were found.
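
A minimal sketch of that comparison in scikit-learn, assuming a standardized feature matrix X and match outcomes y (names and hyperparameter grids are illustrative, not the exact code from the project):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

# Candidate models and illustrative hyperparameter grids.
candidates = {
    "logistic regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "naive Bayes": (GaussianNB(), {}),
    "neural network": (MLPClassifier(max_iter=2000), {"hidden_layer_sizes": [(16,), (32, 16)]}),
    "random forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
}

for name, (estimator, grid) in candidates.items():
    # GridSearchCV tunes hyperparameters on inner folds; the outer
    # cross_val_score gives a comparable accuracy estimate per model.
    tuned = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    scores = cross_val_score(tuned, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```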

The reason I stuck with a logistic regression model was that its prediction accuracy was on par with, or superior to, more complex solutions, and the transparency of the model means you can use it for qualitative analysis. With logistic regression you understand what the key features are and how much weight each one carries. Logistic regression also returns probabilities that are pretty accurate, which is important for having a notion of how confident you are in your prediction.
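
Both claims are easy to sanity-check: the coefficients tell you which features matter, and a calibration curve tells you whether the predicted probabilities match observed win rates. A sketch, assuming a fitted LogisticRegression called model, a feature_names list, and a held-out X_test / y_test split (all hypothetical names):

```python
from sklearn.calibration import calibration_curve

# Feature weights: sign and magnitude show which standardized metrics
# push the prediction towards a home win.
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name:55s} {coef:+.4f}")

# Calibration: bucket the predicted win probabilities and compare them
# with the observed win frequency in each bucket.
probs = model.predict_proba(X_test)[:, 1]
frac_won, mean_pred = calibration_curve(y_test, probs, n_bins=10)
for p, f in zip(mean_pred, frac_won):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```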

Features

The model consists of the following features with their coefficients. Features were standardized before fitting the model:

  • home court advantage: 0.10218887
  • effective field goal percentage difference: 0.16118265
  • turnover percentage difference: -0.05958713
  • offensive rebound percentage difference: 0.07061777
  • free throws to field goals attempts difference: 0.03267933
  • distance traveled in last 7 days difference: -0.01459163
  • form in last seven matches difference: 0.0828436
  • offensive rating difference: 0.17885523
  • defensive rating difference: -0.33924331
  • effective field goal percentage difference (court*): 0.10808104
  • turnover percentage difference (court*): -0.09548481
  • offensive rebound percentage difference (court*): 0.07055131
  • free throws to field goals attempts difference (court*): 0.0748545
  • form in last seven matches difference (court*): -0.00486437
  • offensive rating difference (court*): 0.14822224
  • defensive rating difference (court*): -0.21756487

*Considering the court situation means the metric is conditioned on where each team plays. For example, if Team A is the host and Team B is the visitor, the effective field goal percentage difference would be Team A's effective field goal percentage when playing at home minus Team B's effective field goal percentage on the road.

The input of a given match would be the difference in each of these metrics between the two teams.
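
A rough sketch of how a single match would be assembled and scored, assuming model is the fitted logistic regression, scaler is the same standardizer used in training, and home_stats / away_stats hold the metrics above (all names illustrative; the encoding of the home-court feature is also an assumption, since the original post doesn't spell it out):

```python
import numpy as np

# Season-level metrics kept per team (hypothetical key names).
TEAM_METRICS = ["efg_pct", "tov_pct", "orb_pct", "ft_fga",
                "travel_last_7d", "form_last_7", "off_rating", "def_rating"]
# Court-adjusted versions: the host's value at home, the visitor's on the road.
COURT_METRICS = ["efg_pct_court", "tov_pct_court", "orb_pct_court",
                 "ft_fga_court", "form_last_7_court",
                 "off_rating_court", "def_rating_court"]

def game_features(home_stats, away_stats):
    """Build one row: home-court flag, then home-minus-away differences."""
    diffs = [home_stats[m] - away_stats[m] for m in TEAM_METRICS + COURT_METRICS]
    return np.array([1.0] + diffs).reshape(1, -1)  # leading 1.0 is the home-court flag

x = scaler.transform(game_features(home_stats, away_stats))
print(f"P(home team wins) = {model.predict_proba(x)[0, 1]:.1%}")
```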

Performance

Let’s take the Celtics’ last championship season, 2007-2008, as an example. This model would have correctly predicted 70% of the matches.

Is this number good? We would obviously expect a dummy model that chooses winners randomly to be correct around 50% of the time. However, we have a better benchmark at our disposal: Vegas money lines. A model that simply predicts that Vegas’ favorite will win would have been correct 69.8% of the time. Considering this is what bookies do for a living, spending a lot of resources on models for setting the odds and leveraging the power of markets, I’d argue that matching Vegas’ performance is a great result.
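
That baseline is simply “pick whichever side has the shorter money line.” A sketch of how it can be computed, using a hypothetical pandas DataFrame with dummy rows in place of real odds data:

```python
import pandas as pd

# Hypothetical games table: American money lines for both sides
# (the favorite carries the more negative line) and the actual result.
games = pd.DataFrame({
    "home_ml": [-250, +120, -105],
    "away_ml": [+210, -140, -115],
    "home_win": [1, 0, 1],
})

vegas_picks_home = games["home_ml"] < games["away_ml"]  # shorter line = favorite
vegas_correct = vegas_picks_home == games["home_win"].astype(bool)
print(f"Vegas-favorite baseline accuracy: {vegas_correct.mean():.1%}")
```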

Interestingly, 87.6% of the time our model picked the Vegas favorite, while 12.4% of the time it picked the underdog. When it picked the underdog, it was correct 51% of the time.
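
The agreement numbers come out of the same kind of bookkeeping. A sketch, assuming aligned boolean Series model_picked_home, vegas_picked_home and home_won from the evaluation run (hypothetical names):

```python
# How often the model sides with Vegas, and how its underdog picks fare.
agrees = model_picked_home == vegas_picked_home

print(f"Picked the Vegas favorite: {agrees.mean():.1%}")
print(f"Picked the underdog:       {(~agrees).mean():.1%}")

# Of the games where the model went against Vegas, how often was it right?
underdog_hits = (model_picked_home == home_won)[~agrees]
print(f"Underdog picks that won:   {underdog_hits.mean():.1%}")
```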