Which sports geek wouldn’t like to build their own system for predicting matches, whether to bet or just out of intellectual curiosity? Nowadays, advanced statistics are available on websites like basketball-reference, and excellent machine learning libraries exist for most popular programming languages. This is not going to be a comprehensive DIY guide. I’m just going to talk about what I found after playing with this for a few months and share some code that should be useful for getting started.
Machine learning works by building models that learn weights and relationships between features from historical data, and then using these models to predict future outcomes. So you need to understand the sport, think about which variables are representative of future performance, build a database that contains this information, and run machine learning algorithms on the historical data to assign weights to these variables analytically.
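To make the idea of learned weights concrete, here is a minimal sketch using scikit-learn on synthetic data. The feature names (`net_rating_diff`, `rest_days_diff`), the coefficients used to generate the fake outcomes, and the 300/100 split are all assumptions made up for illustration, not values taken from any real model or season.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_games = 400

# Hypothetical features: home-minus-away differences in season-to-date stats.
net_rating_diff = rng.normal(0, 5, n_games)
rest_days_diff = rng.normal(0, 1, n_games)

# Synthetic outcomes (1 = home win), generated from made-up coefficients.
logit = 0.3 * net_rating_diff + 0.1 * rest_days_diff + 0.4
home_win = (rng.random(n_games) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([net_rating_diff, rest_days_diff])

# Fit on the earlier games and evaluate on the later ones, so the model
# is never scored on games it has already seen.
split = 300
model = LogisticRegression().fit(X[:split], home_win[:split])

accuracy = model.score(X[split:], home_win[split:])
weights = dict(zip(["net_rating_diff", "rest_days_diff"], model.coef_[0]))

print(f"held-out accuracy: {accuracy:.2f}")
print("learned weights:", weights)
```

The learned `weights` are what the paragraph above means by assigning weights analytically: the algorithm estimates how much each variable should count from the historical results instead of you guessing.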
I spent quite a long time building an NBA and NCAA scraper that downloads full seasons (match by match) from basketball-reference. The scraper takes care of the problems you would otherwise run into when loading this data into a relational database, so you are guaranteed to be able to associate information uniquely. It models matches in a detailed JSON format that captures the more advanced events that take place in a basketball game. To represent this information in your own database you will need to define a schema and insert the data. I used SQLAlchemy to write models that can be used to create the database and build an analytical system. It’s all available on my GitHub repo.
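As a rough idea of what defining a schema with SQLAlchemy looks like, here is a minimal sketch. The table names, columns, team codes and scores are hypothetical and chosen only for illustration; they are not the models from the repo. It assumes SQLAlchemy 1.4 or newer and uses an in-memory SQLite database.

```python
import datetime

from sqlalchemy import (Column, Date, ForeignKey, Integer, String,
                        UniqueConstraint, create_engine)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class Team(Base):
    __tablename__ = "teams"
    id = Column(Integer, primary_key=True)
    code = Column(String(3), unique=True, nullable=False)


class Match(Base):
    __tablename__ = "matches"
    # A given pairing on a given date can only be stored once.
    __table_args__ = (
        UniqueConstraint("date", "home_team_id", "away_team_id"),
    )
    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    home_team_id = Column(Integer, ForeignKey("teams.id"), nullable=False)
    away_team_id = Column(Integer, ForeignKey("teams.id"), nullable=False)
    home_points = Column(Integer, nullable=False)
    away_points = Column(Integer, nullable=False)
    home_team = relationship("Team", foreign_keys=[home_team_id])
    away_team = relationship("Team", foreign_keys=[away_team_id])


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    # Made-up teams and score, not a real game.
    home, away = Team(code="AAA"), Team(code="BBB")
    session.add(Match(date=datetime.date(2014, 1, 1),
                      home_team=home, away_team=away,
                      home_points=101, away_points=95))
    session.commit()
    home_wins = (session.query(Match)
                 .filter(Match.home_points > Match.away_points)
                 .count())

print("home wins stored:", home_wins)
```

The unique constraint is one simple way to get the "uniquely associate information" guarantee at the database level: inserting the same match twice fails instead of silently duplicating rows.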
scikit-learn is the way to go for building machine learning systems in Python. You will need to figure out which attributes work best for predicting future matches based on historical performance. As mentioned above, understanding the sport lets you choose more advanced metrics like Dean Oliver’s four factors. These, combined with human-generated signals such as Vegas lines, work best.
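For reference, the four factors can be computed from a plain box score. The sketch below uses the commonly published definitions (effective field goal percentage, turnover rate, offensive rebounding rate, and free throws made per field goal attempt), expressed as fractions rather than percentages. The `TeamBoxScore` class, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TeamBoxScore:
    fg: int       # field goals made
    fga: int      # field goals attempted
    fg3: int      # three-pointers made
    ft: int       # free throws made
    fta: int      # free throws attempted
    tov: int      # turnovers
    orb: int      # offensive rebounds
    opp_drb: int  # opponent's defensive rebounds


def four_factors(box: TeamBoxScore) -> dict:
    """Return the four factors for one team in one game, as fractions."""
    return {
        # Shooting: a made three counts 1.5 times a made two.
        "efg_pct": (box.fg + 0.5 * box.fg3) / box.fga,
        # Turnovers per estimated play (0.44 is the usual FTA weight).
        "tov_pct": box.tov / (box.fga + 0.44 * box.fta + box.tov),
        # Share of available offensive rebounds the team collected.
        "orb_pct": box.orb / (box.orb + box.opp_drb),
        # Free throws made per field goal attempt.
        "ft_rate": box.ft / box.fga,
    }


# Made-up box score, not a real game.
sample = TeamBoxScore(fg=40, fga=80, fg3=10, ft=15, fta=20,
                      tov=12, orb=10, opp_drb=30)
factors = four_factors(sample)
print(factors)
```

Averaging these per team over the games played before a given match, and feeding the home-minus-away differences to a scikit-learn classifier, is one plausible way to turn them into features.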
If you build your own machine learning models you will find that you can correctly predict winners at a rate of around 70%. That is not enough to make money betting, but it is still better than ESPN experts and a lot of academic papers. You will also learn a lot about the sport, databases, machine learning and Python.