As a decently ranked League of Legends player, I’ve always wondered about the importance of dodging in ranked games. If you see that your Riven top is on a five game lose streak, or you get a first time Taliyah mid, should you risk sticking around and playing it out, or should you dodge to save yourself the trouble? I started this project to finally answer that question, and get a better sense of the factors that affect a ranked game.
For this project, I worked with 1700 sample matches. Each match was treated as two data points: one for the winning team and one for the losing team. Through Riot’s and Champion.gg’s public APIs, I generated 14 features for each player on a team, for 70 features total per team. The first 5 features describe the general stats of the champion the player picked, while the other 9 are specific to the player’s personal stats. The features were:
Whenever there was an off-meta champion pick, some of the champion features would be NaN, as Champion.gg did not have enough data to provide accurate stats for that champion. To account for this, I used two NaN replacement schemes:
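The two schemes can be sketched with NumPy, assuming each team is a row in a feature matrix `X` and that “averaging” means replacing each NaN with its feature’s mean over the non-NaN samples:

```python
import numpy as np

def scheme_one(X):
    """Scheme one: replace each NaN with its column (feature) mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)        # per-feature mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = np.take(col_means, cols)  # fill each NaN from its column
    return X

def scheme_two(X, y):
    """Scheme two: drop any sample (team) containing a NaN feature."""
    mask = ~np.isnan(X).any(axis=1)
    return X[mask], y[mask]
```

Scheme one keeps all 3400 samples at the cost of slightly biased champion stats; scheme two keeps only clean samples.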
For this project, I compared the effectiveness of deep learning, random forests, support vector machines, gradient boost, and logistic regression in predicting wins and losses. Splitting my data into 70% training and 30% test, I used grid search to find the best hyperparameters for each model. These were my results using scheme one (averaging NaN values):
These were my results using scheme two (removing samples with NaN values):
While SVM, neural nets, and logistic regression improve massively when trained only on non-NaN samples, gradient boost and random forests see almost no difference in effectiveness. Even with the increase in accuracy, the other models’ performance does not match that of gradient boost and random forests. It is also interesting to note that every model performed better at identifying wins than losses. Since I trained on a balanced dataset (50% wins, 50% losses), class imbalance cannot explain this; the models may simply be more lenient in predicting wins.
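The tuning setup described above can be sketched with scikit-learn, shown here for gradient boosting; the parameter grid is hypothetical, as the post doesn’t list the actual grids used per model:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_and_score(X, y, seed=0):
    """70/30 split, then grid search on the training portion only."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    grid = GridSearchCV(
        GradientBoostingClassifier(random_state=seed),
        # Hypothetical grid for illustration.
        param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
        cv=3)
    grid.fit(X_tr, y_tr)
    # Report held-out test accuracy of the best configuration found.
    return grid.best_params_, grid.best_estimator_.score(X_te, y_te)
```

The same pattern applies to each of the five model families, swapping in the appropriate estimator and grid.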
To see if I had an adequate amount of data, I found each model’s accuracy on increasingly sized subsamples of my overall dataset:
All of the models plateau in performance around 2000 samples, so my dataset of 3400 samples is large enough to give an accurate representation of the overall sample space. Furthermore, random forests and gradient boosting outperform the other models under both NaN replacement schemes, so I will use those two models to determine feature importance. It is interesting to note the performance improvement of the neural net, SVM, and logistic regression models on samples containing no NaNs; random forests and gradient boosting, however, performed about the same under both schemes. All NaN values pertained to champion-specific features, so perhaps the three improved models are much more sensitive to those features than random forests and gradient boosting are. Since the two tree-based models are unaffected, I will average NaN values when determining feature importance, as this provides more data to work with.
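The subsampling experiment above can be sketched as follows: hold out a fixed test set, then train on increasingly large random subsamples of the training portion and record test accuracy (shown here with a random forest):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def accuracy_by_sample_size(X, y, sizes, seed=0):
    """Test accuracy as a function of training subsample size."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    rng = np.random.default_rng(seed)
    accs = []
    for n in sizes:
        idx = rng.choice(len(X_tr), size=n, replace=False)
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X_tr[idx], y_tr[idx])
        accs.append(clf.score(X_te, y_te))
    return accs
```

If the returned accuracies flatten out before the largest size, the dataset is big enough for that model.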
To see the importance of certain subsets of features, I trained the random forests and gradient boosting classifiers on only each subset’s features. I ran twenty samples for each subset, with ten for each model. These were the results:
The models seem to work best when trained only on player-specific information, doing significantly better than when trained only on champion-specific information. This implies that a champion’s overall strength in the meta matters less than the player’s experience on that champion. Furthermore, training on the top role alone produces the most variation in performance, whereas training on the AD carry role alone performs significantly worse. This suggests that the AD carry role has much less impact on a game than the other roles, while top lane has the most variable impact.
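The subset experiment amounts to slicing columns out of the 70-feature team matrix before training. The column layout below is an assumption for illustration: 14 features per player, players ordered top/jungle/mid/adc/support, with each player’s 5 champion features before their 9 player features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

ROLES = ["top", "jungle", "mid", "adc", "support"]

def subset_columns(kind):
    """Column indices for a feature subset (assumed layout, see above)."""
    if kind in ROLES:  # all 14 features of one role
        start = ROLES.index(kind) * 14
        return list(range(start, start + 14))
    if kind == "champion":  # 5 champion features per player
        return [p * 14 + j for p in range(5) for j in range(5)]
    if kind == "player":    # 9 player features per player
        return [p * 14 + j for p in range(5) for j in range(5, 14)]
    raise ValueError(kind)

def subset_score(X, y, kind, model):
    """Mean cross-validated accuracy using only one subset's columns."""
    return cross_val_score(model, X[:, subset_columns(kind)], y, cv=3).mean()
```

Running `subset_score` for each subset with both random forests and gradient boosting, over several seeds, reproduces the twenty-samples-per-subset setup.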
Almost all of the matches were gathered in the last week of preseason. People may have been taking ranked games less seriously as a result, which could reduce accuracy. Also, since Riot’s API doesn’t expose a global list of recent matches, I had to crawl match data starting from my own account as a seed. This resulted in an uneven distribution of data across ranks.
From Table 5, we can see that the majority of the data comes from Platinum and Gold, with almost no data on Bronze, Master, and Challenger. Because of this, the conclusions of this project are limited to the well-represented ranks.
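The crawl described above is essentially a breadth-first search over players and matches. A minimal sketch, where `get_match_ids` and `get_participants` are placeholder wrappers around the (rate-limited) Riot API calls rather than real endpoint names:

```python
from collections import deque

def crawl_matches(seed_account, get_match_ids, get_participants, target=1700):
    """BFS from a seed account: pull a player's recent matches, then
    queue the other participants of those matches, until `target`
    distinct matches have been collected."""
    queue = deque([seed_account])
    seen_players, matches = set(), set()
    while queue and len(matches) < target:
        player = queue.popleft()
        if player in seen_players:
            continue
        seen_players.add(player)
        for match_id in get_match_ids(player):
            if match_id not in matches:
                matches.add(match_id)
                queue.extend(get_participants(match_id))
    return matches
```

Because the crawl expands outward from one account, nearby ranks are naturally over-represented, which explains the skew toward the seed’s neighborhood (Platinum/Gold).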
I was pretty satisfied with my results, as they more or less met my expectations. Even with tuned hyperparameters, my best model had an accuracy of only 60% in predicting wins and losses. This shows that games aren’t decided in champion select; a good deal of randomness plays out during the match itself. However, when the model predicts a loss with very high probability (>90%), dodging those games may be wise.
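That dodge rule is easy to express with any scikit-learn classifier that supports `predict_proba`. This sketch assumes class 0 encodes a loss and class 1 a win; the 0.90 threshold is the one suggested above:

```python
from sklearn.ensemble import GradientBoostingClassifier

def should_dodge(model, team_features, threshold=0.90):
    """Dodge only when the model is very confident in a loss.

    Assumes the model's classes are ordered [loss, win], so column 0
    of predict_proba is the loss probability."""
    p_loss = model.predict_proba([team_features])[0][0]
    return p_loss > threshold
```

With a ~60%-accurate model, most games fall well below the threshold, so this would recommend dodging only in rare, clear-cut cases.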
I also found that the ADC role has the lowest impact on the game (I’m an ADC main, oh well), whereas all other roles have a roughly equal impact. Champion picks are much less important than the player who picks them, so don’t think your mid laner is trolling when they pick Hecarim mid.
The github for this project: https://github.com/arilato/ranked_prediction
Riot’s public API: https://developer.riotgames.com
Champion.gg’s public API: http://api.champion.gg