Model Evaluation With Proper Scoring Rules: A No-Math Introduction
Too Long; Didn't Read
Proper scoring rules provide a framework for evaluating probabilistic forecasts. Calibration tells us whether our predictions are statistically consistent with the observed events or values. Sharpness captures how concentrated the predictive distribution is, without reference to the actual outcomes. We also want our evaluation technique to be immune to ‘hedging’: betting on both sides of an argument, or on both teams in a competition. If a metric or score can be ‘hacked’ in this way, reported performance may not reflect true forecast quality.
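As a minimal sketch of why proper scoring rules resist hedging (a simulation of my own, not from the article), the snippet below uses the Brier score, a well-known proper scoring rule for binary events. Under an assumed true event probability of 0.7, honestly reporting that probability achieves a lower average score than hedging toward 50/50:

```python
import numpy as np

def brier_score(forecast_prob, outcome):
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (forecast_prob - outcome) ** 2

rng = np.random.default_rng(0)
true_p = 0.7                              # assumed true event probability
outcomes = rng.random(100_000) < true_p   # simulated binary outcomes

honest = brier_score(true_p, outcomes).mean()  # report the true probability
hedged = brier_score(0.5, outcomes).mean()     # hedge toward 50/50

# Because the Brier score is proper, honesty wins on average
assert honest < hedged
```

Running the same comparison with an improper score, such as rewarding only whether the more-likely side occurred, would show the opposite: hedged or extreme forecasts can match or beat honest ones, which is exactly the failure mode the article warns about.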