Before going into more detail about retraining approaches for machine learning models, let's look at the basic lifecycle of a machine learning model.
Generally, a machine learning model is trained to learn a relationship between a set of input features and a dependent (target) variable. The aim of training is to minimize prediction error by optimizing a cost function. Once we have an optimized model, we deploy it to production, where the goal is for it to predict future, unseen data as accurately as it predicted the data used during training.
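To make that cycle concrete, here is a minimal sketch using scikit-learn on synthetic data (the dataset and model choice are purely illustrative): fit a model that minimizes a cost function on training data, then check how well it generalizes to held-out data standing in for "future unseen data".

```python
# Minimal sketch of the basic cycle: train on past data, evaluate on unseen data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for historical training data.
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()   # fitting minimizes squared error (the cost function)
model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"training MSE: {train_error:.2f}, held-out MSE: {test_error:.2f}")
```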
Once a machine learning model has been deployed to production, we assume that future unseen data will look similar to the past data on which the model was trained. In particular, we assume that the distribution of every feature remains constant, but this is not always the case. Data distributions change over time, and our model should adapt to those changes.
Take house price prediction as an example: house prices do not stay the same over time. A model trained on data from a few months ago will not give great predictions today. We need up-to-date data to train such models.
When developing any ML model, it is important to understand how the data will change over time. We need a good architecture, or at least a plan, for keeping our models up to date.
To detect these kinds of changes and monitor model performance, there is a concept called model drift. Let's see what it is.
Model drift refers to the degradation of a model's predictive performance over time. It typically occurs because of changes in the environment or the data, so we need to detect these drifts and address them using different model retraining approaches.
When we say Model Drift, of course it’s not the model that is drifting. It’s the environment that is changing around the model.
There are different approaches for identifying model drift and for retraining models, but there is no single standard method that works for all kinds of problems. Different methods should be applied to different kinds of problems.
3.1 PERFORMANCE DEGRADATION
One way to identify model drift is to explicitly determine that predictive performance has deteriorated and to quantify the decline.
Consider a financial forecasting model that predicts next quarter's revenue. The actual revenue won't be observed until that quarter passes, so we cannot measure how well the model performed until then. Only at that point do we get a sense of how quickly the model's performance degrades.
As Josh Wills points out, one of the most important things you can do before deploying a model is to try to understand model drift in an offline environment.
Data scientists should seek to answer the question: "If I train a model using this set of features on data from six months ago, and I apply it to data that I generated today, how much worse is the model than the one that I created trained off of data from a month ago and applied to today?"
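One way to run that offline experiment is a simple backtest: train the same pipeline on older and on more recent data, and evaluate both against today's data. The sketch below assumes a hypothetical pandas DataFrame `df` with a "date" column, feature columns, and a "target" column; the model, column names, and cut-off dates are all illustrative assumptions.

```python
# Hedged sketch of the offline drift experiment: same pipeline, different training windows.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def backtest(df, train_start, train_end, eval_start, features, target="target"):
    """Train on one date window, evaluate on recent data, return the error."""
    train = df[(df["date"] >= train_start) & (df["date"] < train_end)]
    recent = df[df["date"] >= eval_start]
    model = GradientBoostingRegressor(random_state=0)
    model.fit(train[features], train[target])
    return mean_absolute_error(recent[target], model.predict(recent[features]))

# Hypothetical usage: compare a six-month-old model with a one-month-old one.
# err_6m = backtest(df, "2023-01-01", "2023-02-01", "2023-07-01", features)
# err_1m = backtest(df, "2023-06-01", "2023-07-01", "2023-07-01", features)
# print(f"6-month-old model MAE: {err_6m:.3f} vs 1-month-old model MAE: {err_1m:.3f}")
```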
3.2 DISTRIBUTION OF TRAINING AND LIVE DATA
Since model performance degrades when the distribution of live data deviates from that of the training data, comparing the two distributions lets us infer model drift before it shows up as prediction errors. We can monitor things like the range of values, histograms, the proportion of null values, and so on.
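As a sketch of what such monitoring could look like, the function below compares each feature's training and live distributions with a two-sample Kolmogorov-Smirnov test and reports null rates. The column names, p-value threshold, and DataFrame names are assumptions, not part of the original article.

```python
# Minimal sketch: compare training vs. live feature distributions.
import pandas as pd
from scipy.stats import ks_2samp

def check_feature_drift(train_df, live_df, features, p_threshold=0.01):
    report = {}
    for col in features:
        # KS test: a small p-value suggests the two samples come from different distributions.
        stat, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        report[col] = {
            "ks_pvalue": p_value,
            "drifted": p_value < p_threshold,
            "train_null_rate": train_df[col].isna().mean(),
            "live_null_rate": live_df[col].isna().mean(),
        }
    return pd.DataFrame(report).T

# Hypothetical usage for the house price example:
# print(check_feature_drift(train_df, live_df, ["sqft", "bedrooms", "age"]))
```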
3.3 CORRELATIONS BETWEEN FEATURES
One assumption in model building is that the relationships between features remain fixed, so we should monitor the pairwise correlations between individual features.
As mentioned in What’s your ML Test Score? A rubric for ML production systems, you can do this by monitoring the correlation coefficients between features over time and flagging significant changes, as sketched below.
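Here is one possible sketch of that idea: compute the correlation matrix on training data and on live data, then surface the feature pairs whose correlation has shifted by more than a chosen threshold. The threshold and DataFrame names are assumptions for illustration.

```python
# Sketch: flag feature pairs whose pairwise correlation has shifted.
import numpy as np
import pandas as pd

def correlation_shifts(train_df, live_df, threshold=0.2):
    delta = (live_df.corr() - train_df.corr()).abs()
    # Keep only the upper triangle so each feature pair appears once.
    upper = delta.where(np.triu(np.ones(delta.shape, dtype=bool), k=1))
    shifted = upper.stack()
    return shifted[shifted > threshold].sort_values(ascending=False)

# Hypothetical usage:
# print(correlation_shifts(train_df[features], live_df[features]))
```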
3.4 EXAMINING THE TARGET DISTRIBUTIONS
If the distribution of the dependent variable changes frequently or significantly, model performance will deteriorate. The authors of Machine Learning: The High-Interest Credit Card of Technical Debt state that one simple and useful diagnostic is to track the target distribution.
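One way to quantify a shift in the target distribution (an approach of my choosing, not one prescribed by that paper) is the Population Stability Index: bin the training targets, bin the recent targets into the same bins, and compare the two histograms.

```python
# Sketch: track the target distribution with a simple Population Stability Index (PSI).
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    # Bin edges are derived from the training-time (expected) target distribution.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero in empty bins.
    e_frac, a_frac = np.clip(e_frac, 1e-6, None), np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

# Hypothetical usage:
# psi = population_stability_index(y_train, y_recent)
# print(f"target PSI: {psi:.3f}")  # a common rule of thumb: > 0.2 suggests a meaningful shift
```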
Sometimes model retraining is taken to mean finding new hyperparameters for an existing model architecture. To be clear, that is not what we mean here: retraining is not about finding new hyperparameters, trying new model architectures, or adding/updating features. We need to think about the problem in a way that lets us reduce drift in the productionized model.
The entire machine learning model building process goes through a set of cycles, from feature engineering to model selection to error estimation. The optimal model is then chosen and deployed to production.
Since model drift refers to the degradation of model performance due to reasons like variation in the data or changes in the environment, retraining should not result in a different model-generating process.
Rather, retraining simply refers to re-running the process that generated the previously selected model on a new training set of data.
The features, model algorithm, and hyper-parameter search space should all remain the same. One way to think about this is that retraining doesn’t involve any code changes. It only involves changing the training data set.
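A minimal sketch of that idea: the pipeline definition (features, algorithm, hyperparameter search space) is a fixed piece of code, and retraining just calls it again with fresh data. The function name, columns, and parameter grid below are hypothetical.

```python
# Sketch of retraining as "same code, new data".
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_model(X, y):
    """The unchanged model-generating process: same features, algorithm, and search space."""
    pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
    search = GridSearchCV(pipeline, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X, y)
    return search.best_estimator_

# Hypothetical usage:
# original_model = build_model(X_old, y_old)    # trained months ago
# retrained_model = build_model(X_new, y_new)   # same process, fresh training data
```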
That is not to say we can never change hyperparameters or add features, but if we do, we are effectively starting an entirely new model-building process, and the resulting model needs to be tested before it is deployed to the production environment.
There is no standard rule that says we should retrain a model weekly, monthly, or quarterly. It all depends on the use case and varies from problem to problem.
If we are building a model that predicts whether students will return next semester, there is no point in retraining it daily, since new data arrives only once a semester; retraining quarterly or at the start of each new semester is enough.
This is an example of a periodic retraining schedule. It's often a good idea to start with this simple strategy, but you'll need to determine exactly how frequently to retrain. Quickly changing training sets might require you to retrain as often as daily or weekly; slowly varying distributions might only need monthly or annual retraining.
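A periodic schedule can be as simple as a scheduled job that checks whether the retraining interval has elapsed. In the sketch below, the interval, the date handling, and `build_model` are assumptions for illustration only.

```python
# Sketch: a periodic retraining check, e.g. run from a daily cron job.
from datetime import date, timedelta

RETRAIN_EVERY = timedelta(days=90)  # e.g. quarterly; tune per use case

def should_retrain(last_trained: date, today: date | None = None) -> bool:
    """Return True when the configured retraining interval has elapsed."""
    today = today or date.today()
    return today - last_trained >= RETRAIN_EVERY

# Hypothetical usage:
# if should_retrain(last_trained=date(2023, 1, 1)):
#     model = build_model(X_new, y_new)   # re-run the unchanged training process
```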
Apart from a fixed schedule, there are other approaches we can follow, such as:
Continuous learning with Watson ML (IBM also provides the Watson ML cloud service for exactly this retraining purpose).
Previously published at https://akhil-kasare80.medium.com/retraining-machine-learning-models-how-important-it-is-e39e475bca9c