We often get blocked at different steps while working on a machine learning problem. To help with most of these steps, I have listed the major challenges we face and the steps we can take to overcome them. I have also grouped these challenges into three areas for easier understanding: Data Preparation, Model Training and Model Deployment.
Data Preparation
Data collection:
- Incomplete data is a common headache when we start collecting data. Even when we do get data, it can turn out to be biased. Bias is any deviation from the truth in data collection or data analysis that can lead to false conclusions.
- Then comes the curse of dimensionality, which refers to phenomena that arise when analyzing high-dimensional data and do not occur in low-dimensional spaces.
- Finally, we have the data sparsity problem. Imagine a table with lots of null or impossible values. These values represent the sparsity in your data.
Steps to overcome:
- Dedicate proper time to understanding the problem and the datasets you need to solve it
- Enrich the data
- Apply dimension-reduction techniques
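As a rough illustration of the last step, here is a minimal sketch of one dimension-reduction technique, principal component analysis (PCA), written with NumPy. PCA is only one of several options, and the function name `pca_reduce` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    # Center each feature so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of total variance captured by each component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return X_centered @ components.T, explained[:n_components]

# Toy data (assumed): 200 samples, 10 features driven by 2 hidden factors.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = hidden @ mixing + 0.01 * rng.normal(size=(200, 10))

X_reduced, explained_ratio = pca_reduce(X, n_components=2)
print(X_reduced.shape)        # (200, 2)
print(explained_ratio.sum())  # close to 1.0 for this toy data
```

Because the toy features are built from two hidden factors, two components keep almost all of the variance. On real data the explained-variance share is a useful guide for choosing how many components to keep.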
Outliers:
- Out-of-range numerical values or unknown categorical values in our data
- They have a drastic influence on squared loss functions, because squaring the error magnifies large deviations
Steps to overcome:
- Discretization techniques like binning can reduce the influence of outliers on squared loss functions
- Robust methods like the Huber loss function
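To see why a robust loss helps, the sketch below compares the squared loss with the Huber loss on three residuals, one of which is an outlier. The threshold `delta=1.0` and the residual values are assumptions chosen for illustration.

```python
import numpy as np

def squared_loss(residuals):
    """Half the squared error, which grows quadratically with the residual."""
    return 0.5 * residuals ** 2

def huber_loss(residuals, delta=1.0):
    """Quadratic for small residuals, linear beyond `delta`."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, -1.0, 10.0])  # the last value plays the outlier

sq = squared_loss(residuals)   # outlier contributes 50.0
hu = huber_loss(residuals)     # outlier contributes 9.5
print(sq, hu)
```

Both losses agree on the small residuals, but the outlier's contribution drops from 50.0 to 9.5 under the Huber loss, so it pulls the fit far less.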
Missing Data:
- Missing data causes information loss and therefore reduces the model's accuracy
- It can also lead to information bias, which happens when key information is measured, collected, or interpreted inaccurately
Steps to overcome:
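- Tree-based modelling techniques can help in dealing with this problem, since some implementations handle missing values natively
- Discretization can also help here, for example by placing missing values in a bin of their own
- Imputation, which fills in missing values with estimates such as the mean or median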
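A minimal sketch of imputation, assuming missing values are represented as `None` and that the median is an acceptable fill value. The helper `impute_median` is hypothetical; it also returns a flag column so a model can still learn from the fact that a value was missing.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values.

    Returns the filled list and a 0/1 list marking which entries were imputed.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

ages = [25, None, 40, 31, None, 58]
filled, flags = impute_median(ages)
print(filled)  # [25, 35.5, 40, 31, 35.5, 58]
print(flags)   # [0, 1, 0, 0, 1, 0]
```

The fill value should be computed on the training data only and then reused for new data, otherwise information leaks from the evaluation set into training.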
Sparse target variables:
- It happens when the primary event has a low occurrence rate
- There is an overwhelming preponderance of zero or missing values in the target
Steps to overcome:
- Proportional oversampling
- Mixture models
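The oversampling idea can be sketched as follows. This is plain random oversampling of the rare class up to a target share, which is one simple reading of proportional oversampling; the function name, the target share of 0.3 and the toy data are assumptions.

```python
import math
import random

def oversample_rare_class(rows, labels, rare_label, target_share, seed=0):
    """Duplicate randomly chosen rare-class rows until the rare class
    makes up at least `target_share` of the returned data."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    rng = random.Random(seed)
    pairs = list(zip(rows, labels))
    rare = [p for p in pairs if p[1] == rare_label]
    common = [p for p in pairs if p[1] != rare_label]
    if not rare:
        raise ValueError("no rows with the rare label")
    # Solve n_rare / (n_rare + n_common) >= target_share for n_rare.
    needed = math.ceil(target_share * len(common) / (1 - target_share))
    extra = [rng.choice(rare) for _ in range(max(0, needed - len(rare)))]
    combined = common + rare + extra
    rng.shuffle(combined)
    new_rows, new_labels = zip(*combined)
    return list(new_rows), list(new_labels)

# Toy data (assumed): 95 non-events and 5 events.
rows = list(range(100))
labels = [0] * 95 + [1] * 5
new_rows, new_labels = oversample_rare_class(rows, labels, rare_label=1, target_share=0.3)
print(new_labels.count(1), new_labels.count(0))  # 41 95
```

Oversampling should be applied to the training split only. Since it changes the event rate the model sees, predicted probabilities usually need to be adjusted back to the original rate afterwards.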
Model Training
Overfitting:
- The main reason behind overfitting is a model with high variance and low bias, which fits the training data closely but fails to generalize to new data
Steps to overcome:
- Regularization - a technique that constrains the fitted function by adding a penalty term to the error function
- Noise injection - artificially adding "noise" to the input data during the training process
- Cross validation - a technique for assessing how the results of a statistical analysis generalize to an independent data set
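Regularization and cross validation work well together: the penalty controls how much the model is constrained, and cross validation estimates which penalty generalizes best. The sketch below uses ridge (L2) regression in closed form as one example of a penalty term, with a simple k-fold loop. The toy data, the penalty values and the function names are assumptions.

```python
import numpy as np

def ridge_fit(X, y, penalty):
    """Closed-form ridge regression: minimize ||Xw - y||^2 + penalty * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

def k_fold_mse(X, y, penalty, k=5, seed=0):
    """Average mean squared error on the held-out fold over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], penalty)
        errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Toy problem (assumed): few samples, many features, only 3 of which matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 30))
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=40)

# Compare held-out error across a few penalty strengths.
for penalty in [0.001, 1.0, 10.0, 100.0]:
    print(penalty, round(k_fold_mse(X, y, penalty), 3))
```

The penalty with the lowest cross-validated error is the one to prefer. A near-zero penalty tends to overfit a problem like this one, while a very large penalty shrinks the weights so much that the model underfits.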
Computational resource exploitation:
- Most of the time, we use single-threaded algorithm implementations
- Heavy reliance on interpreted languages
Steps to overcome:
- Train many single-threaded models in parallel
- Hardware acceleration, for example GPUs and SSDs
- Low-level native libraries
- Cloud resources, for example Google Colab notebooks
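The first step can be sketched with Python's standard `concurrent.futures` module. Here `train_one` is a hypothetical stand-in for any single-threaded training run; each seed is trained in its own worker process, so the runs can use separate CPU cores.

```python
import random
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def train_one(seed):
    """Hypothetical single-threaded training job.

    Fits y = w * x by gradient descent on noisy toy data (true w is 3.0)
    and returns the seed together with the learned weight.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(200)]
    ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]
    w = 0.0
    for _ in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= 0.5 * grad
    return seed, w

def train_in_parallel(seeds, executor_factory=ProcessPoolExecutor, max_workers=None):
    """Run one independent training job per seed and collect the results.

    Processes give true parallelism for CPU-bound Python code. Passing
    ThreadPoolExecutor instead is mainly useful for quick tests or I/O-bound jobs.
    """
    with executor_factory(max_workers=max_workers) as pool:
        return list(pool.map(train_one, seeds))
```

When using the default process pool, call `train_in_parallel(range(8))` from under an `if __name__ == "__main__":` guard, because some platforms start worker processes by re-importing the main module.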
Ensemble models:
- A single model sometimes fails to provide adequate accuracy
- A single model can also overfit, with high variance and low bias that keep it from generalizing properly
Steps to overcome:
- Ensemble techniques like bagging, boosting and stacking can help overcome the problem
- A custom or manual combination of predictions sometimes helps in achieving better accuracy
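A manual combination of predictions can be as simple as a majority vote over class labels or a weighted average of scores. The sketch below shows both; the model outputs and the weights are made-up values for illustration.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions: one list per model, one entry per sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # Pick the label with the most votes; ties go to the label seen first.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

def weighted_average(scores_per_model, weights):
    """Combine numeric scores, giving more say to models with larger weights."""
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, sample_scores)) / total
        for sample_scores in zip(*scores_per_model)
    ]

# Hypothetical class predictions from three models on four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # [1, 0, 1, 1]

# Hypothetical scores from two models, trusting the first three times as much.
blended = weighted_average([[0.9, 0.2], [0.6, 0.4]], weights=[3, 1])
print([round(s, 3) for s in blended])  # [0.825, 0.25]
```

Using an odd number of models avoids ties in a two-class vote. The weights are usually chosen from each model's performance on a validation set.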
Hyperparameter tuning:
- Combinatorial explosion, the rapid growth in the number of candidate settings as more hyperparameters and values are considered, happens with hyperparameters in conventional algorithms. For example, 5 hyperparameters with 10 candidate values each give 100,000 combinations.
Steps to overcome:
- Local search optimization and related heuristics such as genetic algorithms
- Grid search or random search techniques help in finding the best combination of hyperparameters from the values or ranges we supply.
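Grid search and random search can be sketched in a few lines of standard-library Python. The function `validation_error` is a hypothetical stand-in for training a model and measuring its validation error, and the hyperparameter names and ranges are assumptions.

```python
import itertools
import math
import random

def validation_error(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it.

    By construction the error is lowest at learning_rate=0.1 and depth=6.
    """
    return (math.log10(learning_rate) + 1) ** 2 + 0.05 * (depth - 6) ** 2

def grid_search(grid):
    """Try every combination of the listed values and keep the best one."""
    names = list(grid)
    best_error, best_params = float("inf"), None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = validation_error(**params)
        if error < best_error:
            best_error, best_params = error, params
    return best_error, best_params

def random_search(samplers, n_trials, seed=0):
    """Draw each hyperparameter at random for a fixed number of trials."""
    rng = random.Random(seed)
    best_error, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in samplers.items()}
        error = validation_error(**params)
        if error < best_error:
            best_error, best_params = error, params
    return best_error, best_params

# 4 x 4 = 16 combinations; every extra hyperparameter multiplies this count.
grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]}
samplers = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform draw
    "depth": lambda rng: rng.randint(2, 8),
}
print(grid_search(grid))
print(random_search(samplers, n_trials=16))
```

Grid search can only return values that were listed, while random search samples the whole range under a fixed trial budget, which is why it is often preferred when there are many hyperparameters.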
Model Interpretation:
- A large number of parameters and rules makes the model difficult to interpret
Steps to overcome:
- Variable selection by using regularization techniques
- Surrogate models
- Interpretation methods like LIME
- Partial dependence plots and feature importance graphs can assist in interpreting the models
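The numbers behind a partial dependence plot are simple to compute: force one feature to a fixed value for every row, average the model's predictions, and repeat over a grid of values. In the sketch below, `black_box` is a hypothetical model used only for illustration.

```python
def partial_dependence(predict, rows, feature_index, grid_values):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid_values:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # override only the chosen feature
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def black_box(row):
    """Hypothetical opaque model: linear in feature 0, quadratic in feature 1."""
    x0, x1 = row
    return 2.0 * x0 + x1 ** 2

rows = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
curve = partial_dependence(black_box, rows, feature_index=0, grid_values=[0.0, 1.0, 2.0])
print([round(v, 3) for v in curve])  # [4.667, 6.667, 8.667]
```

The curve rises by 2.0 for every unit of feature 0, which recovers that feature's effect without looking inside the model. Plotting the curve against the grid values gives the partial dependence plot.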
Model Deployment
Model deployment:
- The trained model's logic must be moved from the development environment to an operational computing system, where it can assist an organization in making decisions
Steps to overcome:
- Web-service scoring lets people send data to the model and get results on demand
- Dashboards of the model's output are easier for an organization to understand
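A minimal sketch of web-service scoring using only Python's standard library. The weights, feature names and port are hypothetical, and a production service would normally sit behind a proper web framework with input validation and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: fixed linear weights stand in for a real one.
WEIGHTS = {"age": 0.02, "income": 0.00001}
BIAS = -1.0

def score(features):
    """Return the model's score for one record given as a dict."""
    return BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, score it, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length))
            body, status = {"score": score(features)}, 200
        except (ValueError, KeyError, TypeError) as exc:
            body, status = {"error": str(exc)}, 400
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8000):
    """Start the scoring service on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

After calling `serve()`, a client can POST a JSON record such as `{"age": 40, "income": 50000}` to `http://127.0.0.1:8000/` and receive the score in the response.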
Model decay:
- Since the model was created, the business problem and market conditions might have changed
- New observations fall outside the domain of the training data
Steps to overcome:
- Monitor the model regularly and watch for decreases in accuracy
- Update the model whenever there are changes in the data or system that affect it
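Monitoring can start with something as small as a rolling accuracy check. The sketch below keeps the most recent outcomes in a fixed-size window and flags the model once accuracy drops below a threshold; the class name, window size and threshold are assumptions.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag decay."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True once a full window shows accuracy below the threshold."""
        # Wait for a full window so a few early errors do not raise an alert.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold

monitor = AccuracyMonitor(window=5, threshold=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_attention())  # 1.0 False

# Two wrong predictions arrive; the window now holds 3 correct out of 5.
monitor.record(1, 0)
monitor.record(0, 1)
print(monitor.accuracy(), monitor.needs_attention())  # 0.6 True
```

An alert from a monitor like this is a signal to investigate and possibly retrain. It only works when true outcomes arrive reasonably soon after predictions are made.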
Thanks for reading till the end. I hope you liked it!
Previously published at https://medium.com/@siddhesh_jadhav/how-to-deal-with-major-challenges-in-machine-learning-1fc7e719bd0b