How to Deal With Major Challenges in Machine Learning

by Siddhesh, March 14th, 2020

We often get blocked at different steps while working on a machine learning problem. To help with most of these blocks, I have listed the major challenges we face and the steps we can take to overcome them. For easier understanding, I have grouped these challenges into three subdomains: Data Preparation, Model Training and Model Deployment.

Data Preparation

Data collection: 

  1. Incomplete data is a common headache when we start collecting data. Even when we do get data, it often turns out to be biased. Bias is any deviation from the truth in data collection or data analysis that can lead to false conclusions.
  2. Then comes the curse of dimensionality, which refers to phenomena that arise when analyzing high-dimensional data and do not occur in low-dimensional spaces.
  3. Finally, we have the data sparsity problem. Imagine a table with lots of null or impossible values. These values represent the sparsity in your data.

Steps to overcome:

  1. Dedicate proper time to understanding the problem and the datasets you need to solve it
  2. Enrich the data
  3. Dimension-reduction techniques
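
As one illustration of a dimension-reduction technique, here is a minimal sketch of principal component analysis using power iteration, written with only the Python standard library. The function names (`top_component`, `project`) and the toy data are assumptions made for this example; in practice a library implementation would be the usual choice.

```python
import math
import random


def top_component(rows, iterations=200, seed=0):
    """Return column means and the first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        # Assumes the data is not constant, so the norm is non-zero.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Project each row onto the component, giving one number per row."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]


# Toy data: three columns, but every point lies on a single line.
data = [[float(t), 2.0 * t, -float(t)] for t in range(10)]
means, pc1 = top_component(data)
reduced = project(data, means, pc1)  # one value per row instead of three
```

Here three correlated columns collapse into a single coordinate without losing information, because the toy data is exactly one-dimensional. Real data would lose some variance, and how much is acceptable depends on the problem.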

Outliers:

  1. Out-of-range numerical values or unknown categorical values in our data
  2. They have a drastic influence on squared loss functions, because squaring magnifies large errors

Steps to overcome:

  1. Discretization techniques like binning can help reduce the influence of outliers on squared loss functions
  2. Robust methods like Huber loss functions
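
To make the difference between squared loss and a robust loss concrete, the sketch below compares the two on a handful of residuals, one of which plays the outlier. The threshold `delta=1.0` and the residual values are arbitrary choices for illustration.

```python
def squared_loss(residual):
    """Half the squared error; grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)


residuals = [0.5, -1.0, 0.2, 20.0]  # the last value plays the outlier
total_squared = sum(squared_loss(r) for r in residuals)
total_huber = sum(huber_loss(r) for r in residuals)
print(total_squared, total_huber)
```

With squared loss, the single outlier accounts for almost the entire total. With Huber loss its contribution grows only linearly, so it pulls the fit far less.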

Missing Data:

  1. Missing data results in information loss and therefore affects the model’s accuracy
  2. Information bias, which happens when key information is measured, collected, or interpreted inaccurately

Steps to overcome:

  1. Tree-based modelling techniques can help in dealing with such problems
  2. Discretization can also help here, for example by giving missing values their own bin
  3. Imputation
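
A minimal sketch of imputation, assuming missing entries are represented as `None`: each gap is filled with the median of the observed values, and a companion flag records where the gaps were so that a model can still use that signal. The column name and values are made up for the example.

```python
from statistics import median


def impute_median(column):
    """Fill None with the median of observed values; also return a missing flag."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return filled, was_missing


ages = [34, None, 29, 51, None, 40]  # hypothetical column with two gaps
filled, was_missing = impute_median(ages)
```

The median is only one possible fill value. The mean, a constant, or a model-based estimate are common alternatives, and the fill value should be computed on training data only.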

Sparse target variables:

  1. It happens when the primary event has a low occurrence rate
  2. Overwhelming preponderance of zero or missing values in the target

Steps to overcome:

  1. Proportional oversampling
  2. Mixture models
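
One way oversampling of a rare event could look is sketched below: rows with the event are duplicated at random until they reach a chosen share of the data. The function name, the `target_rate` parameter and the toy labels are assumptions for illustration, not a prescribed recipe.

```python
import random


def oversample_events(rows, labels, target_rate=0.5, seed=0):
    """Duplicate random positive rows until they make up about target_rate."""
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives:
        raise ValueError("no positive examples to oversample")
    # Positives needed so that pos / (pos + neg) is close to target_rate.
    needed = round(target_rate * len(negatives) / (1.0 - target_rate))
    extra = [rng.choice(positives) for _ in range(max(0, needed - len(positives)))]
    chosen = negatives + positives + extra
    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 5 events among 100 rows.
rows = [[i] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]
new_rows, new_labels = oversample_events(rows, labels)
```

Oversampling belongs on the training split only. Because it changes the event rate the model sees, predicted probabilities usually need to be adjusted back to the original rate afterwards.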

Model Training

Overfitting:

  1. The main reason behind overfitting is a model with high variance and low bias, which fails to generalize properly

Steps to overcome:

  1. Regularization - a technique that constrains the fitted function by adding a penalty term to the error function
  2. Noise Injection - this method refers to artificially adding "noise" to the input data during the training process
  3. Cross validation - a technique for assessing how well the results of a statistical analysis generalize to an independent data set
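
Regularization and cross validation can be combined in a few lines. The sketch below, which assumes a one-feature ridge regression for brevity, adds a penalty `alpha * slope**2` to the squared error and uses k-fold cross validation to compare penalty strengths. The data and the candidate `alpha` values are invented for the example.

```python
import random


def fit_ridge(xs, ys, alpha):
    """1-D ridge regression: the penalty alpha * slope**2 shrinks the slope."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    return slope, y_mean - slope * x_mean


def cross_val_mse(xs, ys, alpha, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_ridge([xs[i] for i in train_idx],
                                     [ys[i] for i in train_idx], alpha)
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Synthetic data: a noisy straight line.
rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
for alpha in (0.0, 10.0, 1000.0):
    print(alpha, round(cross_val_mse(xs, ys, alpha), 3))
```

The held-out error, not the training error, is what decides between penalty strengths. That is the point of cross validation: it estimates how the model behaves on data it has not seen.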

Computational resource exploitation:

  1. Most of the time, we use single-threaded algorithm implementations
  2. Heavy reliance on interpreted languages

Steps to overcome:

  1. Train many single-threaded models in parallel
  2. Hardware acceleration, for example GPUs and SSDs
  3. Low-level native libraries
  4. Cloud, for example Google Colab notebooks
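
Training many single-threaded models in parallel might look like the following sketch, which uses `concurrent.futures` from the standard library. `train_one` is a hypothetical stand-in for a real training job, and the process-based pool is chosen on the assumption that the job is CPU-bound pure Python, which does not run in parallel across threads in standard CPython builds.

```python
import concurrent.futures as cf
import random


def train_one(seed):
    """Hypothetical single-threaded training job: fit a slope on synthetic data."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(1000)]
    ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return seed, slope


def train_many(seeds, executor_cls=cf.ProcessPoolExecutor, workers=4):
    """Run one training job per seed, each in its own worker."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(train_one, seeds))
```

A call such as `train_many(range(8))` is best placed under an `if __name__ == "__main__":` guard, since process pools may re-import the main module on some platforms. Passing `executor_cls=cf.ThreadPoolExecutor` is an option when the training code spends its time in native libraries or I/O.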

Ensemble models:

  1. A single model sometimes fails to provide adequate accuracy
  2. A single model can also overfit: high variance and low bias that fail to generalize properly

Steps to overcome:

  1. Ensemble methods like bagging, boosting and stacking can help overcome the problem
  2. A custom or manual combination of predictions sometimes helps in achieving better accuracy
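
Of the ensemble methods above, bagging is the simplest to sketch: fit the same kind of model on several bootstrap samples and average the predictions. The example below uses a straight-line fit as the base model purely to keep the code short. Bagging is more commonly applied to high-variance learners such as decision trees, and all names and data here are illustrative.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:  # degenerate sample with a single distinct x value
        return 0.0, y_mean
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    return slope, y_mean - slope * x_mean


def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Average the predictions of models fitted on bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)


# Synthetic data: a noisy straight line.
rng = random.Random(0)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 3.0) for x in xs]
prediction = bagged_predict(xs, ys, x_new=10.0)
```

Averaging over resampled fits reduces the variance of the final prediction, which is why bagging helps most when the individual models overfit.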

Hyper parameter tuning:

  1. Combinatorial explosion, a rapid growth in the complexity of a problem driven by how its combinatorics depend on the input, happens with hyperparameters in conventional algorithms: every additional hyperparameter multiplies the number of combinations to try.

Steps to overcome:

  1. Local search optimization, which also includes genetic algorithms
  2. Grid search or random search techniques help in finding the best combination of hyperparameters from the ones we feed in.
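
The difference between grid search and random search can be sketched on a small model with two hyperparameters. The example assumes a distance-weighted k-nearest-neighbour regressor with hyperparameters `k` and `power`; the model, the candidate values and the data are all chosen only for illustration.

```python
import itertools
import math
import random


def knn_predict(train, x, k, power):
    """Distance-weighted k-nearest-neighbour regression in one dimension."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    weights = [1.0 / (abs(px - x) ** power + 1e-9) for px, _ in nearest]
    return sum(w * py for w, (_, py) in zip(weights, nearest)) / sum(weights)


def validation_mse(train, valid, k, power):
    """Mean squared error on the held-out validation points."""
    return sum((y - knn_predict(train, x, k, power)) ** 2 for x, y in valid) / len(valid)


def grid_search(train, valid, ks, powers):
    """Try every combination; the cost grows as len(ks) * len(powers)."""
    return min(itertools.product(ks, powers),
               key=lambda hp: validation_mse(train, valid, *hp))


def random_search(train, valid, ks, powers, n_trials=6, seed=0):
    """Try a fixed budget of random combinations instead of all of them."""
    rng = random.Random(seed)
    trials = [(rng.choice(ks), rng.choice(powers)) for _ in range(n_trials)]
    return min(trials, key=lambda hp: validation_mse(train, valid, *hp))


# Synthetic data: a noisy sine curve, split into training and validation halves.
rng = random.Random(1)
points = [(x / 10.0, math.sin(x / 10.0) + rng.gauss(0, 0.2)) for x in range(60)]
train, valid = points[::2], points[1::2]
ks, powers = [1, 3, 5, 9], [0.0, 1.0, 2.0]
best_grid = grid_search(train, valid, ks, powers)
best_random = random_search(train, valid, ks, powers)
```

Grid search evaluates all 12 combinations here, while random search evaluates only 6. With more hyperparameters the grid grows multiplicatively, whereas the random-search budget stays whatever we set it to.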

Model Interpretation:

  1. A large number of parameters and rules makes the model difficult to interpret

Steps to overcome:

  1. Variable selection by using regularization techniques
  2. Surrogate models
  3. Interpretation methods like LIME
  4. Partial dependence plots and feature importance graphs can assist in interpreting the models
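
The numbers behind a partial dependence plot are straightforward to compute by hand: force one feature to a fixed value in every row, average the model's predictions, and repeat over a grid of values. In the sketch below, `black_box` is a hypothetical stand-in for a trained model, and the rows and grid are made up.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the original row is untouched
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def black_box(row):
    """Hypothetical stand-in for a trained model."""
    return 3.0 * row[0] + row[1] ** 2


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
curve = partial_dependence(black_box, rows, feature=0, grid=[0.0, 1.0, 2.0])
```

Plotting `curve` against the grid shows how the prediction responds to feature 0 on average. For this toy model the curve rises by 3 per unit, matching the coefficient hidden inside `black_box`.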

Model Deployment

Model deployment:

  1. Trained model logic must be moved from a development environment to an operational computing system to assist an organization in making decisions

Steps to overcome:

  1. Web-service scoring can help people get the results
  2. Dashboards of the model's output are easier for any organization to understand
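
A web-service scoring endpoint can be very small. The sketch below uses only Python's built-in `http.server` and exposes a hypothetical linear model over HTTP. The feature names, weights and port are assumptions for the example, and a production service would normally sit behind a proper web framework with input validation and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: coefficients of a linear scorer.
WEIGHTS = {"age": 0.02, "income": 0.00001}
BIAS = -0.5


def score(features):
    """Return the model score for one record given as a dict."""
    return BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, score it, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"score": score(payload)}).encode()
            status = 200
        except (ValueError, KeyError, TypeError):
            body = json.dumps({"error": "invalid input"}).encode()
            status = 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the scoring service; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

After calling `serve()`, a client can POST a JSON record such as `{"age": 40, "income": 50000}` to the service and receive the score back, so other systems never need to load the model themselves.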

Model decay:

  1. Since the time the model was created, the business problem and market conditions might have changed
  2. New observations fall outside the domain of the training data

Steps to overcome:

  1. Regular monitoring of the model, especially when the accuracy decreases
  2. Update the model regularly whenever there are changes in the data or system affecting the model
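
Regular monitoring can start as simply as tracking accuracy over the most recent predictions once the true outcomes become known. The class below is a minimal sketch of that idea. The window size and threshold are arbitrary, and the class name is made up for this example.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag decay."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old entries drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
```

Calling `monitor.needs_retraining()` after each batch of labelled outcomes gives a simple trigger for updating the model. Monitoring the distribution of incoming features is a useful complement when true labels arrive late.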

Thanks for reading till the end, and I hope you liked it!

Previously published at https://medium.com/@siddhesh_jadhav/how-to-deal-with-major-challenges-in-machine-learning-1fc7e719bd0b