Years back, when Spotify was working on its recommendation engine, they faced challenges related to the quality of the data used for training ML algorithms.
Had they not decided to go back to the data preparation stage and invest additional effort in cleaning, normalizing, and transforming their data, chances are our listening experience wouldn’t be as enjoyable.
Thoroughly preparing data for machine learning allowed the streaming platform to train a powerful ML engine that accurately predicts users’ listening preferences and offers highly personalized music recommendations.
Spotify avoided a crucial mistake many companies make when preparing data for machine learning: underinvesting in the stage or skipping it altogether.
Many businesses assume that feeding large volumes of data into an ML engine is enough to generate accurate predictions. In reality, this can lead to a number of problems, such as algorithmic bias or limited scalability.
The success of machine learning depends heavily on data. And the hard truth is that every dataset is flawed. That is why data preparation is crucial for machine learning. It helps rule out the inaccuracies and bias inherent in raw data, so that the resulting ML model generates more reliable and accurate predictions.
In this blog post, we highlight the importance of preparing data for machine learning and share our approach to collecting, cleaning, and transforming data. So, if you’re new to ML and want to ensure your initiative turns out a success, keep reading.
The first step towards successfully adopting ML is clearly formulating your business problem. Not only does it ensure that the ML model you’re building is aligned with your business needs, but it also allows you to save time and money on preparing data that might not be relevant.
Additionally, a clear problem statement makes the ML model explainable (meaning users understand how it makes decisions). It’s especially important in sectors like healthcare and finance, where machine learning has a major impact on people’s lives.
With the business problem nailed down, it’s time to kick off the data work.
Overall, the process of preparing data for machine learning can be broken down into the following stages:

1. Data collection
2. Data cleaning
3. Data transformation
4. Data splitting

Let’s have a closer look at each.
Data preparation for machine learning starts with data collection. During the data collection stage, you gather data for training and tuning the future ML model. As you do, keep in mind the type, volume, and quality of the data: these factors will determine the best data preparation strategy.
Machine learning uses three types of data: structured, unstructured, and semi-structured.
The structure of the data determines the optimal approach to preparing data for machine learning. Structured data, for example, can be easily organized into tables and cleaned via deduplication, filling in missing values, or standardizing data formats.
In contrast, extracting relevant features from unstructured data requires more complex techniques, such as natural language processing or computer vision.
The optimal approach to data preparation for machine learning is also affected by the volume of training data. A large dataset may require sampling, which involves selecting a subset of the data to train the model due to computational limitations. A smaller one, in turn, may require data scientists to take additional steps to generate more data based on the existing data points (more on that below.)
The quality of collected data is crucial as well. Inaccurate or biased data skews ML output, which can have significant consequences, especially in areas such as finance, healthcare, and criminal justice. There are techniques that allow data to be corrected for errors and bias; however, they may not work on a dataset that is inherently skewed.

Once you know what makes “good” data, you must decide how to collect it and where to find it, whether from internal systems, public datasets, or third-party data providers.
Sometimes, though, these strategies don’t yield enough data. You can compensate for the lack of data points with techniques like data augmentation or synthetic data generation.
The next step in preparing data for machine learning is cleaning it. Data cleaning involves finding and correcting errors, inconsistencies, and missing values. There are several approaches to doing that:
Handling missing data
Missing values are a common issue in machine learning. They can be handled by imputation (filling in missing values with predicted or estimated data), interpolation (deriving missing values from the surrounding data points), or deletion (removing rows or columns with missing values from the dataset).
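To make the three options concrete, here is a minimal sketch in pandas, using a small hypothetical sensor dataset (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.0, 24.0, np.nan],
    "humidity": [40.0, 42.0, np.nan, 46.0, 48.0],
})

# Imputation: fill gaps with a column statistic, here the mean
imputed = df.fillna(df.mean())

# Interpolation: derive missing values from neighboring data points
interpolated = df.interpolate()

# Deletion: drop every row that contains a missing value
deleted = df.dropna()
```

Which option fits best depends on how much data you can afford to lose and how predictable the missing values are from their neighbors.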
Handling outliers
Outliers are data points that significantly differ from the rest of the dataset. They can occur due to measurement errors, data entry errors, or simply because they represent unusual or extreme observations. In a dataset of employee salaries, for example, an outlier may be an employee who earns significantly more or less than the others. Outliers can be handled by removing them, transforming them to reduce their impact, winsorizing (replacing extreme values with the nearest values within the normal range of the distribution), or treating them as a separate class of data.
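Winsorizing, for instance, can be sketched with NumPy by clipping values to chosen percentiles. The salary figures below are hypothetical, and the 5th/95th percentile cutoffs are one common choice, not a fixed rule:

```python
import numpy as np

# Hypothetical salaries with one extreme value
salaries = np.array([48_000, 52_000, 50_000, 55_000, 53_000, 400_000])

# Winsorizing: clip everything to the 5th and 95th percentiles
low, high = np.percentile(salaries, [5, 95])
winsorized = np.clip(salaries, low, high)
```

The extreme salary is pulled in toward the rest of the distribution instead of being dropped, so the row is kept while its leverage on the model shrinks.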
Removing duplicates
Another step in the process of preparing data for machine learning is removing duplicates. Duplicates not only skew ML predictions but also waste storage space and increase processing time, especially in large datasets. To remove duplicates, data scientists resort to a variety of duplicate identification techniques (like exact matching, fuzzy matching, hashing, or record linkage). Once identified, duplicates can be either dropped or merged. In imbalanced datasets, however, duplicated records of the minority class can actually be welcome as a way to balance the class distribution.
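The simplest of these techniques, exact matching, is a one-liner in pandas. The customer records below are hypothetical:

```python
import pandas as pd

# Hypothetical customer records; rows 0 and 2 are exact duplicates
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "city": ["Oslo", "Riga", "Oslo", "Lima"],
})

# Exact matching: rows identical across all columns count as duplicates
deduplicated = df.drop_duplicates()
```

Fuzzy matching and record linkage require dedicated tooling, but they follow the same pattern: identify candidate pairs, then drop or merge them.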
Handling irrelevant data
Irrelevant data refers to the data that is not useful or applicable to solving the problem. Handling irrelevant data can help reduce noise and improve prediction accuracy. To identify irrelevant data, data teams resort to such techniques as principal component analysis, correlation analysis, or simply rely on their domain knowledge. Once identified, such data points are removed from the dataset.
Handling incorrect data
Data preparation for machine learning must also include handling incorrect and erroneous data. Common techniques for dealing with such data include data transformation (changing the data so that it meets the set criteria) or removing incorrect data points altogether.
Handling imbalanced data
An imbalanced dataset is one in which the number of data points in one class is significantly lower than in another. This can result in a biased model that prioritizes the majority class while ignoring the minority class. To deal with the issue, data teams may resort to such techniques as resampling (either oversampling the minority class or undersampling the majority class to balance the distribution of data), synthetic data generation (generating additional data points for the minority class synthetically), cost-sensitive learning (assigning a higher weight to the minority class during training), ensemble learning (combining multiple models trained on different data subsets using different algorithms), and others.
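As one sketch of the resampling option, scikit-learn's `resample` utility can oversample a minority class with replacement until the classes match. The feature values here are randomly generated stand-ins, not real data:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 20 majority vs. 4 minority samples
rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(20, 2))
X_minor = rng.normal(3, 1, size=(4, 2))

# Oversample the minority class with replacement until the classes match
X_minor_up = resample(X_minor, replace=True,
                      n_samples=len(X_major), random_state=0)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_up))
```

Oversampling by duplication is the simplest variant; synthetic generation methods such as SMOTE instead interpolate new minority points rather than repeating existing ones.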
These activities help ensure that the training data is accurate, complete, and consistent. That is a big achievement, but it is not yet enough to produce a reliable ML model. So, the next step in preparing data for machine learning is making sure the data points in the training dataset conform to specific rules and standards. This stage of the data management process is referred to as data transformation.
During the data transformation stage, you convert raw data into a format suitable for machine learning algorithms. That, in turn, ensures higher algorithmic performance and accuracy.
Our experts in preparing data for machine learning name the following common data transformation techniques:
Scaling
In a dataset, different features may use different units of measurement. For example, a real estate dataset may include the number of rooms in each property (ranging from one to ten) and the price (ranging from $50,000 to $1,000,000). Without scaling, it is challenging to balance the importance of both features: the algorithm might give too much weight to the feature with larger values, in this case the price, and not enough to the feature with seemingly smaller values. Scaling solves this problem by transforming all data points to fit a specified range, typically between 0 and 1. This lets you compare different variables on an equal footing.
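The rooms-and-prices example above can be sketched with scikit-learn's `MinMaxScaler`, which maps each feature to [0, 1] independently (the property values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical real estate features: rooms (1-10) and price ($50k-$1M)
X = np.array([
    [1,    50_000],
    [4,   300_000],
    [10, 1_000_000],
], dtype=float)

# Rescale every feature to the [0, 1] range
scaled = MinMaxScaler().fit_transform(X)
```

After scaling, a property with 4 rooms maps to (4 - 1) / (10 - 1) ≈ 0.33 on the rooms axis, so both columns now contribute on comparable scales.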
Normalization
Another technique used in data preparation for machine learning is normalization. It is similar to scaling, but while scaling changes the range of a dataset, normalization reshapes its distribution, typically so that the values have zero mean and unit variance.
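In scikit-learn, this zero-mean, unit-variance form of normalization (often called standardization) is provided by `StandardScaler`; the feature values below are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with an uneven spread of values
X = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Standardize: subtract the mean, divide by the standard deviation
normalized = StandardScaler().fit_transform(X)
```

The transformed column is centered at zero with unit spread, which many algorithms (e.g., those using gradient descent or distance metrics) assume implicitly.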
Encoding
Categorical data has a limited number of values, for example, colors, car models, or animal species. Because machine learning algorithms typically work with numerical data, categorical data must be encoded in order to be used as an input. So, encoding stands for converting categorical data into a numerical format. There are several encoding techniques to choose from, including one-hot encoding, ordinal encoding, and label encoding.
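One-hot encoding, for example, is built into pandas via `get_dummies`; the color values are a hypothetical categorical feature:

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
```

Each row ends up with exactly one "hot" column, so the model sees no artificial ordering between colors, which is the risk with plain label encoding.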
Discretization
Discretization is an approach to preparing data for machine learning that transforms continuous variables, such as time, temperature, or weight, into discrete ones. Consider a dataset that contains information about people’s height. The height of each person can be measured as a continuous variable in feet or centimeters. However, for certain ML algorithms, it might be necessary to discretize this data into categories, say, “short”, “medium”, and “tall”. This is exactly what discretization does. It helps simplify the training dataset and reduce the complexity of the problem. Common approaches to discretization include clustering-based and decision-tree-based discretization.
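The height example above can be sketched with `pandas.cut`. The bin edges (165 cm and 180 cm) are arbitrary cutoffs chosen for illustration, not standard definitions:

```python
import pandas as pd

# Hypothetical heights in centimeters
heights = pd.Series([150, 162, 175, 168, 190, 183])

# Discretize the continuous variable into three labeled bins
categories = pd.cut(heights,
                    bins=[0, 165, 180, 250],
                    labels=["short", "medium", "tall"])
```

Here the bin edges are fixed by hand; clustering-based and decision-tree-based discretization instead learn the edges from the data itself.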
Dimensionality reduction
Dimensionality reduction refers to reducing the number of features or variables in a dataset while preserving only the information relevant to solving the problem. Consider a dataset containing information on customers’ purchase history. It features the date of purchase, the item bought, the price of the item, and the location where the purchase took place. By reducing the dimensionality of this dataset, we omit all but the most important features, say, the item purchased and its price. Dimensionality reduction can be done with a variety of techniques, among them principal component analysis, linear discriminant analysis, and t-distributed stochastic neighbor embedding.
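Principal component analysis, the most common of these, is a few lines in scikit-learn. The data below is randomly generated as a stand-in for a real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 6 samples, 4 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 4))

# Keep the two components that explain the most variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
```

Note that, unlike hand-picking columns, PCA produces new composite features (linear combinations of the originals) rather than a subset of the original ones.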
Log transformation
Another way of preparing data for machine learning, log transformation, refers to applying a logarithmic function to the values of a variable in a dataset. It is often used when the training data is highly skewed or has a large range of values. Applying a logarithmic function can help make the distribution of data more symmetric.
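A minimal sketch with NumPy, using hypothetical right-skewed values; `log1p` (log of 1 + x) is a common variant that also handles zeros safely:

```python
import numpy as np

# Hypothetical right-skewed values, e.g. incomes
values = np.array([1_000, 2_000, 5_000, 10_000, 1_000_000], dtype=float)

# Apply a logarithmic transformation to compress the large values
log_values = np.log1p(values)
```

The transformation preserves the ordering of the values while drastically shrinking the gap between the bulk of the data and the extreme tail.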
Speaking of data transformation, we should mention feature engineering, too. While it is a form of data transformation, it is more than a technique or a step in the process of preparing data for machine learning. It stands for selecting, transforming, and creating features in a dataset. Feature engineering involves a combination of statistical, mathematical, and computational techniques, including the use of ML models, to create features that capture the most relevant information in the data.
It is usually an iterative process that requires testing and evaluating different techniques and feature combinations in order to come up with the best approach to solving a problem.
The next step in the process of preparing data for machine learning involves dividing all gathered data into subsets — the process known as data splitting. Typically, the data is broken down into a training, validation, and testing dataset.
By splitting the data, we can assess how well a machine learning model performs on data it hasn’t seen before. With no splitting, chances are the model would perform poorly on new data. This can happen because the model may have just memorized the data points instead of learning patterns and generalizing them to new data.
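A common baseline is an 60/20/20 train/validation/test split, which can be sketched by calling scikit-learn's `train_test_split` twice (the dataset here is a hypothetical 100-sample array):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 100 samples
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out a 20% test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then a validation set from the remaining 80% (0.25 * 80% = 20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)
```

The test set is held out until the very end, while the validation set is used during development for tuning and model selection.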
There are several approaches to data splitting, and the choice of the optimal one depends on the problem being solved and the properties of the dataset. Our experts in preparing data for machine learning say that it often requires some experimentation from the data team to determine the most effective splitting strategy. The most common approaches include a simple random split, a stratified split that preserves the class proportions in each subset, and k-fold cross-validation.
Properly preparing data for machine learning is essential to developing accurate and reliable machine learning solutions. At ITRex, we understand the challenges of data preparation and the importance of having a quality dataset for a successful machine learning process.
If you want to maximize the potential of your data through machine learning, contact the ITRex team. Our experts will provide assistance in collecting, cleaning, and transforming your data.