Feature selection, also known as variable selection, attribute selection, or predictor selection (rarely, generalization), is a type of abstraction, the process of selecting a subset of significant features (both dependent and independent variables) for model construction.
Feature selection is used for four reasons:
Simplifying the model to enhance interpretability
To reduce training time
To avoid the curse of dimensionality
To improve the model's generalization ability and combat overfitting
The curse of dimensionality is a problem associated with the exponential increase in the amount of data due to an increase in the dimensionality of space. This leads to the following difficulties:
The computational laboriousness
The need to store a huge amount of data
Increase in the proportion of noise
In linear classifiers, an increase in the number of features leads to problems of multicollinearity and overfitting.
In metric classifiers, distances are usually calculated as the average modulus of differences across all features. According to the Law of Large Numbers, the sum of n addends tends to a certain fixed limit as n→∞. Thus, distances in all pairs of objects tend to have the same value and, therefore, become uninformative.
The main idea behind using the feature selection technique is that data contain certain features, and if some of them are redundant or insignificant, they can be removed without significant loss of information. "Redundant" and "insignificant" are two different concepts, as one significant feature may be redundant in the presence of another significant feature with which it is strongly correlated.
Feature selection methods are usually divided into three classes based on how they combine selection algorithms and model construction.
Next, we will discuss a little theory about each of these methods and then look at more specific examples for each of them.
These methods use an indirect indicator instead of an error metric to evaluate a subset of features. This indicator is chosen so that it can be easily calculated while retaining the usefulness of the feature set. Commonly applied metrics include mutual information, Pearson's mixed moment correlation coefficient, distance between/within classes, or significance criteria results for each class/feature combination.
Filtering is usually computationally less expensive than wrapping, but at the same time, filtering yields feature sets that are not tuned to a specific type of predictive model. This disadvantage means that the feature set obtained from filtering is more general than the set obtained from wrapper methods and completely unrelated to future models, leading to a lower generalization ability of the model than with wrapper methods, as features are selected without the involvement of any models. However, the feature set obtained after filtering does not contain assumptions about the predictive model and, therefore, is more suitable for detecting relationships between features.
Many filtering techniques provide feature ranking, not giving a clearly better subset, and the cutoff point in ranking is chosen using cross-validation.
Filter methods are also used as preliminary processing steps for wrapper methods, allowing filtering to be applied to larger tasks. Filter methods are very time-efficient in computation and resistant to overfitting.
Wrapper methods for feature selection use a model of prior performance assessment to rank subsets of features. Each new subset is used to train a model, which is then tested on a control sample. The metric calculated on this control sample gives an evaluation for the given subset.
The two main disadvantages of these methods are:
Increased risk of overfitting when the number of observations is insufficient.
Significant computation time when the number of variables is large.
Since wrapper methods involve iterating over all subsets of features with subsequent model training, they are the most computationally expensive but generally yield the best feature set for a specific model.
Embedded methods are a generalizing group of techniques that carry out feature selection as part of the model construction process. The training algorithm has the advantage of its own variable selection process and performs feature selection and classification simultaneously. We will discuss this method in more detail later with examples.
In the future, to practice code examples, an experimental dataset taken from the extremely popular task of binary classification for predicting the survivors of the Titanic (https://www.kaggle.com/competitions/titanic/data) will be used.
Filter methods are based on statistical methods and typically consider each feature independently. They allow us to assess and rank features by significance, which is taken as the degree of their correlation with the target variable. Let's look at some examples.
The method is closely related to the concept of information entropy. The formula for entropy is quite simple:
To better understand the meaning of this measure, we can consider two simple examples. First, the tossing of a coin where the probability of heads and tails is equal. In this case, the entropy calculated by the formula will be 1. However, if the coin always lands heads up, then the entropy will be 0. In other words, entropy close to 1 indicates a uniform distribution, and close to 0 indicates something more interesting.
To calculate the correlation between variables, we need to define a few more measures. The first of these is the partial conditional entropy:
Conditional entropy is calculated as:
What's interesting is not the measure itself but its difference from the usual entropy of the feature Y. That is a measure of how much more orderly the variable Y becomes for us if we know the values of X. Or, put simply, whether there is a correlation between the values of X and Y and how strong it is. This is indicated by the value - mutual information (mutual information):
The larger the IG parameter, the stronger the correlation. Thus, we can easily calculate the mutual information indicator for all features and remove those that weakly affect the target variable. Thereby, first, reducing the time to compute the model and, second, reducing the risk of overfitting.
So, to test it out, we will write our own function make_mi_scores
using the mutual_info_classif
from the sklearn
library, which calculates the mutual information score for features and also adds the function plot_scores
to display the mutual information.
from sklearn.feature_selection import mutual_info_classif
def make_mi_scores(X, y, discrete_features):
mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)
return mi_scores
mi_scores = make_mi_scores(train, train_y, discrete_features='auto')
mi_scores[::3] # show a few features with their MI scores
def plot_scores(scores,name):
scores = scores.sort_values(ascending=True)
width = np.arange(len(scores))
ticks = list(scores.index)
plt.barh(width, scores)
plt.yticks(width, ticks)
plt.title(name)
plt.figure(dpi=100, figsize=(8, 8))
plt.rcParams.update({'font.size': 9})
plot_scores(mi_scores, "Mutual Information Scores")
And the output is:
The Pearson correlation coefficient characterizes the existence of a linear relationship between two quantities.
The Pearson correlation coefficient can be used to determine the strength of the linear relationship between quantities (other types of relationships are identified using regression analysis methods).
It is necessary to understand the difference between the concepts of "independence" and "uncorrelatedness.” The former implies the latter, but not vice versa. For example, the correlation between age and height in children is quite obvious in terms of cause-and-effect, but the correlation between mood and people's health is less obvious. Does improving mood lead to better health, or does good health lead to a good mood, or both? Or is there another underlying factor for both? In other words, correlation can be seen as evidence of a possible causal connection but cannot indicate what the causal connection might be if there is one.
Correlation is usually calculated:
Between each feature and the target variable to determine the importance of each feature in describing the behavior of the target variable.
Pairwise between all features to identify similar features within the space.
The Pearson coefficient belongs to the number of parametric statistical criteria. This means that it can only be used if the parameters we are correlating satisfy certain conditions, the main one being the normality of distribution.
So the code example will look like this:
train_pearson_corr_with_target = (
train.corrwith(train_y["Survived"], method="pearson")
.to_frame()
.rename(columns={0: "corr"})
.join(
train.corrwith(train_y["Survived"], method="pearson")
.to_frame()
.abs()
.rename(columns={0: "abs_corr"})
)
).sort_values("abs_corr",ascending=False)
train_pearson_corr_with_target.head()
plt.figure(dpi=100, figsize=(8, 8))
plt.rcParams.update({'font.size': 8})
plot_scores(train_pearson_corr_with_target.abs_corr,'Pearson Correlation')
The output:
In the graph above, we can observe the distribution of correlation values with the target variable across different features.
It is noticeable that features with high mutual information scores also have high Pearson correlation coefficients.
The Spearman correlation coefficient is a measure of the linear relationship between random variables. Spearman's correlation is rank-based, meaning that instead of numerical values, the corresponding ranks are used for assessing correlation. The rank of an observation, in this case, is its ordinal number in the variational series, i.e., instead of the actual values of the feature, the number in the sequence of feature values arranged in non-decreasing order is used. In other words, the ordinal number after sorting in ascending order.
The coefficient is invariant to any monotonic transformation of the measurement scale. Therefore, the parameters between which the correlation is calculated do not have to be normally distributed.
Data often do not have a normal distribution, and in this case, strictly speaking, the Pearson coefficient cannot be used. In such cases, Spearman's coefficients should be used.
Thus, Spearman's method is more universal and less stringent than Pearson's. Pearson's accuracy is slightly higher, but this is not fundamentally important in the context of machine learning tasks.
So, the code example will look quite the same as Pearson’s one:
train_spearman_corr_with_target = (
train.corrwith(train_y["Survived"], method="spearman")
.to_frame()
.rename(columns={0: "corr"})
.join(
train.corrwith(train_y["Survived"], method="spearman")
.to_frame()
.abs()
.rename(columns={0: "abs_corr"})
)
).sort_values("abs_corr",ascending=False)
train_spearman_corr_with_target.head()
plt.figure(dpi=100, figsize=(8, 8))
plt.rcParams.update({'font.size': 8})
plot_scores(train_pearson_corr_with_target.abs_corr,'Pearson Correlation')
The output:
According to the fundamental differences, there are several key differences and applications between them:
In machine learning, Pearson's correlation is useful when the data is normally distributed, and the relationship between variables is expected to be linear.
Spearman's correlation is more versatile for machine learning tasks, especially when the data does not follow a normal distribution or when the relationship between variables is not strictly linear.
To sum up, the choice between Pearson and Spearman correlation coefficients depends largely on the nature of the data and the specific requirements of the analysis. For machine learning applications, Spearman's correlation tends to be more universally applicable, particularly when dealing with real-world data that may not adhere to the assumptions of normality and linearity required by Pearson's correlation.
In addition to filtering by correlation values with the target variable, it is extremely useful to check the correlation of features among themselves to identify redundant data. To do this, we will calculate the correlations between each feature using the corr()
function (without specifying additional parameters, Pearson's correlation will be calculated) and display it in a heatmap.
import seaborn as sns
corr = train.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr, cmap="coolwarm")
plt.show()
As we can see from the heatmap, we have features with quite high absolute values of correlation. Therefore, from pairs with high values, we can select only one feature.
In this introductory article of our series on feature selection, we've begun exploring the crucial concept of selecting the most relevant features for efficient model construction in machine learning. Feature selection is pivotal for simplifying models, reducing computation time, and enhancing model performance by avoiding the curse of dimensionality.
We touched upon various feature selection approaches, including filter, wrapper, and embedded methods, each offering unique advantages in data analysis. Specifically, filter methods stand out for their cost-effective way of filtering out large amounts of features. They efficiently identify key relationships between features without being tailored to any specific model, making them a versatile and essential tool in the initial stages of feature selection.
As we continue this series, we will delve deeper into each approach, unpacking its intricacies and applications. Our focus will be on providing concise, practical insights into the art and science of feature selection, ensuring that readers are well-equipped to handle the challenges of model optimization in machine learning. Stay tuned for more in-depth exploration and guidance in the upcoming articles of this series.