Scikit-learn is the most popular free, open-source Python machine learning library for data scientists and machine learning practitioners. It provides a wide range of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
In this article, I’m happy to share with you the top 5 new features introduced in the new version of scikit-learn (1.0).
First, make sure you have the latest version installed (with pip):
pip install --upgrade scikit-learn
If you are using conda, use the following command:
conda install -c conda-forge scikit-learn
Note: Version 1.0 of scikit-learn requires Python 3.7+, NumPy 1.14.6+, and SciPy 1.1.0+. Matplotlib 2.2.2+ is an optional dependency required for the plotting features.
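To confirm that the upgrade worked, you can print the installed version:
import sklearn
print(sklearn.__version__)  # should print 1.0.0 or later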
Now, let’s look at the new features!
Scikit-learn 1.0 introduces a new flexible plotting API with Display classes such as metrics.PrecisionRecallDisplay, metrics.DetCurveDisplay, and inspection.PartialDependenceDisplay.
Each Display class comes with two class methods:
(a) from_estimator()
This class method takes a fitted model together with the data, then computes and plots the results in one step.
Let's look at an example that uses PrecisionRecallDisplay to visualize the precision-recall curve.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Create a synthetic dataset and train a classifier
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Compute predictions on the test set and plot the precision-recall curve in one step
display = PrecisionRecallDisplay.from_estimator(classifier, X_test, y_test)
plt.show()
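The other Display classes mentioned above follow the same pattern. As a minimal sketch (my own example, reusing classifier, X_test, and y_test from the snippet above), here is the same call with DetCurveDisplay:
from sklearn.metrics import DetCurveDisplay

# Same from_estimator pattern with another Display class
DetCurveDisplay.from_estimator(classifier, X_test, y_test)
plt.show()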
(b) from_predictions()
With this class method, you simply pass the true labels and the predictions and get your plot.
Let's look at an example by using ConfusionMatrixDisplay to visualize the confusion matrix.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Create a synthetic dataset and train a classifier
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Pass the true labels first, then the predictions
predictions = classifier.predict(X_test)
display = ConfusionMatrixDisplay.from_predictions(y_test, predictions,
                                                  display_labels=classifier.classes_)
plt.show()
In the new version of scikit-learn, you can track the column names of your pandas DataFrame when working with transformers or estimators.
When you pass a DataFrame to an estimator and call its fit method, the estimator stores the column names in the feature_names_in_ attribute.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit a transformer on a DataFrame with named columns
X = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["age", "days", "duration"])
scaler = StandardScaler().fit(X)
print(scaler.feature_names_in_)
['age' 'days' 'duration']
Note: feature names support is only enabled when all of the column names in the DataFrame are strings.
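As a quick illustration of that note (my own sketch, not from the release notes), an estimator fitted on a DataFrame with non-string column names simply does not set the attribute:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Integer column names, so feature names are not tracked
X = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[0, 1, 2])
scaler = StandardScaler().fit(X)
print(hasattr(scaler, "feature_names_in_"))  # False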
The new r_regression function in feature selection measures the linear relationship between each feature and the target in regression tasks. This measure is also known as Pearson's r.
The correlation between each regressor and the target is computed as
mean((X[:, i] - mean(X[:, i])) * (y - mean(y))) / (std(X[:, i]) * std(y))
Note: X is the feature matrix, y is the target variable, and the outer mean is taken over the samples.
The following example shows how you can compute the Pearson’s r for each feature and the target.
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import r_regression

# Load the California housing dataset (downloaded on first use)
X, y = fetch_california_housing(return_X_y=True)
print(X.shape)

# Pearson's r between each of the 8 features and the target
p = r_regression(X, y)
print(p)
(20640, 8)
[ 0.68807521  0.10562341  0.15194829 -0.04670051 -0.02464968 -0.02373741
 -0.14416028 -0.04596662]
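To make the formula concrete, here is a quick manual check with NumPy (my own sketch, reusing X and y from the snippet above); computing Pearson's r for the first feature by hand should reproduce the first value returned by r_regression:
import numpy as np

# Pearson's r for the first feature, computed directly from the formula above
r_manual = np.mean((X[:, 0] - X[:, 0].mean()) * (y - y.mean())) / (X[:, 0].std() * y.std())
print(r_manual)  # approximately 0.68807521, matching r_regression's first value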
The OneHotEncoder in scikit-learn 1.0 can handle categories it has not seen during fitting. You just need to set the handle_unknown parameter to 'ignore' (handle_unknown='ignore') when instantiating the transformer.
When you then transform data that contains an unknown category, the encoded columns for that feature will be all zeros.
In the following example, we pass an unknown category when transforming the data.
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on two known categories: 'primary' and 'secondary'
enc = OneHotEncoder(handle_unknown='ignore')
X = [['secondary'], ['primary'], ['primary']]
enc.fit(X)

# 'degree' was not seen during fitting, so its row is encoded as all zeros
transformed = enc.transform([['degree'], ['primary'], ['secondary']]).toarray()
print(transformed)
[[0. 0.]
 [1. 0.]
 [0. 1.]]
Note: In the inverse transform, an unknown category will be labeled as None.
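For example, continuing with the encoder fitted above, a row of all zeros has no matching category and maps back to None:
# An all-zero row cannot be matched to a known category
print(enc.inverse_transform([[0.0, 0.0], [1.0, 0.0]]))
[[None]
 ['primary']]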
The two histogram-based gradient boosting estimators that were experimental in earlier versions of scikit-learn (HistGradientBoostingRegressor and HistGradientBoostingClassifier) are no longer experimental. You no longer need to import enable_hist_gradient_boosting from sklearn.experimental; you can simply import and use them as:
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
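As a quick usage sketch (my own example, not from the release notes), they can now be trained like any other scikit-learn estimator:
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Train on a synthetic dataset and evaluate on a held-out split
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = HistGradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set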
There are more new features in scikit-learn 1.0 than I covered in this article. You can find the full release highlights here.
Congratulations, you have made it to the end of this article! I hope you have learned something new that will help you on your next machine learning project.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @Davis_McDavid.
And you can read more articles like this here.