Scikit-learn is the most popular free, open-source Python machine learning library for data scientists and machine learning practitioners. It provides a wide range of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
In this article, I’m happy to share with you the top 5 new features introduced in the new version of scikit-learn (1.0).
First, make sure you have the latest version installed (with pip):
pip install --upgrade scikit-learn
If you are using conda, use the following command:
conda install -c conda-forge scikit-learn
Note: Version 1.0 of scikit-learn requires Python 3.7+, NumPy 1.14.6+, and SciPy 1.1.0+. Matplotlib 2.2.2+ is an optional dependency required for the plotting features.
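To confirm that the upgrade worked, you can print the installed version:
import sklearn
print(sklearn.__version__)  # should print 1.0.0 or later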
Now, let’s look at the new features!
Scikit-learn 1.0 introduces a new flexible plotting API with Display classes such as metrics.PrecisionRecallDisplay, metrics.DetCurveDisplay, and inspection.PartialDependenceDisplay.
Each Display class comes with two class methods:
(a) from_estimator()
This class method takes a fitted model together with the data, then computes and plots the results in one step.
Let's look at an example that uses PrecisionRecallDisplay to visualize the precision-recall curve.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Create a synthetic dataset and train a classifier
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Compute predictions on the test set and plot the precision-recall curve in one step
display = PrecisionRecallDisplay.from_estimator(classifier, X_test, y_test)
plt.show()
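The other Display classes mentioned above follow the same pattern. As a minimal sketch (my own example, reusing classifier, X_test, and y_test from the snippet above), here is the same call with DetCurveDisplay:
from sklearn.metrics import DetCurveDisplay

# Same from_estimator pattern with another Display class
DetCurveDisplay.from_estimator(classifier, X_test, y_test)
plt.show()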
(b) from_predictions()
With this class method, you simply pass the true labels and the predictions and get your plot.
Let's look at an example by using ConfusionMatrixDisplay to visualize the confusion matrix.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Create a synthetic dataset and train a classifier
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Pass the true labels first, then the predictions
predictions = classifier.predict(X_test)
display = ConfusionMatrixDisplay.from_predictions(y_test, predictions,
                                                  display_labels=classifier.classes_)
plt.show()
In the new version of scikit-learn, you can track the column names of your pandas DataFrame when working with transformers or estimators.
When you pass a DataFrame to an estimator and call its fit method, the estimator stores the column names in the feature_names_in_ attribute.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit a transformer on a DataFrame with named columns
X = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["age", "days", "duration"])
scaler = StandardScaler().fit(X)
print(scaler.feature_names_in_)
['age' 'days' 'duration']
Note: feature names support is only enabled when all of the column names in the DataFrame are strings.
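As a quick illustration of that note (my own sketch, not from the release notes), an estimator fitted on a DataFrame with non-string column names simply does not set the attribute:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Integer column names, so feature names are not tracked
X = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[0, 1, 2])
scaler = StandardScaler().fit(X)
print(hasattr(scaler, "feature_names_in_"))  # False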
The new r_regression function in feature selection measures the linear relationship between each feature and the target in regression tasks. This measure is also known as Pearson's r.
The correlation between each regressor and the target is computed as
mean((X[:, i] - mean(X[:, i])) * (y - mean(y))) / (std(X[:, i]) * std(y))
Note: X is the feature matrix, y is the target variable, and the outer mean is taken over the samples.
The following example shows how you can compute the Pearson’s r for each feature and the target.
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import r_regression

# Load the California housing dataset (downloaded on first use)
X, y = fetch_california_housing(return_X_y=True)
print(X.shape)

# Pearson's r between each of the 8 features and the target
p = r_regression(X, y)
print(p)
(20640, 8)
[ 0.68807521  0.10562341  0.15194829 -0.04670051 -0.02464968 -0.02373741
 -0.14416028 -0.04596662]
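To make the formula concrete, here is a quick manual check with NumPy (my own sketch, reusing X and y from the snippet above); computing Pearson's r for the first feature by hand should reproduce the first value returned by r_regression:
import numpy as np

# Pearson's r for the first feature, computed directly from the formula above
r_manual = np.mean((X[:, 0] - X[:, 0].mean()) * (y - y.mean())) / (X[:, 0].std() * y.std())
print(r_manual)  # approximately 0.68807521, matching r_regression's first value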
The OneHotEncoder in scikit-learn 1.0 can handle categories it has not seen during fitting. You just need to set the handle_unknown parameter to 'ignore' (handle_unknown='ignore') when instantiating the transformer.
When you then transform data that contains an unknown category, the encoded columns for that feature will be all zeros.
In the following example, we pass an unknown category when transforming the data.
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on two known categories: 'primary' and 'secondary'
enc = OneHotEncoder(handle_unknown='ignore')
X = [['secondary'], ['primary'], ['primary']]
enc.fit(X)

# 'degree' was not seen during fitting, so its row is encoded as all zeros
transformed = enc.transform([['degree'], ['primary'], ['secondary']]).toarray()
print(transformed)
[[0. 0.]
 [1. 0.]
 [0. 1.]]
Note: In the inverse transform, an unknown category will be labeled as None.
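For example, continuing with the encoder fitted above, a row of all zeros has no matching category and maps back to None:
# An all-zero row cannot be matched to a known category
print(enc.inverse_transform([[0.0, 0.0], [1.0, 0.0]]))
[[None]
 ['primary']]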
The two histogram-based gradient boosting estimators that were experimental in earlier versions of scikit-learn (HistGradientBoostingRegressor and HistGradientBoostingClassifier) are no longer experimental. You no longer need to import enable_hist_gradient_boosting from sklearn.experimental; you can simply import and use them as:
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
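As a quick usage sketch (my own example, not from the release notes), they can now be trained like any other scikit-learn estimator:
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Train on a synthetic dataset and evaluate on a held-out split
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = HistGradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set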
There are more new features in scikit-learn 1.0 than I covered in this article. You can find the full release highlights here.
Congratulations, you have made it to the end of this article! I hope you have learned something new that will help you on your next machine learning project.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @Davis_McDavid.
And you can read more articles like this here.