
Scikit-Learn 0.24: Top 5 New Features

by Davis David, May 20th, 2021

Too Long; Didn't Read

The scikit-learn library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. In this article, I’m happy to share with you the 5 best new features in the latest version of the Scikit-Learn library, which supports Python versions 3.6 to 3.9. The new features include a new evaluation metric for regression problems called Mean Absolute Percentage Error (MAPE) and a new method for feature selection.


Scikit-learn remains one of the most popular open-source and free machine learning libraries for Python. The scikit-learn library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering, and dimensionality reduction.

Many data scientists, machine learning engineers, and researchers rely on this library for their machine learning projects. I personally love using the scikit-learn library because it offers a ton of flexibility, and its documentation is easy to understand, with a lot of examples.

In this article, I’m happy to share with you the 5 best new features in scikit-learn 0.24.

Install the Latest Version of the Scikit-Learn Library

First, make sure you have installed the latest version (with pip):

pip install --upgrade scikit-learn

If you are using conda, use the following command:

conda install -c conda-forge scikit-learn

Note: This version supports Python versions 3.6 to 3.9.
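
If you want to confirm that the upgrade worked, you can print the installed version (a quick sanity check):

import sklearn

# Should print 0.24.x (or a newer release) after the upgrade
print(sklearn.__version__)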

Now, let’s look at the new features!

1. Mean Absolute Percentage Error (MAPE)

The new version of scikit-learn introduces a new evaluation metric for regression problems called Mean Absolute Percentage Error (MAPE). Previously, you had to calculate MAPE yourself with a line of NumPy:

np.mean(np.abs((y_test - preds) / y_test))

But now you can call the mean_absolute_percentage_error function from the sklearn.metrics module to evaluate the performance of your regression model.

Example:

from sklearn.metrics import mean_absolute_percentage_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_absolute_percentage_error(y_true, y_pred))

0.3273809523809524

Note: Keep in mind that the function does not represent the output as a percentage in the range [0, 100]. Instead, the output lies in the range [0, 1/eps]. The best value is 0.0.
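
For example (these numbers are purely illustrative), a single very small true value can push the score far above 1, which is why the output should not be read as a percentage out of 100:

from sklearn.metrics import mean_absolute_percentage_error

# A tiny true value inflates the relative error for that sample:
# (|0.001 - 1| / 0.001 + |1 - 1| / 1) / 2 ≈ 499.5
y_true = [0.001, 1.0]
y_pred = [1.0, 1.0]
print(mean_absolute_percentage_error(y_true, y_pred))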

2. OneHotEncoder Supports Missing Values

OneHotEncoder can now handle missing values if they are present in the dataset; it treats a missing value as an additional category. Let’s look at how this works in the following example.

First, import the required packages.

import pandas as pd 
import numpy as np
from sklearn.preprocessing import OneHotEncoder

Create a simple DataFrame with a categorical feature that has missing values.

# Initialize data as a dictionary of lists
data = {'education_level': ['primary', 'secondary', 'bachelor', np.nan, 'masters', np.nan]}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the output
print(df)

As you can see, we have two missing values in our education_level column.

Create the instance of OneHotEncoder.

enc = OneHotEncoder()

Then fit and transform our data.

enc.fit_transform(df).toarray()

Our education_level column has been transformed, and all missing values were treated as a new category (check the last column of the resulting array).
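
If you want to double-check how the missing values were handled, you can inspect the categories learned by the encoder; the NaN entry appears as its own category with its own column in the encoded array:

# Each np.nan becomes its own learned category,
# which maps to the extra column in the encoded array
print(enc.categories_)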

3. New Method for Feature Selection

SequentialFeatureSelector is a new method for feature selection in scikit-learn. It can perform either forward selection or backward selection.

(a) Forward Selection

It iteratively finds the best new feature and adds it to the set of selected features. We start with zero features and find the single feature that maximizes the cross-validation score of an estimator. The selected feature is added to the set, and the procedure is repeated until we reach the desired number of selected features.

(b) Backward Selection

Backward selection follows the same idea but in the opposite direction: we start with all features and iteratively remove a feature from the set until we reach the desired number of selected features.

Example

Import the required packages.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

Load the iris dataset and its feature names.

X, y = load_iris(return_X_y=True, as_frame=True)
feature_names = X.columns

Create the instance of the estimator.

knn = KNeighborsClassifier(n_neighbors=3)

Create the instance of SequentialFeatureSelector, set the number of features to select to be 2, and set the direction to be “backward”.

sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')

Finally, fit the selector to learn which features to select.

sfs.fit(X, y)

Show selected features.

print("Features selected by backward sequential selection: "
      f"{feature_names[sfs.get_support()].tolist()}")

Features selected by backward sequential selection: ['petal length (cm)', 'petal width (cm)']
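
For comparison, the forward direction described in (a) only requires changing the direction argument. A minimal sketch reusing the same iris data and KNN estimator:

# Forward selection: start with zero features and add the best one at a time
sfs_forward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                        direction='forward')
sfs_forward.fit(X, y)

print("Features selected by forward sequential selection: "
      f"{feature_names[sfs_forward.get_support()].tolist()}")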

The only downside of this new feature selection method is that it can be slower than other methods you may already know (SelectFromModel and RFE) because it evaluates candidate features with cross-validation.

4. New Methods for Hyper-Parameter Tuning

When it comes to hyper-parameter tuning, GridSearchCV and RandomizedSearchCV from scikit-learn have been the first choice for many data scientists. But the new version adds two classes for hyper-parameter tuning called HalvingGridSearchCV and HalvingRandomSearchCV.

HalvingGridSearchCV and HalvingRandomSearchCV use a new approach called successive halving to find the best hyper-parameters. Successive halving works like a tournament among all hyper-parameter combinations.

How does successive halving work?

In the first iteration, all hyper-parameter combinations are trained on a small subset of the observations (training data). In the next iteration, only the combinations that performed well in the first iteration are kept, and they compete again on a larger number of observations.

This selection process is repeated at each iteration until only the best combination of hyper-parameters remains in the final iteration.

Note: These classes are still experimental.

Example:

Import important packages.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # must come before the model_selection import below
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint

Since these new classes are still experimental, we must explicitly import enable_halving_search_cv before importing HalvingRandomSearchCV, as shown above.

Create a classification dataset by using the make_classification method.

X, y = make_classification(n_samples=1000)

Create the instance of the estimator. Here we use a Random Forest Classifier.

clf = RandomForestClassifier(n_estimators=20)

Create the parameter distributions for tuning.

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

Then we instantiate the HalvingRandomSearchCV class with the RandomForestClassifier as the estimator and the parameter distributions defined above.

rsh = HalvingRandomSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    cv=5,
    factor=2,
    min_resources=20)

There are two important parameters in HalvingRandomSearchCV you need to know.

(a) factor - This determines the proportion of hyper-parameter combinations that are selected for each subsequent iteration, as well as how quickly the resources per candidate grow. For example, factor=3 means that only one-third of the candidates are selected for the next iteration, and each of them is trained on three times as many resources.

(b) min_resources - This is the amount of resources (by default, the number of observations) allocated to each combination of hyper-parameters in the first iteration.
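
To build intuition for how these two parameters interact, here is a small stand-alone sketch (not scikit-learn's internal code) that reproduces the candidate/resource schedule for factor=2 and min_resources=20, assuming 50 initial candidates as in the run shown below:

# Illustration only: each iteration keeps roughly 1/factor of the candidates
# and multiplies the resources (samples) per candidate by factor
n_candidates, n_resources, factor = 50, 20, 2
while n_candidates > 1:
    print(f"{n_candidates} candidates trained on {n_resources} samples each")
    n_candidates = -(-n_candidates // factor)  # ceiling division
    n_resources *= factor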

Finally, we can fit the search object that we have created with our dataset.

rsh.fit(X,y)

After training, we can inspect several attributes of the fitted search object, such as:

  • The number of iterations.
print(rsh.n_iterations_)

6

  • The number of candidate parameters that were evaluated at each iteration.
print(rsh.n_candidates_)

[50, 25, 13, 7, 4, 2]

  • The number of resources used at each iteration.
print(rsh.n_resources_)

[20, 40, 80, 160, 320, 640]

  • Parameter setting that gave the best results on the hold-out data.
print(rsh.best_params_)

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 5,
 'min_samples_split': 2}
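
Like GridSearchCV and RandomizedSearchCV, the fitted halving search also exposes the usual search attributes, for example the best cross-validation score and the refitted best model:

# Mean cross-validated score of the best candidate
print(rsh.best_score_)

# Estimator refitted on the whole dataset with the best parameters
best_model = rsh.best_estimator_
print(best_model.predict(X[:5]))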

5. New Self-Training Meta-Estimator for Semi-Supervised Learning

Scikit-learn 0.24 introduces a new self-training implementation for semi-supervised learning called SelfTrainingClassifier. It can be used with any supervised classifier that can return probability estimates for each class.

This means any supervised classifier can function as a semi-supervised classifier, allowing it to learn from unlabeled observations in the dataset.

Note: The unlabeled values in the target column must have a value of -1.

Let’s understand more about how it works in the following example.

Import important packages.

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

In this example, we will use the iris dataset and a support vector machine (SVC) as the supervised classifier (it implements fit and predict_proba).

Then we load the dataset and randomly select some of the observations to be unlabeled.

rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1

As you can see, the unlabeled values in the target column now have a value of -1.

Create an instance of the supervised estimator.

svc = SVC(probability=True, gamma="auto")

Create an instance of the self-training meta-estimator and pass svc as the base_estimator.

self_training_model = SelfTrainingClassifier(base_estimator=svc)

Finally, we can train self_training_model on the iris dataset, which now has some unlabeled observations.

self_training_model.fit(iris.data, iris.target)

SelfTrainingClassifier(base_estimator=SVC(gamma='auto', probability=True))
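
Once fitted, the meta-estimator behaves like an ordinary classifier, so you can call predict (or predict_proba) on new observations. A quick usage check:

# Predict class labels for the first five observations
print(self_training_model.predict(iris.data[:5]))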

Final Thoughts on Scikit-Learn 0.24

As I said, scikit-learn remains one of the most popular open-source machine learning libraries, with all the features you need to complete an end-to-end machine learning project. You can start using the impressive new features presented in this article in your own machine learning projects.

You can find the highlights of other features released in scikit-learn 0.24 here.

Congratulations 👏👏, you have made it to the end of this article! I hope you have learned something new that will help you on your next machine learning or data science project.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

And you can read more articles like this here.