Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.
Building a machine learning model requires a series of steps, from data preparation, data cleaning, and feature engineering to model building and deployment. As a result, it can take a data scientist a lot of time to create a solution that solves a business problem.
To help speed up the process, you can use PyCaret, an open-source library. PyCaret can help you perform the end-to-end ML process faster, with only a few lines of code.
PyCaret is an open-source, low-code library in Python that aims to automate the development of machine learning models. It helps data scientists, analysts, ML engineers, and anyone learning machine learning to be more productive and reach conclusions faster.
The library has 70+ automated open-source algorithms and 25+ pre-processing techniques that can help you build machine learning models with good performance. It supports supervised learning (classification and regression), clustering, anomaly detection, and natural language processing tasks.
PyCaret is a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy, Optuna, Hyperopt, Ray, and many more.
You don't have to worry about data preparation, feature engineering, feature selection, or hyperparameter tuning. PyCaret can perform all these tasks automatically with just a few lines of code.
Another benefit of the library is that after building your machine learning model, you can directly deploy the transformation pipeline and trained model on Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
For classification problems, PyCaret reports Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC. For regression problems, it reports MAE, MSE, RMSE, R2, RMSLE, and MAPE.
In this article, you will learn how to use the PyCaret library to automate the end-to-end machine learning process with little manual configuration.
Installation is easy and takes only a few minutes. All dependencies are installed along with PyCaret. You can view a list of dependencies here.
pip install pycaret
In this tutorial, we will use a mobile price dataset, and the goal is to predict a price range indicating how high the price of each phone is.
You can download the dataset here:
You can load the dataset by using the pandas library.
# import packages
import pandas as pd
import numpy as np

# load data
data = pd.read_csv("/train.csv")
data.head()
Let's check the shape of the dataset.
As you can see, the dataset has 20 features and 1 target.
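The shape check itself is a single attribute lookup on the DataFrame. A minimal sketch, assuming the target is the last column and there is exactly one target; `describe_shape` is a hypothetical helper name, not part of pandas or PyCaret:

```python
def describe_shape(df):
    """Return (rows, feature_columns, target_columns) for a DataFrame.

    Assumes exactly one target column, as in this tutorial.
    """
    n_rows, n_cols = df.shape  # pandas exposes (rows, columns) as a tuple
    return n_rows, n_cols - 1, 1
```

Calling `describe_shape(data)` on the DataFrame loaded above should report 20 feature columns and 1 target column; `data.shape` alone gives the raw (rows, columns) tuple.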
The first step is to prepare the environment for your machine learning experiments by calling the setup() function from the pycaret.classification module.
In the setup() function, you pass the DataFrame that holds your dataset and the target variable, which for this problem is price_range. You can also set the experiment name and other settings.
from pycaret.classification import *

# set up the environment
grid = setup(data=data, target=data.columns[-1], html=False, silent=True,
             verbose=True, log_experiment=True, experiment_name='mobile_prices')
Note: As I have said before, PyCaret handles data preprocessing automatically. These steps are applied within setup(), and all the operations performed in PyCaret are stored sequentially in a Pipeline.
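One caveat about the setup() call: the `silent` argument appears to be accepted only by PyCaret 2.x and to have been removed in the 3.x releases. The sketch below tries the 2.x-style call first and retries without `silent` if the installed version rejects it. `init_experiment` is a hypothetical helper name, and treating any TypeError as "unknown argument" is a simplifying assumption:

```python
def init_experiment(data, target, **kwargs):
    """Call pycaret.classification.setup() in a version-tolerant way."""
    # Imported lazily so this file loads even where PyCaret is not installed.
    from pycaret.classification import setup

    try:
        # PyCaret 2.x style: silent=True skips the data-type confirmation prompt.
        return setup(data=data, target=target, html=False, silent=True, **kwargs)
    except TypeError:
        # Assumption: the TypeError means `silent` is not a known argument
        # in the installed version, so retry without it.
        return setup(data=data, target=target, html=False, **kwargs)
```

For example, `init_experiment(data, 'price_range', experiment_name='mobile_prices')` would forward the extra keyword argument unchanged to setup().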
Creating a model in PyCaret is simple and straightforward. You only need to pass one argument, the model name, to the create_model() function.
The create_model() function trains the algorithm and returns a table of k-fold cross-validated scores, along with their means, for evaluation metrics such as accuracy and F1.
In this example, we train a K Neighbors classifier by passing the string 'knn'. You can click here to see a complete list of more than 60 estimators available in the PyCaret library.
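The estimator IDs can also be listed from code with the `models()` function of the classification module. A small sketch, assuming setup() has already been run in the current session and that the returned table is indexed by model ID with a "Name" column; `available_estimators` is a hypothetical helper name:

```python
def available_estimators():
    """Return a {model_id: model_name} mapping for the classification module.

    Assumes setup() has already been called in the current session.
    """
    from pycaret.classification import models

    table = models()  # table indexed by model ID, with a "Name" column
    return dict(zip(table.index, table["Name"]))
```

Any key of the returned mapping, such as 'knn', can then be passed to create_model().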
# create model
knn = create_model('knn')
As you can see, the mean accuracy is 91.85%.
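If you want the score grid as an object rather than only as printed output, the `pull()` function returns the most recently displayed grid. A hedged sketch that also sets the number of folds through the `fold` argument (to my understanding the default is 10); `train_with_folds` is a hypothetical helper name:

```python
def train_with_folds(model_id="knn", fold=5):
    """Train one estimator and return it with its cross-validation grid."""
    from pycaret.classification import create_model, pull

    model = create_model(model_id, fold=fold)  # k-fold CV with k = fold
    scores = pull()  # the score grid that create_model just displayed
    return model, scores
```

The returned grid holds one row per fold plus the summary rows, so the mean accuracy quoted above can be read from it programmatically.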
With PyCaret, you can train and evaluate the performance of all estimators available in the model library using k-fold cross-validation. The compare_models() function returns a score grid with average cross-validated scores from all estimators.
best = compare_models()
The table above is sorted by accuracy, and the best-performing estimator is the K Neighbors Classifier, followed by Linear Discriminant Analysis.
Depending on the nature of your dataset, accuracy is not always a good evaluation metric, for example when the classes are imbalanced. You can choose other evaluation metrics to determine which model performs better than others.
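Ranking by a different metric is done through the `sort` argument of compare_models(). A minimal sketch, assuming setup() has already been run; `best_model_by` is a hypothetical helper name, and 'F1' is just one possible choice of metric:

```python
def best_model_by(metric="F1", top_n=1):
    """Compare all estimators and rank them by the given metric."""
    from pycaret.classification import compare_models

    # sort picks the ranking column; n_select controls how many of the
    # top models are returned (a single model when it is 1, else a list).
    return compare_models(sort=metric, n_select=top_n)
```

For instance, `best_model_by('F1', top_n=3)` would return the three estimators with the highest cross-validated F1.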
You can also improve the performance of your model by tuning its hyperparameters. The tune_model() function from PyCaret can automatically tune the hyperparameters of a machine learning model by using different search algorithms such as random search, grid search, and Bayesian search.
Now we can tune the KNN model to improve its performance.
# tune model
tuned_knn = tune_model(knn)
The output of the function is a score grid with CV scores and the trained model object.
After tuning the hyperparameters of the KNN model, the performance has improved from 91.85% to 93.00%.
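The search can be steered with the `optimize` and `n_iter` arguments of tune_model(). A hedged sketch; `tune_for` is a hypothetical helper name, and the values 'F1' and 50 are illustrative choices rather than recommendations:

```python
def tune_for(model, metric="F1", n_iter=50):
    """Tune a trained model, optimizing the given metric."""
    from pycaret.classification import tune_model

    # optimize selects the metric used to pick the best candidate;
    # n_iter is the number of hyperparameter settings that are tried.
    return tune_model(model, optimize=metric, n_iter=n_iter)
```

A larger `n_iter` explores more candidates at the cost of a longer run, and it does not guarantee a better score.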
You can evaluate your trained model by using the evaluate_model() function from PyCaret. The function displays a user interface for analyzing the performance of a trained model.
You can view plots and other performance details such as the confusion matrix, the ROC/AUC curve, the precision-recall curve, and the class report.
Note: This function only works in an IPython-enabled notebook.
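Outside a notebook, individual plots can still be produced with the `plot_model()` function. A minimal sketch that writes a few plots to image files in the working directory; `save_plots` is a hypothetical helper name, and the default plot list is only an example:

```python
def save_plots(model, plots=("confusion_matrix", "auc", "pr")):
    """Render selected evaluation plots for a trained model to files."""
    from pycaret.classification import plot_model

    for name in plots:
        # save=True writes the figure to disk instead of displaying it
        plot_model(model, plot=name, save=True)
    return list(plots)
```

Calling `save_plots(tuned_knn)` would therefore be a script-friendly alternative to the interactive interface.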
To make predictions on unseen data, you can use the predict_model() function. For a classification problem, the function predicts the Label and Score (the probability of the predicted class) using a trained model. When the data argument is None, it predicts the label and score on the test (holdout) set, which is 30% of the dataset by default.
holdout_prediction = predict_model(tuned_knn)
The tuned KNN model still performs well on the test set with an accuracy of 93.84%.
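To score rows that were never part of the experiment, pass them through the `data` argument. A hedged sketch, assuming `new_data` is a DataFrame with the same feature columns as the training data; `predict_new` is a hypothetical helper name:

```python
def predict_new(model, new_data):
    """Return new_data with prediction columns appended by PyCaret."""
    from pycaret.classification import predict_model

    # With data given, predictions are made on these rows rather than on
    # the holdout set. Note: the names of the appended columns appear to
    # differ between PyCaret versions, so check the output's columns.
    return predict_model(model, data=new_data)
```

Because the preprocessing pipeline is applied automatically, the new rows can be passed in raw form.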
After training and doing a lot of machine learning experiments to get the best performance, you can save the entire pipeline containing all preprocessing steps and trained model object as a binary pickle file by using the save_model() function.
You need to pass the trained model object and the name of the model that will be used to create a pickle file.
# saving model
save_model(tuned_knn, model_name = 'knn_model')
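The saved pipeline can later be restored with `load_model()` and used for prediction in a fresh session. A minimal sketch, assuming the pickle file created above is in the working directory and `new_data` has the training feature columns; `load_and_predict` is a hypothetical helper name:

```python
def load_and_predict(new_data, model_name="knn_model"):
    """Load a saved PyCaret pipeline and predict on new rows."""
    from pycaret.classification import load_model, predict_model

    # The name is given without the .pkl extension, matching save_model().
    pipeline = load_model(model_name)
    return predict_model(pipeline, data=new_data)
```

Since the file contains the preprocessing steps as well as the model, no separate call to setup() should be needed before predicting.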
In this article, you have learned the most important steps to build machine learning models by using the PyCaret library. The library has a lot of modules and examples to help you build machine learning models in different cases. Check the following resources if you are looking to go deeper.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @Davis_McDavid.
And you can read more articles like this here.