
Mobile Price Classification: An Open Source Data Science Project with Dagshub

by Davis David, December 9th, 2022

Machine learning models are often developed in a training environment, which may be online or offline, and can then be deployed to be used with live data once they have been tested.

One of the most important skills you need when working on data science and machine learning projects is the ability to deploy a model.

Model deployment is the process of integrating your model into an existing production environment. The model will receive input and predict an output. You are going to learn how to manage your machine learning project and deploy a machine learning model into production using the following open-source tools:

1. Dagshub
It is a web platform for data scientists and machine learning engineers to host and version code, data, experiments and machine learning models integrated with other open source tools like:

  • Git — tracking source code and other files 
  • DVC — tracking data and machine learning models
  • MLflow — tracking machine learning experiments.


2. Streamlit
It is an open-source Python library for creating and sharing web applications for projects in data science and machine learning. The library can assist you in developing and deploying a data science solution in a matter of minutes using only a few lines of code.

This tutorial covers the following topics:

  • Create and manage your machine learning project with Dagshub.
  • Build an ML model to classify mobile price ranges.
  • Deploy your ML model using Streamlit to create a simple Data science web app.

So let's get started.

How to Create a Project using Dagshub

After creating your account on Dagshub, you will be given different options to start creating your first project with Dagshub.

  • New Repository: Create a new repository directly on the Dagshub platform.
  • Migrate A Repo: Migrate a repository from GitHub to Dagshub.
  • Connect a Repo: Connect and manage your repository through both Github and Dagshub.


The interface of your new repository on DagsHub looks much like a repository on GitHub, with some additional tabs such as Experiments, Reports and Annotations.

You can clone and star the project repository on DagsHub to follow along throughout the article.

Mobile Price Dataset

We will use the Mobile Price dataset to classify the price range into different categories mentioned below.

  • 0 (low cost)
  • 1 (medium cost)
  • 2 (high cost)
  • 3 (very high cost)

The dataset is available here.

A copy is also available in the repository's data folder as data.csv. We will split the dataset into train and validation dataframes for training and validation.

Packages Installation 

In this project, we will use the following python packages.

  • Pandas for data manipulation.
  • sklearn for training machine learning algorithms.
  • MLflow for tracking machine learning experiments.
  • DVC (data version control) for tracking and versioning datasets and machine learning models.
  • Joblib for saving and loading machine learning models.
  • Streamlit for deploying the machine learning model in a web app.

All these packages are listed in the requirements.txt file. Install them by running the following command in your terminal.

pip install -r requirements.txt

Import Python Packages

After installing all packages, you need to import the packages before starting to use them.

# import packages

import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import mlflow

mlflow.sklearn.autolog()  # enable autologging for sklearn
mlflow.set_experiment('Ml-classification-experiment')
import joblib
import json
import os

np.random.seed(1234)

Note: With MLflow, you can automatically track machine learning experiments by calling the autolog() function from the mlflow.sklearn module.

Load and Version the Mobile Price Dataset

raw_data = pd.read_csv("data/raw/data.csv")

Data version control (DVC) is an open-source tool that tracks changes to your machine learning project's data and models. Once you have created your account, Dagshub provides 10 GB of free storage for DVC.

Within each repository, Dagshub will automatically generate a remote storage link as well as a list of commands to get your data tracking process started.

Run the following command to add the Dagshub DVC remote.

dvc remote add origin https://dagshub.com/Davisy/Mobile-Price-ML-Classification-Project.dvc

Note: The above command adds the repository as the remote for DVC storage. The URL for your own repository will differ slightly from the one shown here.

Then you can start tracking the dataset with the following command.

dvc commit -f data/raw.dvc

Let’s check the shape of the dataset.

print(raw_data.shape)

The dataset contains 21 columns (20 features and 1 target) and, luckily, it has no missing values.
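
The "no missing values" claim can be checked directly with pandas. Here is a minimal sketch that uses a small made-up frame as a stand-in for raw_data (the three columns are a subset of the real ones, and the values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for raw_data; values are made up for illustration.
sample = pd.DataFrame(
    {
        "battery_power": [842, 1021, 563],
        "ram": [2549, 2631, 2603],
        "price_range": [1, 2, 2],
    }
)

# Count missing values per column, then in total.
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())

print(sample.shape)    # (rows, columns)
print(total_missing)   # 0 means there are no missing values
```

Running the same two calls on raw_data should report 21 columns and a total of 0 if the dataset matches the description above.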

Split the mobile price data into features and target. The target column is called “price_range”.

features = raw_data.drop(['price_range'], axis=1)

target = raw_data.price_range.values

Data Preprocessing

The features should be standardized before fitting the machine learning algorithms. We will use StandardScaler from scikit-learn to perform the task.

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
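
To make the effect of standardization concrete, the sketch below compares StandardScaler with the formula (x - mean) / std computed by hand. The array is made up and the names demo, demo_scaler and manual are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales.
demo = np.array([[500.0, 1.0], [1000.0, 2.0], [1500.0, 3.0]])

demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo)

# The same result computed by hand: (x - mean) / std, per column.
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(demo_scaled, manual))
print(demo_scaled.mean(axis=0), demo_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so features such as ram and clock_speed contribute on a comparable scale.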

The next step is to split the data into train and validation sets. 20% of the mobile price dataset will be used for validation.

X_train, X_valid, y_train, y_valid = train_test_split(features_scaled, target, test_size=0.2,
                                          stratify=target,
                                          random_state=1)
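
The stratify argument keeps the share of each price range the same in both splits. A small sketch with a made-up, perfectly balanced target (the demo_ names are illustrative only) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced target: 50 samples for each of the four price ranges.
demo_target = np.repeat([0, 1, 2, 3], 50)
demo_features = np.arange(len(demo_target)).reshape(-1, 1)

_, _, demo_y_train, demo_y_valid = train_test_split(
    demo_features, demo_target, test_size=0.2, stratify=demo_target, random_state=1
)

# Each class keeps the same share in both splits.
train_counts = np.bincount(demo_y_train)
valid_counts = np.bincount(demo_y_valid)
print(train_counts, valid_counts)
```

Without stratify, a random split could leave one price range under-represented in the validation set, which would make the accuracy score less reliable.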

Here is a sample of the train set (first row of X_train).

print(X_train[0])
[ 1.56947055 -0.9900495   1.32109556 -1.01918398  0.15908825 -1.04396559
 -1.49088996  1.03435682  0.61459469  0.20963905  1.00341448 -0.93787756
 -0.57283137 -1.3169798   0.40204724  1.43112714  0.73023981  0.55964063
  0.99401789  0.98609664]

We need to track the processed data with DVC for efficiency and reproducibility. First, we create dataframes for both the train set and the valid set, and then save them in a processed folder as shown in the block of code below.

# create a dataframe for train set
X_train_df = pd.DataFrame(X_train, columns=list(features.columns))
y_train_df = pd.DataFrame(y_train, columns=["price_range"])

#combine features and target for train set
train_df = pd.concat([X_train_df, y_train_df], axis=1)

# create a dataframe for valid set
X_valid_df = pd.DataFrame(X_valid, columns=list(features.columns))
y_valid_df = pd.DataFrame(y_valid, columns=["price_range"])

# combine features and target for valid set
valid_df = pd.concat([X_valid_df, y_valid_df], axis=1)

# save processed train and valid set
train_df.to_csv('data/processed/data_train.csv', index_label='Index')
valid_df.to_csv('data/processed/data_valid.csv', index_label='Index')
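
Because the files are written with index_label='Index', they contain an extra first column named Index. If you reload them later, passing index_col="Index" keeps that column from being treated as a feature. A minimal sketch with made-up rows, written to memory instead of disk:

```python
import io

import pandas as pd

# Made-up processed rows standing in for train_df.
demo_df = pd.DataFrame({"ram": [0.5, -1.2], "price_range": [3, 0]})

# Write with a named index column, as in the code above (to memory, not disk).
buffer = io.StringIO()
demo_df.to_csv(buffer, index_label="Index")
buffer.seek(0)

# index_col="Index" restores the index instead of adding an extra column.
reloaded = pd.read_csv(buffer, index_col="Index")
print(list(reloaded.columns))
```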

Then run the following command to track the processed data (train and valid sets).

dvc commit -f process_data.dvc

Finally, we can save the trained standard scaler by using the dump method from the joblib package.

# save the trained scaler
joblib.dump(scaler, 'model/mobile_price_scaler.pkl')
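
Note that joblib.dump does not create missing folders, so the model directory must exist first (for example via os.makedirs("model", exist_ok=True)). The round trip can be sketched as follows, using a temporary folder and made-up data in place of the real scaler:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

demo = np.array([[500.0, 1.0], [1000.0, 2.0], [1500.0, 3.0]])
fitted_scaler = StandardScaler().fit(demo)

# Save to a temporary folder (a stand-in for the model/ directory) and reload.
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_file = os.path.join(tmp_dir, "demo_scaler.pkl")
    joblib.dump(fitted_scaler, scaler_file)
    restored_scaler = joblib.load(scaler_file)

# The restored scaler applies the same transformation as the original.
same_output = np.allclose(
    fitted_scaler.transform(demo), restored_scaler.transform(demo)
)
print(same_output)
```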

Note: We will use the trained scaler in the streamlit web app.

Training Machine Learning Algorithms

MLflow is a great open-source machine learning experimentation package. You can use it to package and deploy Machine learning projects but in this article, we’ll concentrate on its tracking API.

We will use free tracking servers provided by Dagshub so that all MLflow files are saved remotely in the repository and anyone who can access your project will be able to view them.

To send machine learning experiments results to the tracking server, you need to set the tracking URL, your Dagshub username and password as follows.

Note: You just need to copy the remote tracking URL for MLflow in your Dagshub repository.

# using MLflow tracking

mlflow.set_tracking_uri("https://dagshub.com/Davisy/Mobile-Price-ML-Classification-Project.mlflow")

os.environ["MLFLOW_TRACKING_USERNAME"] = "username"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "password"
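
Hard-coding credentials in a script makes them easy to leak when the file is committed. One possible alternative is to export the two variables in your shell and only verify them in Python. The helper name read_tracking_credentials and the demo values below are hypothetical and not part of MLflow or Dagshub:

```python
import os


def read_tracking_credentials(env=os.environ):
    """Return (username, password) from the environment, or raise if one is missing."""
    names = ("MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


# Demonstration with a plain dict so the sketch runs without real credentials.
demo_env = {
    "MLFLOW_TRACKING_USERNAME": "demo-user",
    "MLFLOW_TRACKING_PASSWORD": "demo-secret",
}
username, password = read_tracking_credentials(demo_env)
print(username)
```

Calling the helper with no argument reads the real environment, so the script fails early with a clear message instead of failing later during logging.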

Note: The experiment results will be logged directly to the Dagshub repository under the Experiments tab.

Finally, we need to run some machine learning experiments. First, we split features and target from both train and valid sets.

# load the processed data for both train and valid set

X_train = train_df[train_df.columns[:-1]]
y_train = train_df['price_range']

X_valid = valid_df[valid_df.columns[:-1]]
y_valid = valid_df['price_range']

The first experiment is to train the Random forest algorithm on the train set and check its performance on the validation set.

# train randomforest algorithm

rf_classifier = RandomForestClassifier(n_estimators=200, criterion="gini")

with mlflow.start_run():
    #train the model
    rf_classifier.fit(X_train, y_train)

    #make predictions
    y_pred = rf_classifier.predict(X_valid)

    #check performance
    score = accuracy_score(y_valid, y_pred)

mlflow.end_run()

print(score)

The above block of code will perform the following tasks:

  • Instantiate the Random forest algorithm.
  • Start the MLflow run.
  • Train the machine learning model.
  • Make predictions on the validation set.
  • Check the accuracy of the machine learning model.
  • End the MLflow run.
  • Finally, print the accuracy score of the machine learning model.

The accuracy score is 0.895 for the Random forest algorithm.

Note: We use the autolog function in mlflow.sklearn to automatically keep track of the experiment. This means it will automatically track model parameters, metrics, files and similar information.

You can change the default parameters of the Random forest algorithm to run multiple experiments and find out which values provide the best performance.
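
A parameter sweep of this kind could look like the sketch below. It runs on synthetic data from make_classification rather than the mobile price dataset, leaves out MLflow so it runs anywhere, and the candidate values (50, 100, 200) are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the mobile price features.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=4, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

results = {}
for n_estimators in (50, 100, 200):
    # In the tutorial's setup, wrapping this body in `with mlflow.start_run():`
    # would log each configuration as its own run.
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=1)
    model.fit(X_tr, y_tr)
    results[n_estimators] = accuracy_score(y_va, model.predict(X_va))

# Pick the setting with the highest validation accuracy.
best_n = max(results, key=results.get)
print(results, best_n)
```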

Let’s try to run another experiment using the Logistic Regression algorithm.

# train logistic regression algorithm

lg_classifier = LogisticRegression(penalty='l2', C=1.0)

with mlflow.start_run():
    #train the model
    lg_classifier.fit(X_train, y_train)

    #make predictions
    y_pred = lg_classifier.predict(X_valid)

    #check performance
    score = accuracy_score(y_valid, y_pred)

mlflow.end_run()

print(score)

The accuracy score is 0.97 for Logistic Regression. This machine learning model performs better than the Random forest algorithm.

Here is the list of machine learning experiments recorded on DagsHub under the Experiments tab.

The Experiments tab on Dagshub provides different features for analyzing the experiment results, such as comparing one experiment to another using different metrics.

You also need to track the version of the model by running the following command.

dvc commit -f model.dvc

Register the Best Model with MLflow

We will use the MLflow Model Registry to maintain and manage the versions of the machine learning model. You need to know the run_id of the run that produced the model with the best performance. You can find the run_id by clicking on the experiment name ('Ml-classification-experiment') within the Experiments tab.

In this example, the run_id for the logistic regression model is ‘17ccd85b4c7e491bbdbcba58b5eafae1’. Then you use the register_model() function from MLflow to perform the task.

# Grab the run ID
run_id = '17ccd85b4c7e491bbdbcba58b5eafae1'

# Select a subpath name for the run
subpath = "best_model"

# Select a name for the model to be registered
model_name = "Logistic Regression Model"

# build the run URI
run_uri = f'runs:/{run_id}/{subpath}'

# register the model
model_version = mlflow.register_model(run_uri, model_name)

Output:

Successfully registered model 'Logistic Regression Model'.
2022/11/10 00:22:33 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: Logistic Regression Model, version 1
Created version '1' of model 'Logistic Regression Model'.
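
The URI passed to register_model() follows the pattern runs:/<run_id>/<artifact_path>, where the artifact path should match the folder the model was logged under in that run (you can check it in the run's artifacts). The helper below is a hypothetical convenience for building such a URI and is not part of MLflow:

```python
def build_run_uri(run_id: str, artifact_path: str) -> str:
    """Build a 'runs:/<run_id>/<artifact_path>' URI (hypothetical helper)."""
    if not run_id or not artifact_path:
        raise ValueError("run_id and artifact_path must be non-empty")
    # Strip stray slashes so the URI has exactly one separator.
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


run_uri = build_run_uri("17ccd85b4c7e491bbdbcba58b5eafae1", "best_model")
print(run_uri)
```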

Deploy logged Model in MLflow with Streamlit 

Streamlit is an open-source Python toolkit for building and sharing data science web apps. You can use streamlit to deploy your data science solution in a short period of time with a few lines of code.

Streamlit integrates easily with prominent python libraries like NumPy, Pandas, Matplotlib, Scikit-learn, and others in Data science.

In this part, we are going to deploy the logged model in MLflow (logistic regression model) in order to classify the price range for mobile phones.

Create app.py file

The first step is to create a python file called app.py which will have all the source code to run the data science web app.

Import Packages

Then you need to import packages to run both streamlit and the best trained model.

# import packages
import streamlit as st
import pandas as pd
import numpy as np
from os.path import dirname, join, realpath
import joblib

Create App Title and Description

You can set the header, image and subheader for your data science web app using three Streamlit functions called header(), image() and subheader(), as shown in the code below.

# add banner image
st.header("Mobile Price Prediction")
st.image("images/phones.jpg")
st.subheader(
    """
A simple machine learning app to  classify mobile price range
"""
)

Create a Form to Receive a Mobile’s details

We need a simple form that receives the mobile details in order to make predictions. Streamlit has a function called form() that helps you create a form with different fields such as number, multiple choice, text and others.

# form to collect mobile phone details
my_form = st.form(key="mobile_form")

# function to display the 0/1 options as "No"/"Yes"
# (this tiny pure function needs no caching; st.cache is deprecated in newer Streamlit releases)
def func(value):
    if value == 1:
        return "Yes"
    else:
        return "No"

battery_power = my_form.number_input(
    "Total energy a battery can store in one time measured in mAh", min_value=500
)
blue = my_form.selectbox("Has bluetooth or not", (0, 1), format_func=func)

clock_speed = my_form.number_input(
    "speed at which microprocessor executes instructions", min_value=1
)

dual_sim = my_form.selectbox("Has dual sim support or not", (0, 1), format_func=func)

fc = my_form.number_input(
    "Front Camera mega pixels", min_value=0
)

four_g = my_form.selectbox("Has 4G or not", (0, 1), format_func=func)

int_memory = my_form.number_input(
    "Internal Memory in Gigabytes", min_value=2
)

m_dep = my_form.number_input(
    "Mobile Depth in cm", min_value=0
)

mobile_wt = my_form.number_input(
    "Weight of mobile phone", min_value=80
)

n_cores = my_form.number_input(
    "Number of cores of processor", min_value=1
)
pc = my_form.number_input(
    "Primary Camera mega pixels", min_value=0
)

px_height = my_form.number_input(
    "Pixel Resolution Height", min_value=0
)

px_width = my_form.number_input(
    "Pixel Resolution Width", min_value=0
)

ram = my_form.number_input(
    "Random Access Memory in Mega Bytes", min_value=256
)

sc_h = my_form.number_input(
    "Screen Height of mobile in cm", min_value=5
)

sc_w = my_form.number_input(
    "Screen Width of mobile in cm", min_value=0
)

talk_time = my_form.number_input(
    "longest time that a single battery charge will last when you are", min_value=2
)

three_g = my_form.selectbox("Has 3G or not", (0, 1), format_func=func)

touch_screen = my_form.selectbox("Has touch screen or not", (0, 1), format_func=func)

wifi = my_form.selectbox("Has wifi or not", (0, 1), format_func=func)

submit = my_form.form_submit_button(label="make prediction")

The above block of code contains all the fields to fill in the mobile details and a simple button to submit the details and then make a prediction.

Load logged Model in MLflow and Scaler

Then you need to load both the model logged in MLflow for predictions and the scaler for input transformation. The load() function from the joblib package performs the task.

# load the mlflow registered model and scaler
mlflow_model_path = "mlruns/1/17ccd85b4c7e491bbdbcba58b5eafae1/artifacts/model/model.pkl"
with open(
        join(dirname(realpath(__file__)), mlflow_model_path),
        "rb",
) as f:
    model = joblib.load(f)
scaler_path  = "model/mobile_price_scaler.pkl"
with open(join(dirname(realpath(__file__)), scaler_path ), "rb") as f:
    scaler = joblib.load(f)

Create Result Dictionary

The trained model outputs its prediction as a number (0, 1, 2 or 3). For a better user experience, we can use the following dictionary to present the actual meaning.

# result dictionary
result_dict = {
    0: "Low Cost",
    1: "Medium Cost",
    2: "High Cost",
    3: "Very High Cost",
}
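
If the model ever returns a class that is not in the dictionary, a plain lookup raises a KeyError. A defensive variant could look like the sketch below, where describe_price_range and the fallback text are hypothetical names chosen for illustration:

```python
PRICE_LABELS = {0: "Low Cost", 1: "Medium Cost", 2: "High Cost", 3: "Very High Cost"}


def describe_price_range(predicted_class) -> str:
    """Map a predicted class to its label, with a fallback (hypothetical helper)."""
    return PRICE_LABELS.get(int(predicted_class), "Unknown price range")


print(describe_price_range(3))  # Very High Cost
print(describe_price_range(7))  # Unknown price range
```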

Make Predictions and Show Results

Our last block of code is to make predictions and show results whenever a user adds mobile details and clicks the “make prediction” button on the form section.

After clicking the button, the web app will perform the following tasks:

  • Collect all the inputs (mobile details).
  • Create a dataframe for the inputs.
  • Transform the input using the Scaler.
  • Perform prediction on the transformed inputs.
  • Display the results of the mobile price according to the result dictionary (result_dict).

if submit:
    # collect inputs (named "inputs" to avoid shadowing the built-in input())
    inputs = {
        'battery_power': battery_power,
        'blue': blue,
        'clock_speed': clock_speed,
        'dual_sim': dual_sim,
        'fc': fc,
        'four_g': four_g,
        'int_memory': int_memory,
        'm_dep': m_dep,
        'mobile_wt': mobile_wt,
        'n_cores': n_cores,
        'pc': pc,
        'px_height': px_height,
        'px_width': px_width,
        'ram': ram,
        'sc_h': sc_h,
        'sc_w': sc_w,
        'talk_time': talk_time,
        'three_g': three_g,
        'touch_screen': touch_screen,
        'wifi': wifi,
    }

    # create a dataframe
    data = pd.DataFrame(inputs, index=[0])

    # transform input
    data_scaled = scaler.transform(data)

    # perform prediction
    prediction = model.predict(data_scaled)
    output = int(prediction[0])

    # Display results of the Mobile price prediction
    st.header("Results")
    st.write(" Price range is {} ".format(result_dict[output]))
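
The scaler and the model expect the columns in the same order as during training, so it can help to move the prediction steps into a plain function that is testable outside Streamlit. The sketch below is one way to do that. The function name predict_price_label, the two-feature toy data and the tiny model are all made up for illustration and are not the tutorial's trained model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

LABELS = {0: "Low Cost", 1: "Medium Cost", 2: "High Cost", 3: "Very High Cost"}


def predict_price_label(details, feature_order, scaler, model):
    """Scale one phone's details and return the predicted label (hypothetical helper)."""
    # Reorder columns explicitly so they match the order used during training.
    row = pd.DataFrame([details])[feature_order]
    scaled = scaler.transform(row.to_numpy())
    return LABELS[int(model.predict(scaled)[0])]


# Tiny made-up training set with two features, one phone per price range.
feature_order = ["ram", "battery_power"]
X_demo = np.array(
    [[500, 600], [1500, 1000], [2500, 1400], [3500, 1800]], dtype=float
)
y_demo = np.array([0, 1, 2, 3])

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression(max_iter=1000).fit(
    demo_scaler.transform(X_demo), y_demo
)

# Keys are deliberately given in a different order than feature_order.
label = predict_price_label(
    {"battery_power": 1800, "ram": 3500}, feature_order, demo_scaler, demo_model
)
print(label)
```

In app.py, the body of the if submit: block could then call such a function with the collected inputs, the training column order, and the loaded scaler and model.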

Test the Data Science Web App

We have successfully created a simple web app to deploy the logged model in MLflow and predict the price range.

To run the web app, you need to use the following command in your terminal.

streamlit run app.py

The web app will open in your web browser, or you can access it using the local URL http://localhost:8501.

You need to fill in the mobile details and then click the make prediction button to see the prediction result.

After filling in the mobile details and clicking the make prediction button, the machine learning model predicts that the price range is Very High Cost.

Deploy Streamlit Web App in the Streamlit Cloud

The final step is to make sure the streamlit app is available to anyone who wants to access it and use our machine learning model to predict the mobile price range.

Streamlit cloud allows you to deploy your streamlit web app for free on the cloud. You just need to follow the steps below:

  1. Create a new GitHub Repository on GitHub
  2. Add your streamlit web app (app.py), model folder and requirements.txt.
  3. Create your account on a streamlit cloud platform 
  4. Create a new app and then link your GitHub repository that you created by typing the name of the repository.
  5. Change the streamlit app file name from streamlit_app.py to app.py
  6. Finally, click the Deploy button.

After Streamlit Cloud finishes installing the Streamlit app and all of its dependencies, your application will be live and accessible to anyone with the link provided by Streamlit Cloud.

link: https://davisy--mobile-price-predecition-streamlit-app-app-7clkzd.streamlit.app/

Conclusion

You have gained expertise in data and model tracking with data version control (DVC), as well as tracking machine learning experiments with MLflow and DagsHub. You can share the results of your machine learning experiments with the world, both successful and failed. You have also gained powerful tools that will assist you in efficiently organizing your machine learning project.

In this tutorial, you have learned:

  • How to create your first DagsHub repository.
  • How to track your data using data version control (DVC) and connect to the DagsHub DVC remote.
  • How to automatically track your machine learning experiments using the autolog function from MLflow.
  • How to connect MLflow to a remote tracking server on DagsHub.
  • How to create a Data science web app for your machine learning model using Streamlit.

You can download the source code used in this article here: https://dagshub.com/Davisy/Mobile-Price-ML-Classification-Project

If you learned something new or enjoyed reading this tutorial, please share it so that others can see it. Until then, I'll see you in the next article!

You can also find me on Twitter at @Davis_McDavid.