Mobile Price Classification: An Open Source Data Science Project with DagsHub

Machine learning models are often developed in a training environment, which may be online or offline, and can then be deployed to be used with live data once they have been tested. One of the most critical skills you'll need if you work on data science and machine learning projects is the ability to deploy a model. Model deployment is the process of integrating your model into an existing production environment. The model will receive input and predict an output.

You are going to learn how to manage your machine learning project and deploy a machine learning model into production using the following open-source tools:

1. DagsHub
It is a web platform for data scientists and machine learning engineers to host and version code, data, experiments and machine learning models. It is integrated with other open source tools like:
- Git: tracking source code and other files
- DVC: tracking data and machine learning models
- MLflow: tracking machine learning experiments

2. Streamlit
It is an open-source Python library for creating and sharing web applications for projects in data science and machine learning. The library can assist you in developing and deploying a data science solution in a matter of minutes using only a few lines of code.

This tutorial will cover the following topics:
- Create and manage your machine learning project with DagsHub.
- Build an ML model to classify mobile price ranges.
- Deploy your ML model using Streamlit to create a simple data science web app.

So let's get started.

How to Create a Project using DagsHub

After creating your account on DagsHub, you will be given different options to start creating your first project with DagsHub.
- New Repository: Create a new repository directly on the DagsHub platform.
- Migrate A Repo: Migrate a repository from GitHub to DagsHub.
- Connect a Repo: Connect and manage your repository through both GitHub and DagsHub.

The interface of your new repository on DagsHub should look very similar to the interface of a repository on GitHub. However, there are some additional tabs, such as Experiments, Reports and Annotations.

You can clone and star this repository on DagsHub to follow along throughout the article.

Mobile Price Dataset

We will use the Mobile Price dataset to classify the price range into the categories listed below.
- 0 (low cost)
- 1 (medium cost)
- 2 (high cost)
- 3 (very high cost)

The dataset is available here. A copy, data.csv, is available in the Data folder. We will be splitting the dataset into train and test dataframes for training and validation.

Packages Installation

In this project, we will use the following Python packages.
- Pandas for data manipulation.
- sklearn for training machine learning algorithms.
- MLflow for tracking machine learning experiments.
- DVC (data version control) for tracking and versioning datasets and machine learning models.
- Joblib for saving and loading machine learning models.
- Streamlit for deploying the machine learning model in a web app.

All these packages are listed in the requirements.txt file. Install these packages by running the following command in your terminal.

    pip install -r requirements.txt

Import Python Packages

After installing all packages, you need to import the packages before starting to use them.
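Before running the imports, a quick check that every package can be found may save a confusing traceback later. The sketch below is an illustration and not part of the original project: the helper name find_missing_packages is hypothetical, the module list is assumed from the packages named in this tutorial, and the check relies only on importlib.util.find_spec from the standard library.

```python
from importlib.util import find_spec

# Import names assumed from this tutorial (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "mlflow", "dvc", "joblib", "streamlit"]


def find_missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages are installed.")
```

If the list comes back non-empty, rerunning the pip command above inside the same environment is usually the first thing to try.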
    # import packages
    import pandas as pd
    import numpy as np
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    import mlflow
    import mlflow.sklearn
    mlflow.sklearn.autolog()  # set autolog for sklearn
    mlflow.set_experiment('Ml-classification-experiment')
    import joblib
    import json
    import os

    np.random.seed(1234)

Note: With MLflow, you can automatically track machine learning experiments by using a function called autolog() from the mlflow.sklearn module.

Load and Version the Mobile Price Dataset

    raw_data = pd.read_csv("data/raw/data.csv")

Data version control (DVC) is an open-source solution that allows you to track changes to your machine learning project's data as well as its models. Once you have created your account, DagsHub will provide you with 10 GB of free storage for DVC.

Within each repository, DagsHub will automatically generate a remote storage link as well as a list of commands to get your data tracking process started.

Run the following command to add the DagsHub DVC remote.

    dvc remote add origin https://dagshub.com/Davisy/Mobile-Price-ML-Classification-Project.dvc

Note: The above command adds the repository as the remote for the DVC storage. The URL for your own repository will be slightly different from the one shown here.

Then you can start tracking the dataset with the following command.

    dvc commit -f data/raw.dvc

Let's check the shape of the dataset.

    print(raw_data.shape)

The dataset contains 21 columns (20 features and 1 target), and luckily it has no missing values.

Split the mobile price data into features and target. The target column is called "price_range".
    target = raw_data.price_range.values
    features = raw_data.drop(['price_range'], axis=1)

Data Preprocessing

The features must be standardized before they are passed to the machine learning algorithms. We will use StandardScaler from scikit-learn to perform the task.

    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)

The next step is to split the data into train and validation sets. 20% of the mobile price dataset will be used for validation.

    X_train, X_valid, y_train, y_valid = train_test_split(
        features_scaled, target, test_size=0.2, stratify=target, random_state=1
    )

Here is a sample of the train set (first row of X_train).

    print(X_train[0])

    [ 1.56947055 -0.9900495   1.32109556 -1.01918398  0.15908825 -1.04396559
     -1.49088996  1.03435682  0.61459469  0.20963905  1.00341448 -0.93787756
     -0.57283137 -1.3169798   0.40204724  1.43112714  0.73023981  0.55964063
      0.99401789  0.98609664]

We need to track the processed data with DVC for efficiency and reproducibility. First, we create dataframes for both the train set and the valid set, and then save them in a processed folder as shown in the block of code below.

    # create a dataframe for the train set
    X_train_df = pd.DataFrame(X_train, columns=list(features.columns))
    y_train_df = pd.DataFrame(y_train, columns=["price_range"])

    # combine features and target for the train set
    train_df = pd.concat([X_train_df, y_train_df], axis=1)

    # create a dataframe for the valid set
    X_valid_df = pd.DataFrame(X_valid, columns=list(features.columns))
    y_valid_df = pd.DataFrame(y_valid, columns=["price_range"])

    # combine features and target for the valid set
    valid_df = pd.concat([X_valid_df, y_valid_df], axis=1)

    # save processed train and valid sets
    train_df.to_csv('data/processed/data_train.csv', index_label='Index')
    valid_df.to_csv('data/processed/data_valid.csv', index_label='Index')

Then run the following command to track the processed data (train and valid sets).
    dvc commit -f process_data.dvc

Finally, we can save the trained standard scaler by using the dump method from the joblib package.

    # save the trained scaler
    joblib.dump(scaler, 'model/mobile_price_scaler.pkl')

Note: We will use the trained scaler in the Streamlit web app.

Training Machine Learning Algorithms

MLflow is a great open-source machine learning experimentation package. You can use it to package and deploy machine learning projects, but in this article we'll concentrate on its tracking API.

We will use the free tracking server provided by DagsHub so that all MLflow files are saved remotely in the repository and anyone who can access your project will be able to view them.

To send machine learning experiment results to the tracking server, you need to set the tracking URL and your DagsHub username and password as follows.

Note: You just need to copy the remote tracking URL for MLflow from your DagsHub repository.

    # using MLflow tracking
    mlflow.set_tracking_uri("https://dagshub.com/Davisy/Mobile-Price-ML-Classification-Project.mlflow")
    os.environ["MLFLOW_TRACKING_USERNAME"] = "username"
    os.environ["MLFLOW_TRACKING_PASSWORD"] = "password"

Note: The experiment results will be logged directly to the DagsHub repository under the Experiments tab.

Finally, we need to run some machine learning experiments. First, we split features and target from both train and valid sets.

    # load the processed data for both train and valid set
    X_train = train_df[train_df.columns[:-1]]
    y_train = train_df['price_range']
    X_valid = valid_df[valid_df.columns[:-1]]
    y_valid = valid_df['price_range']

The first experiment is to train the Random forest algorithm on the train set and check its performance on the valid set.
    # train randomforest algorithm
    rf_classifier = RandomForestClassifier(n_estimators=200, criterion="gini")

    with mlflow.start_run():
        # train the model
        rf_classifier.fit(X_train, y_train)

        # make predictions
        y_pred = rf_classifier.predict(X_valid)

        # check performance
        score = accuracy_score(y_valid, y_pred)

        # end the run (leaving the with block would also end it)
        mlflow.end_run()

    print(score)

The above block of code will perform the following tasks:
- Instantiate the Random forest algorithm.
- Start the MLflow run.
- Train the machine learning model.
- Make predictions on the validation set.
- Check the accuracy of the machine learning model.
- End the MLflow run.
- Finally, print the accuracy score of the machine learning model.

The accuracy score is 0.895 for the Random forest algorithm.

Note: We use the autolog function in mlflow.sklearn to automatically keep track of the experiment. This means it will automatically track model parameters, metrics, files and similar information.

You can change the default parameters of the Random forest algorithm to run multiple experiments and find out which values provide the best performance.

Let's try to run another experiment using the Logistic Regression algorithm.

    # train logistic regression algorithm
    lg_classifier = LogisticRegression(penalty='l2', C=1.0)

    with mlflow.start_run():
        # train the model
        lg_classifier.fit(X_train, y_train)

        # make predictions
        y_pred = lg_classifier.predict(X_valid)

        # check performance
        score = accuracy_score(y_valid, y_pred)

        # end the run (leaving the with block would also end it)
        mlflow.end_run()

    print(score)

The accuracy score is 0.97 for Logistic Regression. This machine learning model performs better than the Random forest algorithm.

Both experiments are recorded on DagsHub under the Experiments tab.

The Experiments tab on DagsHub provides different features to analyze the experiment results, such as comparing one experiment to another using different metrics.

You also need to track the version of the model by running the following command.
    dvc commit -f model.dvc

Register the Best Model with MLflow

We will use the MLflow registry to maintain and manage the versions of the machine learning model. You need to know the run_id of the run that produced the model with the best performance.

You can find the run_id by clicking on the experiment name ('Ml-classification-experiment') within the Experiments tab.

In this example, the run_id for the logistic regression model is '17ccd85b4c7e491bbdbcba58b5eafae1'. Then you use the register_model() function from MLflow to perform the task.

    # Grab the run ID
    run_id = '17ccd85b4c7e491bbdbcba58b5eafae1'

    # Select a subpath name for the run
    subpath = "best_model"

    # Select a name for the model to be registered
    model_name = "Logistic Regression Model"

    # build the run URI
    run_uri = f'runs:/{run_id}/{subpath}'

    # register the model
    model_version = mlflow.register_model(run_uri, model_name)

Output:

    Successfully registered model 'Logistic Regression Model'.
    2022/11/10 00:22:33 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: Logistic Regression Model, version 1
    Created version '1' of model 'Logistic Regression Model'.

Deploy logged Model in MLflow with Streamlit

Streamlit is an open-source Python toolkit for building and sharing data science web apps. You can use Streamlit to deploy your data science solution in a short period of time with a few lines of code.

Streamlit integrates easily with prominent Python libraries in data science, such as NumPy, Pandas, Matplotlib, Scikit-learn, and others.

In this part, we are going to deploy the logged model in MLflow (logistic regression model) in order to classify the price range for mobile phones.

Create app.py file

The first step is to create a Python file called app.py which will have all the source code to run the data science web app.

Import Packages

Then you need to import packages to run both Streamlit and the best trained model.
    # import packages
    import streamlit as st
    import pandas as pd
    import numpy as np
    from os.path import dirname, join, realpath
    import joblib

Create App Title and Description

You can set the header, image and subheader for your data science web app using three different methods from Streamlit called header(), image() and subheader(), as shown in the code below.

    # add banner image
    st.header("Mobile Price Prediction")
    st.image("images/phones.jpg")
    st.subheader(
        """
        A simple machine learning app to classify mobile price range
        """
    )

Create a Form to Receive a Mobile's details

We need a simple form that will receive mobile details in order to make predictions. Streamlit has a method called form() that can help you create a form with different fields such as number, multiple choice, text and others.

    # form to collect mobile phone details
    my_form = st.form(key="mobile_form")

    # function to transform Yes and No options
    # note: st.cache is deprecated in newer Streamlit releases in favor of st.cache_data
    @st.cache
    def func(value):
        if value == 1:
            return "Yes"
        else:
            return "No"

    battery_power = my_form.number_input(
        "Total energy a battery can store in one time measured in mAh", min_value=500
    )
    blue = my_form.selectbox("Has bluetooth or not", (0, 1), format_func=func)
    clock_speed = my_form.number_input(
        "speed at which microprocessor executes instructions", min_value=1
    )
    dual_sim = my_form.selectbox("Has dual sim support or not", (0, 1), format_func=func)
    fc = my_form.number_input("Front Camera mega pixels", min_value=0)
    four_g = my_form.selectbox("Has 4G or not", (0, 1), format_func=func)
    int_memory = my_form.number_input("Internal Memory in Gigabytes", min_value=2)
    m_dep = my_form.number_input("Mobile Depth in cm", min_value=0)
    mobile_wt = my_form.number_input("Weight of mobile phone", min_value=80)
    n_cores = my_form.number_input("Number of cores of processor", min_value=1)
    pc = my_form.number_input("Primary Camera mega pixels", min_value=0)
    px_height = my_form.number_input("Pixel Resolution Height", min_value=0)
    px_width = my_form.number_input("Pixel Resolution Width", min_value=0)
    ram = my_form.number_input("Random Access Memory in Mega Bytes", min_value=256)
    sc_h = my_form.number_input("Screen Height of mobile in cm", min_value=5)
    sc_w = my_form.number_input("Screen Width of mobile in cm", min_value=0)
    talk_time = my_form.number_input(
        "longest time that a single battery charge will last when you are", min_value=2
    )
    three_g = my_form.selectbox("Has 3G or not", (0, 1), format_func=func)
    touch_screen = my_form.selectbox("Has touch screen or not", (0, 1), format_func=func)
    wifi = my_form.selectbox("Has wifi or not", (0, 1), format_func=func)

    submit = my_form.form_submit_button(label="make prediction")

The above block of code contains all the fields to fill in the mobile details and a simple button to submit the details and then make a prediction.

Load logged Model in MLflow and Scaler

Then you need to load both the logged model in MLflow for predictions and the scaler for input transformation. The load() method from the joblib package will perform the task.

    # load the mlflow registered model and scaler
    mlflow_model_path = "mlruns/1/17ccd85b4c7e491bbdbcba58b5eafae1/artifacts/model/model.pkl"
    with open(
        join(dirname(realpath(__file__)), mlflow_model_path),
        "rb",
    ) as f:
        model = joblib.load(f)

    scaler_path = "model/mobile_price_scaler.pkl"
    with open(join(dirname(realpath(__file__)), scaler_path), "rb") as f:
        scaler = joblib.load(f)

Create Result Dictionary

The trained model will predict the output as a number (0, 1, 2 or 3). For a better user experience, we can use the following dictionary to present the actual meaning.

    # result dictionary
    result_dict = {
        0: "Low Cost",
        1: "Medium Cost",
        2: "High Cost",
        3: "Very High Cost",
    }

Make Predictions and Show Results

Our last block of code is to make predictions and show results whenever a user adds mobile details and clicks the "make prediction" button on the form section.
After clicking the make prediction button, the web app will perform the following tasks:
- Collect all the inputs (mobile details).
- Create a dataframe for the inputs.
- Transform the input using the scaler.
- Perform prediction on the transformed inputs.
- Display the results of the mobile price according to the result dictionary (result_dict).

    if submit:
        # collect inputs
        input_data = {
            'battery_power': battery_power,
            'blue': blue,
            'clock_speed': clock_speed,
            'dual_sim': dual_sim,
            'fc': fc,
            'four_g': four_g,
            'int_memory': int_memory,
            'm_dep': m_dep,
            'mobile_wt': mobile_wt,
            'n_cores': n_cores,
            'pc': pc,
            'px_height': px_height,
            'px_width': px_width,
            'ram': ram,
            'sc_h': sc_h,
            'sc_w': sc_w,
            'talk_time': talk_time,
            'three_g': three_g,
            'touch_screen': touch_screen,
            'wifi': wifi,
        }

        # create a dataframe
        data = pd.DataFrame(input_data, index=[0])

        # transform input
        data_scaled = scaler.transform(data)

        # perform prediction
        prediction = model.predict(data_scaled)
        output = int(prediction[0])

        # Display results of the Mobile price prediction
        st.header("Results")
        st.write(" Price range is {} ".format(result_dict[output]))

Test the Data Science Web App

We have successfully created a simple web app to deploy the logged model in MLflow and predict the price range. To run the web app, you need to use the following command in your terminal.

    streamlit run app.py

The web app will then appear instantly in your web browser, or you can access it using the local URL http://localhost:8501.

You need to fill in the mobile details and then click the make prediction button to see the prediction result.

After filling in the mobile details and clicking the make prediction button, the machine learning model predicts that the price range is Very High Cost.

Deploy Streamlit Web App in the Streamlit Cloud

The final step is to make sure the Streamlit app is available to anyone who wants to access it and use our machine learning model to predict the mobile price range.
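Before making the app public, it can be worth exercising the prediction steps (transform with the scaler, predict, map the number through the result dictionary) outside of Streamlit. The sketch below is only an illustration under stated assumptions: predict_price_label, StubScaler and StubModel are hypothetical names that do not appear in the project, and the stubs merely stand in for the fitted scaler and trained model so the example runs without any saved files.

```python
# Same mapping as the result dictionary used in the app.
RESULT_DICT = {0: "Low Cost", 1: "Medium Cost", 2: "High Cost", 3: "Very High Cost"}


def predict_price_label(model, scaler, rows):
    """Scale the input rows, predict a class for the first row, and map it to a label."""
    scaled = scaler.transform(rows)
    prediction = model.predict(scaled)
    return RESULT_DICT[int(prediction[0])]


class StubScaler:
    """Hypothetical stand-in for the fitted scaler; returns the rows unchanged."""

    def transform(self, rows):
        return rows


class StubModel:
    """Hypothetical stand-in for the trained classifier; always predicts class 3."""

    def predict(self, rows):
        return [3 for _ in rows]


# One input row with 20 feature values, matching the 20 form fields.
label = predict_price_label(StubModel(), StubScaler(), [[0.0] * 20])
print(label)
```

In the real app, the model and scaler loaded with joblib would be passed in place of the stubs, since the function only relies on their transform() and predict() methods.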
Streamlit cloud allows you to deploy your Streamlit web app for free on the cloud. You just need to follow the steps below:
1. Create a new repository on GitHub.
2. Add your Streamlit web app (app.py), model folder and requirements.txt.
3. Create your account on the Streamlit cloud platform.
4. Create a new app and then link the GitHub repository that you created by typing the name of the repository.
5. Change the Streamlit app file name from streamlit_app.py to app.py.
6. Finally, click the Deploy button.

After Streamlit cloud finishes installing the Streamlit app and all of its prerequisites, your application will be live and accessible to anyone with the link provided by Streamlit cloud.

link: https://davisy--mobile-price-predecition-streamlit-app-app-7clkzd.streamlit.app/

Conclusion

You have gained expertise in data and model tracking with data version control (DVC), as well as tracking machine learning experiments with MLflow and DagsHub. You can share the results of your machine learning experiments with the world, both successful and failed.

You have also gained powerful tools that will assist you in efficiently organizing your machine learning project. In this tutorial, you have learned:
- How to create your first DagsHub repository.
- How to track your data using data version control (DVC) and connect to the DagsHub DVC remote.
- How to automatically track your machine learning experiments using the autolog function from MLflow.
- How to connect MLflow to a remote tracking server on DagsHub.
- How to create a data science web app for your machine learning model using Streamlit.

You can download the source code used in this article here: https://dagshub.com/Davisy/Mobile-Price-ML-Classification-Project

If you learned something new or enjoyed reading this tutorial, please share it so that others can see it. Until then, I'll see you in the next article!

You can also find me on Twitter at @Davis_McDavid.