Linear Regression is one of the oldest and most widely used Machine Learning algorithms. It trains a model on the relationship between two variables: an independent variable and a dependent (target) variable. If you wish to learn more about AI and Machine Learning, you may see my blog on Artificial Intelligence.

In this project, I will be training a model to predict sports sustainability. Sustainability in sports means conducting a sporting event using environmentally friendly methods that reduce its negative impact on the environment. Like every industry, sports has a supply chain problem. When enjoying sports like cricket or football, we tend to forget about the environment. But we need to think beyond the tournament: who is making the players' kits and boots? Where is the water feeding the pitch coming from? How was the stadium built, and how is it maintained? What is the impact of major tournaments like the Champions League or the Olympics, where hastily erected stadiums and hundreds of thousands of fans take over local areas? Moreover, are certain actions, like disposable cups in stadiums or the use of recycled fibres in kits, simply a sticking plaster over much wider issues across the entire supply chain?

For this project, we have a dataset containing the number of suppliers of sports goods and the corresponding carbon emissions from them (in metric tons). I am going to use Scikit-learn for training and evaluating the model. Before building the model, we need to analyze the data for errors or ambiguities and identify the trends in it. I will be using Pandas and NumPy for data analysis and Matplotlib as the plotting library for graphs and charts. So, without wasting any further time, let's jump right in.

## Bootstrapping

First, let's install and import the required packages as shown in the code block below.

```python
%pip install pandas numpy matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

## Data Analysis with Pandas

With the project dependencies installed, let's import our dataset using pandas. In the first line, I create the dataframe by importing the dataset with `pd.read_csv()`, and in the second line, I drop the year column, as it does not provide any relevant information for our model. This is an example dataset with only 16 data points, but production models are trained on much larger datasets containing hundreds or thousands of data points.

Then, we get the columns of the dataset using `df.columns` and its shape using `df.shape`.

Next, we get information about the dataset using `df.info()`. It gives us the columns of the dataset, the non-null count (the number of values in each column that are not null), and the data type of the values in each column.

Then, I analyze the count, mean, median, standard deviation, min, max, and several percentiles of the data using `df.describe()`, and finish with a scatter plot of the data points using pandas, as sketched below.
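The original post embeds these notebook cells rather than showing them inline, so here is a minimal sketch of what they likely do. The file name `sports-sustainability.csv` and the column names `Year`, `Number of Suppliers`, and `Carbon Emissions from Suppliers (metric tons)` are assumptions inferred from the surrounding text, not confirmed by the source.

```python
import pandas as pd

# Assumed file and column names -- adjust to match the actual dataset.
df = pd.read_csv("sports-sustainability.csv")
df = df.drop(columns=["Year"])  # the year adds no signal for this model

print(df.columns)     # column labels
print(df.shape)       # (rows, columns) -- here (16, 2)
df.info()             # dtypes and non-null counts per column
print(df.describe())  # count, mean, std, min, percentiles, max

# Scatter plot of the two columns, using pandas' built-in plotting
df.plot.scatter(x="Number of Suppliers",
                y="Carbon Emissions from Suppliers (metric tons)")
```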
The scatter plot gives us a first look at the trend between the number of suppliers and their carbon emissions.

## Model Training

Now that we have finished analyzing the data, we can begin model training. Before starting, we import the required modules.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import joblib
```

Next, we initialize two variables: `x`, our independent variable, and `Y`, our dependent (target) variable. `x` contains the feature, `Number of Suppliers`, and `Y` contains the target we need to predict, `Carbon Emissions from Suppliers (metric tons)`.

Now, we split our dataset into training data and testing data. We convert `x` to a numpy array and then assign one part to `x_train`, our training data from the `x` variable, and another part to `y_train`, our training data from the `Y` variable. Similarly, we create `x_test` and `y_test`, which hold our testing data. The test size is `0.2`, or 20% of the original dataset.

Finally, it's time to fit the data to the model. To do so, we first initialize the `LinearRegression()` class imported from `scikit-learn` and assign it to the variable `model`. Then we call `model.fit()` on the training data, and our model is ready to use.

Now, let's make predictions using the model built in the previous step. We pass the `Number of Suppliers` as a numpy array to `model.predict()`, and we get the `Carbon Emissions` in metric tons as output.

## Model Evaluation

We have successfully built our model from scratch in the previous step. But now we need to evaluate our model's accuracy and performance, which is a crucial step in the Machine Learning lifecycle. In the model training stage, we created `x_test` and `y_test`, and we will use them to test the model.

I create an array named `y_pred`, which contains the predicted values for the data points in `x_test`.

For evaluation, we will look at the model's score, intercept, coefficient, and R² score. The model's score is 99.671% at the time of publishing this blog, which is quite good; the R² score is similarly high. A sketch of the training and evaluation steps appears below.
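The training and evaluation cells are also embedded in the published post; the following is a minimal sketch of those steps under the same assumed column names as above. The reshaping, the `random_state`, and the example input of 40 suppliers are my assumptions for illustration, not taken from the source.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

# Feature and target (column names assumed, as in the earlier sketch)
x = df["Number of Suppliers"].to_numpy().reshape(-1, 1)  # 2-D feature array
Y = df["Carbon Emissions from Suppliers (metric tons)"].to_numpy()

# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, Y, test_size=0.2, random_state=42
)

# Initialize and fit the model
model = LinearRegression()
model.fit(x_train, y_train)

# Predict emissions for a supplier count (input must be 2-D);
# 40 is a hypothetical example input
print(model.predict(np.array([[40]])))

# Evaluate on the held-out test set
y_pred = model.predict(x_test)
print("score:", model.score(x_test, y_test))  # R^2 on the test set
print("intercept:", model.intercept_)
print("coefficient:", model.coef_)
print("R2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
```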
Now, I am going to compare `y_test`, the actual values of the data points, with `y_pred`, the predicted values, by plotting a graph of actual vs. predicted values.

## Saving the Model

Now, we need to save our trained model for future use. We will pickle the model using the `joblib` package.

```python
def save_model(model):
    joblib.dump(model, 'model.jlib')

save_model(model)
```

Now that we have saved our model, let's try it out by loading it as `saved_model` and making predictions with it.

```python
saved_model = joblib.load('model.jlib')
```

As we can see, the saved model gives the same output as the original model.

## Final Thoughts

In this blog, I demonstrated how to build a Linear Regression model to predict sustainability in sports. We used Pandas and Matplotlib to analyze the data and then used Scikit-learn to train the model. After model evaluation, the last step is model deployment. I will demonstrate deploying ML models on Hugging Face Spaces using the Gradio framework in another blog. I have already deployed this model, so if you want to try it out, you may check out the link below.

Deployed Model - https://jishnupsamal-sports-sustainability.hf.space

Originally published at jishnupsamal.ml