Linear Regression Explained With Sklearn

by Sahil, September 13th, 2023

Too Long; Didn't Read

This article covers the basics and math of linear regression, followed by hands-on coding with a Kaggle house price dataset.

Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. In simple terms, it helps us understand how changes in the independent variables are associated with changes in the dependent variable.

Linear regression is important in machine learning and data analysis. It is a foundational tool for prediction, forecasting, and understanding the underlying patterns in data. It offers a simple yet powerful framework for making predictions from historical data, and it serves as the basis for more complex models and techniques, so it is a crucial concept for anyone working with data.

What Is Linear Regression?

The basic idea behind linear regression is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the difference between the predicted values and the actual values of a dependent variable based on one or more independent variables.

In simple linear regression, there is one dependent variable (Y) and one independent variable (X). The relationship between them is modeled as:

Y = Beta_0 + Beta_1 * X + epsilon

Here,

Y represents the dependent variable (the one we want to predict).
X represents the independent variable (the one used to make predictions).
Beta_0 is the intercept (the point where the line crosses the Y-axis).
Beta_1 is the slope (the rate of change of Y with respect to X).
epsilon represents the error term, which accounts for the variability in Y that is not explained by X.

The goal of linear regression is to find the values of Beta_0 and Beta_1 that minimize the sum of squared errors, where each error is the vertical distance between an actual data point and the value the line predicts for it.
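For a single independent variable, the values of Beta_0 and Beta_1 that minimize the sum of squared errors have a well-known closed form. The following is a minimal sketch using NumPy; the toy data is made up for illustration and is not taken from the housing dataset.

```python
import numpy as np

# Toy data, made up for illustration: y is exactly 2*x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Closed-form least-squares estimates for one independent variable
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta_0 = y_mean - beta_1 * x_mean  # intercept

# Sum of squared errors of the fitted line
sse = np.sum((y - (beta_0 + beta_1 * x)) ** 2)

print(f'Beta_0: {beta_0}, Beta_1: {beta_1}, SSE: {sse}')
```

Because the toy points lie exactly on a line, the recovered slope and intercept match the generating values and the sum of squared errors is zero. With real, noisy data the same formulas apply, but the errors will not vanish.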

In multiple linear regression, there are multiple independent variables. The relationship is extended as:

Y = Beta_0 + Beta_1 * X_1 + Beta_2 * X_2 + … + Beta_n * X_n + epsilon

Here, X_1, X_2, …, X_n represent the independent variables and Beta_1, Beta_2, …, Beta_n are their respective coefficients.
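With several independent variables, the coefficients can be estimated by solving a least-squares problem on a matrix of features. The sketch below uses NumPy's np.linalg.lstsq on made-up toy data; the column of ones is added so that the first coefficient plays the role of the intercept Beta_0.

```python
import numpy as np

# Toy data, made up for illustration: y = 1 + 2*x1 + 3*x2 exactly
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as Beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta)  # order: Beta_0, Beta_1, Beta_2
```

This is one way to compute the fit by hand; sklearn's LinearRegression handles the intercept for you, so no column of ones is needed there.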

Linear regression lets us quantify the relationship between the dependent variable and the independent variable(s): each coefficient estimates how much the dependent variable changes when the corresponding independent variable increases by one unit, with the other variables held fixed. This is crucial for making predictions, drawing insights, and making informed decisions based on data.

Now, let's start implementing simple linear regression and multiple linear regression using sklearn. In this article, I'll use the California House Price Dataset from Kaggle.

Simple Linear Regression with Sklearn

To demonstrate simple linear regression using the sklearn library, we'll use a California house price prediction dataset from Kaggle.

Importing Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Load the Dataset

data = pd.read_csv('california_housing.csv')

# Explore the dataset
print(data.head())

Prepare the Data

For this demonstration, we'll use the 'median_income' column as the independent variable (X) and the 'median_house_value' column as the dependent variable (y).

X = data[['median_income']]
y = data['median_house_value']

Split the Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Initialize and Train the Model

# Initialize the model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

Make Predictions

# Use the trained model to make predictions on the test set
y_pred = model.predict(X_test)

Evaluate the Model

# Calculate the Mean Squared Error (MSE) to evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# -------------- OUTPUT -------------- #
Mean Squared Error: 7091157771.76555
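Beyond the error score, the fitted line itself can be inspected: a fitted LinearRegression exposes its slope(s) as coef_ and its intercept as intercept_, which correspond to Beta_1 and Beta_0. The sketch below uses synthetic stand-in data rather than the Kaggle file, so the generating numbers (45,000 and 40,000) and the variable names are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the Kaggle dataset):
# value = 45000 + 40000 * income + random noise
rng = np.random.default_rng(42)
income = rng.uniform(1.0, 10.0, size=500)
value = 45_000 + 40_000 * income + rng.normal(0.0, 5_000, size=500)

demo_model = LinearRegression()
demo_model.fit(income.reshape(-1, 1), value)  # sklearn expects a 2-D feature array

slope = demo_model.coef_[0]        # estimate of Beta_1
intercept = demo_model.intercept_  # estimate of Beta_0
print(f'Intercept: {intercept:.0f}, slope: {slope:.0f}')
```

On the real dataset, the same two attributes of the trained model would show how much the predicted 'median_house_value' changes per unit of 'median_income'.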

Multiple Linear Regression with Sklearn

Multiple linear regression extends the concept of simple linear regression to incorporate multiple independent variables. This allows us to model relationships between a dependent variable and two or more independent variables. Let's use the California housing dataset for this demonstration.

Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Load Dataset

data = pd.read_csv('california_housing.csv')

# Explore the dataset
print(data.head())

Prepare the Data

For this demonstration, we'll use 'median_income', 'total_rooms', and 'population' as the independent variables (X) and 'median_house_value' as the dependent variable (y).

X = data[['median_income', 'total_rooms', 'population']]
y = data['median_house_value']

Splitting Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Initialize and Train the Model

# Initialize the model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

Make Predictions

# Use the trained model to make predictions on the test set
y_pred = model.predict(X_test)

Evaluate the Model

# Calculate the Mean Squared Error (MSE) to evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# ---------------- OUTPUT -------------- #
Mean Squared Error: 7056856633.523101
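Raw MSE values this large are hard to compare by eye. As a rough check, the snippet below takes the square root of the two MSE values printed in the outputs above, which puts the error back in the same units as 'median_house_value', and computes the relative change between the two models.

```python
import math

# MSE values copied from the two outputs above
mse_simple = 7091157771.76555
mse_multiple = 7056856633.523101

# RMSE is in the same units as the target variable
rmse_simple = math.sqrt(mse_simple)
rmse_multiple = math.sqrt(mse_multiple)

# Fraction by which the MSE dropped after adding the extra features
relative_drop = (mse_simple - mse_multiple) / mse_simple

print(f'RMSE (simple):   {rmse_simple:,.0f}')
print(f'RMSE (multiple): {rmse_multiple:,.0f}')
print(f'Relative MSE reduction: {relative_drop:.2%}')
```

The RMSE works out to roughly 84,209 for the single-feature model and 84,005 for the three-feature model, an MSE reduction of about 0.48%. This suggests that 'total_rooms' and 'population' add little beyond 'median_income' in this particular setup.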

I used the Mean Squared Error metric to evaluate both models, but there are several other evaluation metrics you can use for a model like this. Let's now look at the most common ones.

Model Evaluation and Performance Metrics

Evaluating the performance of a linear regression model is crucial for understanding how well it fits the data and how accurate its predictions are. Here are some common metrics used for this purpose:

Mean Absolute Error (MAE):

Definition: The Mean Absolute Error is the average of the absolute differences between the predicted and actual values. It measures the average magnitude of errors.
Significance: MAE provides a straightforward and easy-to-interpret measure of model performance. It's expressed in the same units as the target variable, making it intuitive to understand.

Mean Squared Error (MSE):

Definition: The Mean Squared Error is the average of the squared differences between the predicted and actual values. It gives more weight to large errors compared to MAE.
Significance: MSE penalizes larger errors more heavily. It is useful when larger errors are considered more problematic (e.g., in situations where outliers have a significant impact on the model's performance).

Root Mean Squared Error (RMSE):

Definition: The Root Mean Squared Error is the square root of the MSE. It provides a measure of the average magnitude of the errors, but in the same units as the dependent variable.
Significance: RMSE is particularly useful because it is in the same units as the target variable. This makes it easy to interpret and compare with the original data.

R-squared (R²):

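Definition: R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables. For a least-squares model with an intercept, it ranges from 0 to 1 on the training data, where 1 indicates a perfect fit; on held-out data it can be negative when the model predicts worse than simply using the mean.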
Significance: R-squared provides an overall indication of how well the model explains the variability in the data. A higher R-squared value indicates a better fit. However, it doesn't tell you about the magnitude or direction of individual errors.

Adjusted R-squared:

Definition: Adjusted R-squared is an adjusted version of R-squared that takes into account the number of independent variables in the model. It penalizes the inclusion of unnecessary variables.
Significance: It provides a fairer measure of the model's explanatory power when there are multiple independent variables. Because it accounts for model complexity, it discourages adding variables that do not genuinely improve the fit.

Interpretation:

MAE: "On average, our model's predictions are off by X units."
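MSE: "The average squared prediction error of our model is X (in squared units of the target)."
RMSE: "A typical prediction error of our model is about X units, with large errors weighted more heavily."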
R-squared: "X% of the variance in the dependent variable is explained by the independent variables in our model."
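The metrics described above can be computed with sklearn.metrics and a little NumPy. This is a minimal sketch on made-up toy values; n_features is a hypothetical count of independent variables, needed only for the adjusted R-squared formula, which sklearn does not provide as a built-in function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values, made up for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])
n_features = 1  # hypothetical number of independent variables in the model

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the units of the target
r2 = r2_score(y_true, y_pred)

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n = len(y_true)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(f'MAE: {mae}, MSE: {mse}, RMSE: {rmse:.4f}')
print(f'R-squared: {r2:.4f}, Adjusted R-squared: {adjusted_r2:.4f}')
```

In the article's workflow, y_true and y_pred would be replaced by y_test and the model's predictions, and n_features by the number of columns in X.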

Choosing the appropriate evaluation metric depends on the specific problem and the importance of different types of errors. For instance, in some cases, penalizing large errors heavily (MSE/RMSE) might be more critical, while in others, a clear and intuitive interpretation (MAE) might be preferred. R-squared provides a broad overview of the model's fit, but it's important to consider other metrics in conjunction with it.