Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. In simple terms, it helps us understand how changes in the independent variables are associated with changes in the dependent variable.
Linear regression is central to machine learning and data analysis. It is a foundational tool for prediction, forecasting, and understanding the underlying patterns in data, and it provides a simple yet powerful framework for making predictions from historical data. It also serves as a basis for more complex models and techniques in machine learning, so it is a crucial concept for anyone working with data.
The basic idea behind linear regression is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the difference between the predicted values and the actual values of a dependent variable based on one or more independent variables.
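In simple linear regression, there is one dependent variable (Y) and one independent variable (X). The relationship between them is modeled as:
Y = Beta_0 + Beta_1 * X + epsilon
Here, Beta_0 is the intercept (the value of Y when X is 0), Beta_1 is the slope (the change in Y for a one-unit change in X), and epsilon is the error term that captures what the line does not explain.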
The goal of linear regression is to find the values of Beta_0 and Beta_1 that minimize the sum of squared errors, where each error is the vertical distance between an actual data point and the predicted value on the line.
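For a single independent variable, the values of Beta_0 and Beta_1 that minimize the sum of squared errors have a well-known closed form. The sketch below is a minimal illustration in plain Python; the function names and the toy data are made up for this example and are not part of the article's dataset.

```python
def fit_simple_ols(xs, ys):
    """Return (beta_0, beta_1) that minimize the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1


def sum_squared_errors(xs, ys, beta_0, beta_1):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))


# Toy data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

beta_0, beta_1 = fit_simple_ols(xs, ys)
print(f'Beta_0 = {beta_0:.2f}, Beta_1 = {beta_1:.2f}')
print(f'SSE at the fit: {sum_squared_errors(xs, ys, beta_0, beta_1):.4f}')
```

Nudging either coefficient away from the fitted values makes the sum of squared errors larger, which is what "best-fitting" means here.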
In multiple linear regression, there are multiple independent variables. The relationship is extended as:
Y = Beta_0 + Beta_1 * X_1 + Beta_2 * X_2 + … + Beta_n * X_n + epsilon
Here, X_1, X_2, …, X_n represent the independent variables, Beta_1, Beta_2, …, Beta_n are their respective coefficients, and epsilon is the error term.
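With several independent variables, the coefficients are usually estimated by solving a least-squares problem in matrix form. The following is a minimal sketch using NumPy's `np.linalg.lstsq`; the synthetic data, the "true" coefficients, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data: 200 rows, 3 independent variables (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_intercept = 4.0
true_beta = np.array([1.5, -2.0, 0.5])
y = true_intercept + X @ true_beta  # noise-free so the fit is exact

# Prepend a column of ones so Beta_0 is estimated with the other coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimizes the sum of squared errors
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print('Estimated [Beta_0, Beta_1, Beta_2, Beta_3]:', np.round(beta_hat, 3))
```

Because the toy data contains no noise, the estimates recover the coefficients used to generate it; with real data they would only approximate the underlying relationship.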
Linear regression allows us to quantify the relationship between the dependent variable and the independent variable(s): each coefficient estimates how much the dependent variable changes for a one-unit change in the corresponding independent variable, holding the others fixed. This understanding is crucial for making predictions, drawing insights, and making informed decisions based on data.
Now, let's implement simple linear regression and multiple linear regression using sklearn. In this article, I'll use the California House Price Dataset from Kaggle.
To demonstrate simple linear regression using the sklearn library, we'll use a California house price prediction dataset from Kaggle.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset (assumes the Kaggle CSV is saved locally under this name)
data = pd.read_csv('california_housing.csv')
# Explore the dataset
print(data.head())
For this demonstration, we'll use the 'median_income' column as the independent variable (X) and the 'median_house_value' column as the dependent variable (y).
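# Select the feature (X) and the target (y)
X = data[['median_income']]
y = data['median_house_value']
# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)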
# Initialize the model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Use the trained model to make predictions on the test set
y_pred = model.predict(X_test)
# Calculate the Mean Squared Error (MSE) to evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# -------------- OUTPUT -------------- #
Mean Squared Error: 7091157771.76555
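After fitting, the estimated Beta_0 and Beta_1 can be read from the model's `intercept_` and `coef_` attributes. The sketch below shows this on made-up, noise-free data rather than the housing data, so the expected values are known in advance; `X_toy`, `y_toy`, and `toy_model` are names invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3 + 2x exactly (not the housing dataset)
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3.0 + 2.0 * X_toy.ravel()

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# intercept_ corresponds to Beta_0, coef_[0] to Beta_1
print(f'Intercept (Beta_0): {toy_model.intercept_:.3f}')
print(f'Slope (Beta_1): {toy_model.coef_[0]:.3f}')
```

Applied to the model trained above, `model.coef_[0]` would be the estimated change in 'median_house_value' per one-unit change in 'median_income'.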
Multiple linear regression extends the concept of simple linear regression to incorporate multiple independent variables. This allows us to model relationships between a dependent variable and two or more independent variables. Let's use the California housing dataset for this demonstration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset (assumes the Kaggle CSV is saved locally under this name)
data = pd.read_csv('california_housing.csv')
# Explore the dataset
print(data.head())
For this demonstration, we'll use 'median_income', 'total_rooms', and 'population' as the independent variables (X) and 'median_house_value' as the dependent variable (y).
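# Select the three features (X) and the target (y)
X = data[['median_income', 'total_rooms', 'population']]
y = data['median_house_value']
# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)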
# Initialize the model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Use the trained model to make predictions on the test set
y_pred = model.predict(X_test)
# Calculate the Mean Squared Error (MSE) to evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# ---------------- OUTPUT -------------- #
Mean Squared Error: 7056856633.523101
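MSE values this large are hard to read because they are in squared units of the target. Taking the square root gives the Root Mean Squared Error (RMSE), which is in the same units as 'median_house_value'. The short sketch below does that for the two MSE values printed above; the variable names are chosen for this example.

```python
import math

# MSE values copied from the two outputs above
mse_simple = 7091157771.76555     # median_income only
mse_multiple = 7056856633.523101  # median_income, total_rooms, population

rmse_simple = math.sqrt(mse_simple)
rmse_multiple = math.sqrt(mse_multiple)

# Fraction by which the MSE dropped after adding the two extra features
relative_drop = (mse_simple - mse_multiple) / mse_simple

print(f'RMSE (simple):   {rmse_simple:,.0f}')
print(f'RMSE (multiple): {rmse_multiple:,.0f}')
print(f'Relative MSE reduction: {relative_drop:.2%}')
```

Both RMSE values come out near 84,000, and the MSE falls by less than one percent, which suggests that 'total_rooms' and 'population' add little beyond 'median_income' in this particular setup.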
I used the Mean Squared Error (MSE) metric to evaluate both models, but there are several other evaluation metrics you can use for a model like this. Let's now look at the most common ones.
Evaluating the performance of a linear regression model is crucial to understanding how well it fits the data and how accurately it predicts. Common metrics for this purpose are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.
Choosing the appropriate evaluation metric depends on the specific problem and the importance of different types of errors. For instance, in some cases, minimizing the impact of large errors (MSE/RMSE) might be more critical, while in others, a clear and intuitive interpretation (MAE) might be preferred. R-squared provides a broad overview of the model's fit, but it's important to consider other metrics in conjunction with it.
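To make the differences between these metrics concrete, here is a minimal sketch that computes MAE, MSE, RMSE, and R-squared from their standard definitions in plain Python. The helper name `regression_metrics` and the toy values are assumptions for illustration, not results from the housing data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # average absolute error
    mse = sum(e ** 2 for e in errors) / n   # average squared error
    rmse = math.sqrt(mse)                   # back in the target's units
    y_mean = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)              # unexplained variation
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}


# Toy values, invented for illustration only
y_true_toy = [3.0, 5.0, 2.5, 7.0]
y_pred_toy = [2.5, 5.0, 4.0, 8.0]

metrics = regression_metrics(y_true_toy, y_pred_toy)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```

In practice you would not write these by hand: `sklearn.metrics` provides `mean_absolute_error` and `r2_score` alongside the `mean_squared_error` function used earlier, and they take `y_test` and `y_pred` in the same way.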